[
https://issues.apache.org/jira/browse/HBASE-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HBASE-25357:
-----------------------------------
Labels: pull-request-available (was: )
> allow specifying binary row key range to pre-split regions
> ----------------------------------------------------------
>
> Key: HBASE-25357
> URL: https://issues.apache.org/jira/browse/HBASE-25357
> Project: HBase
> Issue Type: Improvement
> Components: spark
> Reporter: Yubao Liu
> Priority: Major
> Labels: pull-request-available
>
> Currently, the Spark HBase connector uses a `String` to specify regionStart and
> regionEnd, but we often have serialized binary row keys. I made a small
> patch at [https://github.com/apache/hbase-connectors/pull/72/files] that always
> treats the `String` as ISO_8859_1, so we can put raw bytes into the String
> object and get them back unchanged.
> This has a drawback: if your row keys really are Unicode strings beyond the
> ISO_8859_1 charset, you must convert them to UTF-8 encoded bytes and then
> wrap those bytes in an ISO_8859_1 string. This is a limitation of the Spark
> option interface, which only allows a string-to-string map.
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> df.write()
>     .format("org.apache.hadoop.hbase.spark")
>     .option(HBaseTableCatalog.tableCatalog(), catalog)
>     .option(HBaseTableCatalog.newTable(), 5)
>     // Encode each key as UTF-8 bytes, then wrap the bytes in an ISO_8859_1 String.
>     .option(HBaseTableCatalog.regionStart(),
>         new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
>     .option(HBaseTableCatalog.regionEnd(),
>         new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
>     .mode(SaveMode.Append)
>     .save();
> {code}
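The round trip the description relies on can be checked in isolation. The following is a minimal sketch using only the JDK. The class name `BinaryRowKeyRoundTrip` and the helpers `wrap` and `unwrap` are hypothetical names chosen for illustration and are not part of the connector or the patch. It shows that ISO_8859_1 maps every byte value to exactly one character and back, whereas decoding arbitrary bytes as UTF-8 is lossy.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryRowKeyRoundTrip {

    // Hypothetical helper: carry raw key bytes inside a String, one char per byte.
    static String wrap(byte[] rawKey) {
        return new String(rawKey, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: recover the original bytes from a wrapped String.
    static byte[] unwrap(String wrapped) {
        return wrapped.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A binary key containing bytes that are not valid UTF-8 on their own.
        byte[] binaryKey = {(byte) 0x00, (byte) 0x7f, (byte) 0x80, (byte) 0xff};

        // ISO_8859_1 round trip preserves every byte.
        boolean latin1Lossless = Arrays.equals(unwrap(wrap(binaryKey)), binaryKey);

        // UTF-8 round trip replaces the invalid bytes with U+FFFD, so the bytes change.
        byte[] viaUtf8 = new String(binaryKey, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        boolean utf8Lossless = Arrays.equals(viaUtf8, binaryKey);

        // A Unicode key ("你好", written with escapes) goes through UTF-8 first, then is wrapped.
        String unicodeKey = "\u4f60\u597d";
        String wrapped = wrap(unicodeKey.getBytes(StandardCharsets.UTF_8));
        String recovered = new String(unwrap(wrapped), StandardCharsets.UTF_8);

        System.out.println("ISO_8859_1 lossless: " + latin1Lossless);
        System.out.println("UTF-8 lossless: " + utf8Lossless);
        System.out.println("Unicode key recovered: " + recovered.equals(unicodeKey));
    }
}
```

Note that the wrapped form of a Unicode key has one character per UTF-8 byte, so it is longer than the original string and is not meant to be human-readable.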