[
https://issues.apache.org/jira/browse/HBASE-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yubao Liu updated HBASE-25357:
------------------------------
Description:
Currently, the Spark HBase connector uses a `String` to specify regionStart and
regionEnd, but we often have serialized binary row keys. I made a small patch
at [https://github.com/apache/hbase-connectors/pull/72/files] to always treat
the `String` as ISO_8859_1, so we can put raw bytes into the String object and
get them back unchanged.
This has a drawback: if your row key really is a Unicode string with characters
beyond the ISO_8859_1 charset, you must convert it to UTF-8 encoded bytes and
then wrap those bytes in an ISO_8859_1 string. This is a limitation of the
Spark option interface, which accepts only a string-to-string map.
{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.spark.sql.SaveMode;

// Wrap the UTF-8 bytes of each split key in an ISO_8859_1 String.
df.write()
    .format("org.apache.hadoop.hbase.spark")
    .option(HBaseTableCatalog.tableCatalog(), catalog)
    .option(HBaseTableCatalog.newTable(), 5)
    .option(HBaseTableCatalog.regionStart(),
        new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
    .option(HBaseTableCatalog.regionEnd(),
        new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
    .mode(SaveMode.Append)
    .save();
{code}
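The round trip that the description relies on can be checked in isolation. The following is a minimal sketch, not code from the patch: the class `BinaryRowKeyOption` and its `encode`/`decode` methods are hypothetical names used only for illustration, and `decode` assumes the connector reads the option back with ISO_8859_1 as described above. It only uses `java.nio.charset.StandardCharsets` from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of hbase-connectors.
final class BinaryRowKeyOption {

    // Wrap raw bytes in a String: ISO_8859_1 maps each byte to one char (U+0000..U+00FF).
    static String encode(byte[] rowKey) {
        return new String(rowKey, StandardCharsets.ISO_8859_1);
    }

    // Recover the bytes, assuming the reader also uses ISO_8859_1.
    static byte[] decode(String optionValue) {
        return optionValue.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] binaryKey = {(byte) 0x00, (byte) 0x7F, (byte) 0x80, (byte) 0xFF};

        // ISO_8859_1 round trip: every byte survives.
        System.out.println(Arrays.equals(binaryKey, decode(encode(binaryKey)))); // true

        // UTF-8 round trip: 0x80 and 0xFF are malformed input and get replaced.
        byte[] viaUtf8 = new String(binaryKey, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(binaryKey, viaUtf8)); // false

        // The drawback: text beyond ISO_8859_1 has to be UTF-8 encoded first.
        String text = "\u4f60\u597d"; // the regionStart text from the example above
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(utf8, decode(encode(utf8)))); // true

        // Passing the text directly loses it: unmappable chars become '?'.
        System.out.println(new String(decode(text), StandardCharsets.ISO_8859_1)); // ??
    }
}
```

The last line shows why a plain Unicode string cannot be passed as-is once the option is interpreted as ISO_8859_1: characters outside that charset are replaced during decoding, so the region boundary would no longer match the intended row key.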
was:
Currently, the Spark HBase connector uses a `String` to specify regionStart and
regionEnd, but we often have serialized binary row keys. I made a small patch
at [https://github.com/apache/hbase-connectors/pull/72/files] to always treat
the `String` as ISO_8859_1, so we can put raw bytes into the String object and
get them back unchanged.
This has a drawback: if your row key really is a UTF-8 string, you must
convert it to UTF-8 encoded bytes and then wrap those bytes in an ISO_8859_1
string. This is a limitation of the Spark option interface, which accepts only
a string-to-string map.
{code:java}
import java.nio.charset.StandardCharsets;
df.write()
    .format("org.apache.hadoop.hbase.spark")
    .option(HBaseTableCatalog.tableCatalog(), catalog)
    .option(HBaseTableCatalog.newTable(), 5)
    .option(HBaseTableCatalog.regionStart(),
        new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
    .option(HBaseTableCatalog.regionEnd(),
        new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
    .mode(SaveMode.Append)
    .save();
{code}
> allow specifying binary row key range to pre-split regions
> ----------------------------------------------------------
>
> Key: HBASE-25357
> URL: https://issues.apache.org/jira/browse/HBASE-25357
> Project: HBase
> Issue Type: Improvement
> Components: spark
> Reporter: Yubao Liu
> Priority: Major