[ 
https://issues.apache.org/jira/browse/HBASE-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yubao Liu updated HBASE-25357:
------------------------------
    Description: 
Currently, the Spark HBase connector uses `String` to specify regionStart and 
regionEnd, but row keys are often serialized binary data. I made a small patch 
at [https://github.com/apache/hbase-connectors/pull/72/files] to always interpret 
the `String` as ISO_8859_1, so raw bytes can be put into the String object and 
recovered unchanged.

This has a drawback: if your row key really is a Unicode string beyond the 
ISO_8859_1 charset, you must convert it to UTF-8 encoded bytes and then wrap 
those bytes in an ISO_8859_1 string. This is a limitation of the Spark option 
interface, which only allows a string-to-string map.
{code:java}
import java.nio.charset.StandardCharsets;

df.write()
  .format("org.apache.hadoop.hbase.spark")
  .option(HBaseTableCatalog.tableCatalog(), catalog)
  .option(HBaseTableCatalog.newTable(), 5)
  .option(HBaseTableCatalog.regionStart(),
          new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
  .option(HBaseTableCatalog.regionEnd(),
          new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
  .mode(SaveMode.Append)
  .save();
{code}
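The byte round trip the patch relies on can be sketched in plain Java, outside the connector. ISO_8859_1 maps every byte value 0x00-0xFF to a distinct char, so wrapping and unwrapping is lossless; the class name below is just for illustration.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Iso88591RoundTrip {
    public static void main(String[] args) {
        // Arbitrary binary row key bytes, including values outside ASCII.
        byte[] rowKey = new byte[] {0x00, 0x01, (byte) 0x80, (byte) 0xFF, 0x7F};

        // ISO_8859_1 assigns each byte a unique char, so the bytes
        // survive the String round trip unchanged.
        String wrapped = new String(rowKey, StandardCharsets.ISO_8859_1);
        byte[] recovered = wrapped.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(rowKey, recovered)); // prints "true"
    }
}
{code}
This is why the region boundary can travel through Spark's string-only option map without corruption, as long as both sides agree on ISO_8859_1.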

  was:
Currently, the Spark HBase connector uses `String` to specify regionStart and 
regionEnd, but row keys are often serialized binary data. I made a small patch 
at [https://github.com/apache/hbase-connectors/pull/72/files] to always interpret 
the `String` as ISO_8859_1, so raw bytes can be put into the String object and 
recovered unchanged.

This has a drawback: if your row key really is a UTF-8 string, you must convert 
it to UTF-8 encoded bytes and then wrap those bytes in an ISO_8859_1 string. 
This is a limitation of the Spark option interface, which only allows a 
string-to-string map.
{code:java}
import java.nio.charset.StandardCharsets;

df.write()
  .format("org.apache.hadoop.hbase.spark")
  .option(HBaseTableCatalog.tableCatalog(), catalog)
  .option(HBaseTableCatalog.newTable(), 5)
  .option(HBaseTableCatalog.regionStart(),
          new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
  .option(HBaseTableCatalog.regionEnd(),
          new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
  .mode(SaveMode.Append)
  .save();
{code}


> allow specifying binary row key range to pre-split regions
> ----------------------------------------------------------
>
>                 Key: HBASE-25357
>                 URL: https://issues.apache.org/jira/browse/HBASE-25357
>             Project: HBase
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Yubao Liu
>            Priority: Major
>
> Currently, the Spark HBase connector uses `String` to specify regionStart and 
> regionEnd, but row keys are often serialized binary data. I made a small 
> patch at [https://github.com/apache/hbase-connectors/pull/72/files] to always 
> interpret the `String` as ISO_8859_1, so raw bytes can be put into the String 
> object and recovered unchanged.
> This has a drawback: if your row key really is a Unicode string beyond the 
> ISO_8859_1 charset, you must convert it to UTF-8 encoded bytes and then 
> wrap those bytes in an ISO_8859_1 string. This is a limitation of the Spark 
> option interface, which only allows a string-to-string map.
> {code:java}
> import java.nio.charset.StandardCharsets;
> df.write()
>   .format("org.apache.hadoop.hbase.spark")
>   .option(HBaseTableCatalog.tableCatalog(), catalog)
>   .option(HBaseTableCatalog.newTable(), 5)
>   .option(HBaseTableCatalog.regionStart(),
>           new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
>   .option(HBaseTableCatalog.regionEnd(),
>           new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
>   .mode(SaveMode.Append)
>   .save();
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
