[
https://issues.apache.org/jira/browse/DRILL-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated DRILL-5825:
-------------------------------
Description:
The HBase storage plugin has the ability to expand wildcard columns at plan
time as part of projection push-down. HBase column names are encoded in binary.
Drill decodes the names using the default character set of the machine running
Drill. While this works fine for ASCII names, there is no reason to believe it
will work for non-ASCII names if the Drill machine's default character set
differs from the encoding in effect where the column name was defined.
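To make the failure mode concrete, here is a minimal sketch in plain Java with no Drill or HBase dependency. The class name {{CharsetMismatchDemo}} and the sample name are hypothetical. It assumes the name was written as UTF-8 and shows what happens if a reader decodes the same bytes as ISO-8859-1, standing in for a mismatched platform default:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Drill or HBase.
final class CharsetMismatchDemo {

  // Decode the same bytes with an explicitly chosen charset.
  static String decode(byte[] raw, Charset charset) {
    return new String(raw, charset);
  }

  public static void main(String[] args) {
    // A name with one non-ASCII character (e with acute accent).
    String original = "caf\u00e9";

    // Assumption: the writer encoded the name as UTF-8 (5 bytes here).
    byte[] raw = original.getBytes(StandardCharsets.UTF_8);

    String asUtf8 = decode(raw, StandardCharsets.UTF_8);
    String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

    System.out.println(asUtf8.equals(original));   // true
    System.out.println(asLatin1.equals(original)); // false
    System.out.println(asLatin1.length());         // 5 (one char per byte)
  }
}
```

A purely ASCII name would compare equal in both cases, which is why the problem stays hidden until a non-ASCII name appears.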
Offending code:
{code}
public class DrillHBaseTable extends DrillTable implements DrillHBaseConstants {
...
fieldNameList.add(Bytes.toString(family));
{code}
From {{String}} Javadoc:
bq. String(byte[] bytes): Constructs a new String by decoding the specified
array of bytes using the *platform's* default charset.
Note, however, that the above is inconsistent with how names are handled in the
{{HBaseRecordReader}} class:
{code}
final String familyName = new String(familyEntry.getKey(),
    StandardCharsets.UTF_8);
{code}
But, in the very same file, we use the system encoding again:
{code}
protected Collection<SchemaPath> transformColumns(Collection<SchemaPath> columns) {
  ...
  NameSegment root = column.getRootSegment();
  // getBytes() with no argument encodes using the platform default charset
  byte[] family = root.getPath().getBytes();
  ...
  hbaseScan.addFamily(family);
{code}
Elsewhere in {{HBaseRecordReader}}, we're back to using the platform encoding:
{code}
final MapVector mv = getOrCreateFamilyVector(
    new String(familyArray, familyOffset, familyLength), true);
...
final NullableVarBinaryVector v = getOrCreateColumnVector(mv,
    new String(qualifierArray, qualifierOffset, qualifierLength));
{code}
Bottom line: we should determine the encoding that HBase uses, and use that,
rather than hoping that the Drillbit machine's default encoding happens to be
the right one.
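One possible shape for a fix is to route every name conversion through a single helper that names its charset explicitly. The sketch below is illustrative only: {{ColumnNameCodec}} and {{NAME_CHARSET}} are hypothetical names, not existing Drill or HBase classes, and the choice of UTF-8 is an assumption (it matches what the {{familyName}} code above already presumes) that would need to be confirmed against HBase:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; class and member names are illustrative only.
final class ColumnNameCodec {

  // Assumption: names are stored as UTF-8. Confirm against HBase first.
  static final Charset NAME_CHARSET = StandardCharsets.UTF_8;

  private ColumnNameCodec() {}

  // Replaces calls such as path.getBytes(), which use the platform default.
  static byte[] encode(String name) {
    return name.getBytes(NAME_CHARSET);
  }

  // Replaces calls such as new String(bytes), which use the platform default.
  static String decode(byte[] raw) {
    return new String(raw, NAME_CHARSET);
  }

  // Replaces new String(array, offset, length) for names held in a shared buffer.
  static String decode(byte[] array, int offset, int length) {
    return new String(array, offset, length, NAME_CHARSET);
  }
}
```

With a helper of this kind, the planner and the record reader would encode and decode names the same way regardless of the Drillbit's default charset, and changing the encoding later would touch one constant.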
was:
The HBase storage plugin has the ability to expand wildcard columns at plan
time as part of projection push-down. HBase column names are encoded in binary.
Drill decodes the names using the default character set of the machine running
Drill. While this works fine for ASCII names, there is no reason to believe it
will work for non-ASCII names if the machine running Drill is different than
the machine on which the column name was defined.
Offending code:
{code}
public class DrillHBaseTable extends DrillTable implements DrillHBaseConstants {
...
fieldNameList.add(Bytes.toString(family));
{code}
From {{String}} Javadoc:
bq. String(byte[] bytes): Constructs a new String by decoding the specified
array of bytes using the *platform's* default charset.
Note, however, that the above is inconsistent with how names are handled in the
{{HBaseRecordReader}} class:
{code}
final String familyName = new String(familyEntry.getKey(),
StandardCharsets.UTF_8);
{code}
But, in the very same file, we use the system encoding again:
{code}
protected Collection<SchemaPath> transformColumns(Collection<SchemaPath>
columns) {
...
NameSegment root = column.getRootSegment();
byte[] family = root.getPath().getBytes();
...
hbaseScan.addFamily(family);
{code}
> HBase storage plugin assumes local system encoding for column names
> -------------------------------------------------------------------
>
> Key: DRILL-5825
> URL: https://issues.apache.org/jira/browse/DRILL-5825
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.11.0
> Reporter: Paul Rogers