[
https://issues.apache.org/jira/browse/DRILL-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016674#comment-16016674
]
Jinfeng Ni commented on DRILL-5480:
-----------------------------------
Another case: an HBase table with 3 regions, where the last region is empty. The
HBase API returns an empty result from the empty region, which also forces
Drill to inject a nullable-int placeholder column, leading to schema change
conflicts in UnorderedReceiver.
{code}
create 'customer2', 'orders', {SPLITS => ['3', '6']}
put 'customer2', '1', 'orders:id', '100'
put 'customer2', '5', 'orders:id', '200'
alter session set `planner.enable_broadcast_join`=false;
select * from hbase.customer2 c, cp.`tpch/orders.parquet` o where
cast(c.orders.id as bigint) = o.o_orderkey and c.orders.id <= '500';
{code}
{code}
select * from hbase.customer2 c, cp.`tpch/orders.parquet` o where
cast(c.orders.id as bigint) = o.o_orderkey and c.orders.id <= '500';
Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector.
Expected vector class of org.apache.drill.exec.vector.NullableIntVector but was
holding vector class org.apache.drill.exec.vector.complex.MapVector, field=
orders(MAP:REQUIRED) [id(VARBINARY:OPTIONAL) [$bits$(UINT1:REQUIRED),
id(VARBINARY:OPTIONAL) [$offsets$(UINT4:REQUIRED)]]]
Fragment 2:0
{code}
Note that the filter {{c.orders.id <= '500'}} is always true for the first 2
regions, so it does not prune any rows.
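The split-key arithmetic behind this layout can be sketched as follows. This is a minimal illustrative model of the standard HBase rule that each split key is the start key of a region; it is not actual HBase or Drill code:
{code}
// Minimal sketch: which region a row key lands in, given the split keys.
// For SPLITS => ['3', '6'] the regions are (-inf,'3'), ['3','6'), ['6',+inf).
public class RegionLookup {
    static int regionFor(String rowKey, String[] splits) {
        int region = 0;
        for (String split : splits) {
            if (rowKey.compareTo(split) >= 0) {
                region++;  // rowKey is at or past this region's start key
            }
        }
        return region;
    }

    public static void main(String[] args) {
        String[] splits = {"3", "6"};
        // Row '1' lands in region 0 and row '5' in region 1, so the last
        // region ['6',+inf) holds no rows; its scanner returns an empty
        // result, and Drill substitutes the nullable-int placeholder.
        System.out.println(regionFor("1", splits));  // 0
        System.out.println(regionFor("5", splits));  // 1
    }
}
{code}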
> Empty batch returning from HBase may cause SchemaChangeException or incorrect
> query result
> -----------------------------------------------------------------------------------------
>
> Key: DRILL-5480
> URL: https://issues.apache.org/jira/browse/DRILL-5480
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
>
> The following repro was provided by [~haozhu].
> 1. Create an HBase table with 4 regions
> {code}
> create 'myhbase', 'cf1','cf2', {SPLITS => ['a', 'b', 'c']}
> put 'myhbase','a','cf1:col1','somedata'
> put 'myhbase','b','cf1:col2','somedata'
> put 'myhbase','c','cf2:col1','somedata'
> {code}
> One region has 'cf1:col1'. One region has column family 'cf1' but does not
> have 'col1' under 'cf1'. One region has only column family 'cf2'. And the
> remaining region is completely empty.
> 2. Prepare a csv file.
> {code}
> select * from dfs.tmp.`joinhbase.csv`;
> +-------------------+
> | columns |
> +-------------------+
> | ["1","somedata"] |
> | ["2","somedata"] |
> | ["3","somedata"] |
> +-------------------+
> {code}
> Now run the following query on drill 1.11.0-SNAPSHOT:
> {code}
> select cast(H.row_key as varchar(10)) as keyCol, CONVERT_FROM(H.cf1.col1,
> 'UTF8') as col1
> from
> hbase.myhbase H JOIN dfs.tmp.`joinhbase.csv` C
> ON CONVERT_FROM(H.cf1.col1, 'UTF8')= C.columns[1]
> ;
> {code}
> The correct query result should be:
> {code}
> +---------+-----------+
> | keyCol | col1 |
> +---------+-----------+
> | a | somedata |
> | a | somedata |
> | a | somedata |
> +---------+-----------+
> {code}
> Turn off broadcast join, and we will see either a SchemaChangeException or an
> incorrect result, randomly: in the same session, the same query may hit the
> SchemaChangeException in one run and return an incorrect result in the next.
> {code}
> alter session set `planner.enable_broadcast_join`=false;
> {code}
> {code}
> select cast(H.row_key as varchar(10)) as keyCol, CONVERT_FROM(H.cf1.col1,
> 'UTF8') as col1
> from
> hbase.myhbase H JOIN dfs.tmp.`joinhbase.csv` C
> ON CONVERT_FROM(H.cf1.col1, 'UTF8') = C.columns[1]
> ;
> Error: SYSTEM ERROR: SchemaChangeException: Hash join does not support schema
> changes
> {code}
> {code}
> +---------+-------+
> | keyCol | col1 |
> +---------+-------+
> +---------+-------+
> No rows selected (0.302 seconds)
> {code}
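Both failures reported above boil down to the same mismatch: one fragment ships batches whose column has a real type (e.g. a MAP), while a fragment that scanned only an empty region ships a nullable-int placeholder for the same column. A minimal model of that conflict (illustrative only, not Drill's actual vector code; the class and enum names are hypothetical):
{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: merging per-fragment schemas, where an empty scan
// reported a NULLABLE_INT placeholder for a column that is really a MAP.
public class SchemaMergeSketch {
    enum VectorType { NULLABLE_INT, MAP }

    static Map<String, VectorType> merge(List<Map<String, VectorType>> schemas) {
        Map<String, VectorType> merged = new HashMap<>();
        for (Map<String, VectorType> schema : schemas) {
            for (Map.Entry<String, VectorType> e : schema.entrySet()) {
                VectorType prev = merged.putIfAbsent(e.getKey(), e.getValue());
                if (prev != null && prev != e.getValue()) {
                    throw new IllegalStateException("schema change on column "
                            + e.getKey() + ": " + prev + " vs " + e.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Fragment that read real rows: 'cf1' is a MAP column.
        Map<String, VectorType> realBatch = Map.of("cf1", VectorType.MAP);
        // Fragment that scanned only the empty region: placeholder type.
        Map<String, VectorType> emptyBatch = Map.of("cf1", VectorType.NULLABLE_INT);
        merge(List.of(realBatch, emptyBatch));  // throws IllegalStateException
    }
}
{code}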
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)