Paul Rogers created DRILL-5826:
----------------------------------
Summary: UnorderedReceiverBatch fails to detect a schema change
Key: DRILL-5826
URL: https://issues.apache.org/jira/browse/DRILL-5826
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.11.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Run the following HBase query:
{code}
select * from `hbase`.browser_action2 a
{code}
Table is defined as:
{code}
> create 'browser_action2', 'v', {SPLITS =>
> ['0','1','2','3','4','5','6','7','8','9']}
...
> scan 'browser_action2'
ROW COLUMN+CELL
1 column=v:e0, timestamp=1506560555979, value=abc1
2 column=v:e0, timestamp=1506560564807, value=abc2
{code}
Step through {{UnorderedReceiverBatch}} with a parallelization of 1.
Observe the following (the behavior is nondeterministic):
* The first batch has schema (row_key, v), where v is an empty map
(corresponding to a column family), but no data (zero rows).
* Because the first batch has columns, it is sent downstream with
{{OK_NEW_SCHEMA}}.
* The second batch has schema (row_key, v{e0}), where v is a map with column e0
(corresponding to a column family with one column) and one row.
* The code loads the batch, asking the batch itself whether it has a new schema.
* The batch does not report a new schema, so the load returns false.
* The {{UnorderedReceiverBatch}} returns {{OK}}, indicating to the downstream
operator that the second batch has the same schema as the first (which, in this
case, is not true).
Code in question:
{code}
final boolean schemaChanged = batchLoader.load(rbd, batch.getBody());
{code}
In fact, no sender has visibility into the schemas of the other senders, and
the order in which batches are received is undefined. Therefore, an input batch
has no way of knowing whether it has the same schema as the previous output
batch.
The obvious, correct logic is to compare the incoming batch schema with the
current receiver schema, and send {{OK}} or {{OK_NEW_SCHEMA}} based on the
result of that comparison.
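The comparison described above could look roughly like the following sketch.
It is not Drill code: {{Outcome}}, {{ReceiverSchemaTracker}} and {{accept}} are
hypothetical names, and a schema is modeled as an ordered list of field paths
purely for illustration. The point is only that the receiver, not the incoming
batch, decides whether the schema changed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the receiver's result codes (not Drill's own type). */
enum Outcome { OK, OK_NEW_SCHEMA }

/**
 * Hypothetical helper that remembers the schema most recently sent downstream.
 * A schema is simplified here to an ordered list of field paths, e.g.
 * ["row_key", "v"] or ["row_key", "v", "v.e0"].
 */
final class ReceiverSchemaTracker {
    // Schema of the last batch passed downstream; null before the first batch.
    private List<String> currentSchema = null;

    /** Compares the incoming schema with the receiver's current schema. */
    Outcome accept(List<String> incomingSchema) {
        // List.equals is order-sensitive and returns false when compared to null,
        // so the very first batch always yields OK_NEW_SCHEMA.
        if (incomingSchema.equals(currentSchema)) {
            return Outcome.OK;
        }
        // Keep a defensive copy so later changes by the caller cannot alter it.
        currentSchema = new ArrayList<>(incomingSchema);
        return Outcome.OK_NEW_SCHEMA;
    }
}
```

Under these assumptions, the scenario in this report (an empty (row_key, v)
batch followed by a (row_key, v{e0}) batch) would produce {{OK_NEW_SCHEMA}}
twice, regardless of which sender each batch came from or what the batch
reports about itself.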