[ 
https://issues.apache.org/jira/browse/DRILL-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187208#comment-16187208
 ] 

ASF GitHub Bot commented on DRILL-5830:
---------------------------------------

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/968

    DRILL-5830: Resolve regressions to MapR DB from DRILL-5546

    DRILL-5546 fixed a wide variety of "empty batch" problems, but it 
introduced regressions in the HBase and MapR-DB binary storage plugins. This 
PR refines the fix to resolve those regressions.
    
    Prior to DRILL-5546, HBase provided a project push-down rule to expand 
wildcard columns. However, a bug in the push-down rule kept it from working 
correctly. DRILL-5546 fixed that rule, but it also added explicit wildcard 
expansion for HBase. That expansion turned out to be redundant, so it is 
backed out in this PR.
    
    Wildcard expansion in the HBase storage plugin was meant to overcome the 
schema change conflict that occurs with empty regions. In such regions, we get 
the row key and the column family as an empty map. In regions with data, we get 
the row key and a non-empty map for the column family. Examples:
    
    * Empty: (row_key, cf{})
    * Non-empty: (row_key, cf{col1, col2})
    
    Where cf is a column family and col1, col2 are columns.
    
    It turns out that the receivers were getting confused: the 
`RecordBatchLoader` class treated empty and non-empty maps as the same 
schema. This behavior was mentioned in DRILL-5546:
    
    > In HBase a column family always has map type, and a non-rowkey column 
always has nullable varbinary type, this ensures that HBaseRecordReader across 
different HBase regions will have the same top level schema, even if the region 
is empty or prune all the rows due to filter pushdown optimization. In other 
words, we will not see different top level schema from different 
HBaseRecordReader for the same table.
    
    The problem is, a difference in map content really is a schema change, so 
we need to detect and report it. This PR makes that change.
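
    As a minimal sketch of the distinction (not the actual `RecordBatchLoader` 
code), the following compares two batch schemas with and without looking 
inside maps. The `Field` class, the method names, and the type strings are 
hypothetical stand-ins invented for illustration, and the comparison assumes 
columns arrive in the same order in both batches:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapSchemaSketch {

    /** Hypothetical stand-in for column metadata; not a Drill class. */
    static final class Field {
        final String name;
        final String type;          // illustrative type label, e.g. "VARBINARY" or "MAP"
        final List<Field> children; // members of a map column; empty otherwise

        Field(String name, String type, Field... children) {
            this.name = name;
            this.type = type;
            this.children = new ArrayList<>(Arrays.asList(children));
        }
    }

    /** Compares only top-level names and types, ignoring map contents. */
    static boolean sameTopLevel(List<Field> a, List<Field> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            Field x = a.get(i);
            Field y = b.get(i);
            if (!x.name.equals(y.name) || !x.type.equals(y.type)) {
                return false;
            }
        }
        return true;
    }

    /** Compares names and types, then recurses into the members of each map. */
    static boolean sameDeep(List<Field> a, List<Field> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (int i = 0; i < a.size(); i++) {
            Field x = a.get(i);
            Field y = b.get(i);
            if (!x.name.equals(y.name) || !x.type.equals(y.type)) {
                return false;
            }
            if (!sameDeep(x.children, y.children)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // (row_key, cf{}) from an empty region
        List<Field> empty = Arrays.asList(
            new Field("row_key", "VARBINARY"),
            new Field("cf", "MAP"));
        // (row_key, cf{col1, col2}) from a region with data
        List<Field> nonEmpty = Arrays.asList(
            new Field("row_key", "VARBINARY"),
            new Field("cf", "MAP",
                new Field("col1", "VARBINARY"),
                new Field("col2", "VARBINARY")));

        System.out.println(sameTopLevel(empty, nonEmpty)); // true: no change seen
        System.out.println(sameDeep(empty, nonEmpty));     // false: schema change
    }
}
```

    The top-level check reports the two example batches as identical, which 
is the confusion described above; the recursive check reports a schema change.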
    
    Now, as it turns out, changes made by DRILL-5546 to the top-level project 
operator gracefully remove the empty batches (with empty maps), passing along 
just the non-empty batches (with non-empty maps).
    
    In short, this PR:
    
    * Backs out the HBase-specific changes made by DRILL-5546,
    * Fixes the schema change issue in `RecordBatchLoader`,
    * Adds unit tests for the fixes to `RecordBatchLoader`,
    * Does a number of minor code cleanups.
    
    The result is that the HBase problems that DRILL-5546 solved are still 
solved, but the regressions to MapR DB binary are also fixed.
    
    We propose to modify the MapR DB binary storage plugin to do projection 
push-down the same way as is done in HBase, but that will be a separate PR.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5830

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/968.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #968
    
----
commit 6cb5e98c6cccfc519cd7413248685ddc42af96b4
Author: Paul Rogers <[email protected]>
Date:   2017-09-28T16:49:38Z

    Back out HBase changes

commit 333bd1b36e72950d926c899b9050ddfae09fc817
Author: Paul Rogers <[email protected]>
Date:   2017-09-30T01:14:55Z

    Code cleanup

commit 8baca87708af2a38c5748a1a0435312e34e90903
Author: Paul Rogers <[email protected]>
Date:   2017-09-30T01:18:16Z

    Test utilities

commit 2ce7bf76dc37393b0326e23db99706f2abea7f5c
Author: Paul Rogers <[email protected]>
Date:   2017-09-30T01:18:48Z

    Fix for DRILL-5829

commit f660731df0168304456976e22687459b41d35546
Author: Paul Rogers <[email protected]>
Date:   2017-09-30T20:58:22Z

    Code cleanup

----


> Resolve regressions to MapR DB from DRILL-5546
> ----------------------------------------------
>
>                 Key: DRILL-5830
>                 URL: https://issues.apache.org/jira/browse/DRILL-5830
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.12.0
>
>
> DRILL-5546 added a number of fixes for empty batches. One part of the fix was 
> for HBase. Key changes:
> * Add code to expand wildcards in the planner. (i.e. SELECT *)
> * Remove support for wildcards in the HBase record reader.
> As noted in DRILL-5775, this change had the effect of breaking support for 
> MapR-DB binary (which is API compatible with HBase). DRILL-5775 addresses 
> this by expanding wildcards in the planner for MapR DB, as was done for 
> HBase in DRILL-5546.
> Unfortunately, this change introduced other regressions into the code as 
> described by DRILL-5706.
> Investigation of those issues revealed that we should back out the original 
> DRILL-5546 changes and go down a different route.
> As it turns out, HBase already had a project push-down rule that expanded 
> wildcards. However, that rule didn't work correctly some of the time. 
> DRILL-5546 fixed that bug, ensuring that wildcards are expanded (at least in 
> the cases tested for this ticket.)
> The actual issue turned out to be a bug in the {{RecordBatchLoader}} class, 
> which did not consider map contents when detecting schema changes. As a 
> result, batches like (row_key, cf\{}) were treated the same as (row_key, 
> cf\{mycol}), and the actual data columns were discarded at random, depending 
> on batch arrival order.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
