[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HIVE-3420:
------------------------------

    Attachment: HIVE-3420.D7311.1.patch

navis requested code review of "HIVE-3420 [jira] Inefficiency in hbase handler
when processing a query including a rowkey range scan".
Reviewers: JIRA

  DPAL-1943 Inefficiency in hbase handler when processing a query including a
rowkey range scan

  When querying Hive with an HBase rowkey range, Hive map tasks do not leverage
the startRow/endRow information in the TableSplit. For example, if the rowkeys
fit into 5 HBase files, then there will be 5 map tasks. Ideally, each task
would process 1 file. But in the current implementation, each task processes
all 5 files repeatedly. This behavior not only wastes network bandwidth, but
also worsens lock contention in the HBase block cache, as every task has to
access the same blocks. The problem code is in
HiveHBaseTableInputFormat.convertFilter, as below:
  ……
      if (tableSplit != null) {
        tableSplit = new TableSplit(
          tableSplit.getTableName(),
          startRow,
          stopRow,
          tableSplit.getRegionLocation());
      }
      scan.setStartRow(startRow);
      scan.setStopRow(stopRow);
  ……
  As the tableSplit already includes the startRow/endRow information of the
file, a better implementation would be:

      ……
      byte[] splitStart = startRow;
      byte[] splitStop = stopRow;
      if (tableSplit != null) {
        if (tableSplit.getStartRow() != null) {
          splitStart = startRow.length == 0 ||
            Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
              tableSplit.getStartRow() : startRow;
        }
        if (tableSplit.getEndRow() != null) {
          splitStop = (stopRow.length == 0 ||
            Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
            tableSplit.getEndRow().length > 0 ?
              tableSplit.getEndRow() : stopRow;
        }
        tableSplit = new TableSplit(
          tableSplit.getTableName(),
          splitStart,
          splitStop,
          tableSplit.getRegionLocation());
      }
      scan.setStartRow(splitStart);
      scan.setStopRow(splitStop);
      ……
  In my test, the changed code improved performance by more than 30%.
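
  For illustration, below is a minimal, self-contained sketch of the same
range-intersection idea. It is not part of the patch: the class and helper
names are made up, and plain java.util.Arrays.compareUnsigned (Java 9+) stands
in for HBase's Bytes.compareTo, which uses the same unsigned lexicographic
order. An empty byte[] denotes an unbounded start/stop row, matching HBase
Scan semantics.

      import java.util.Arrays;

      // Hypothetical standalone sketch: clip the query's row range to a
      // split's own range. An empty byte[] means "unbounded", following
      // the HBase Scan start/stop-row convention.
      public class RangeIntersect {

        // Unsigned lexicographic compare, analogous to Bytes.compareTo.
        static int compare(byte[] a, byte[] b) {
          return Arrays.compareUnsigned(a, b);
        }

        // Later (larger) of the two start rows; empty means unbounded below.
        static byte[] intersectStart(byte[] queryStart, byte[] splitStart) {
          if (splitStart.length == 0) return queryStart;
          if (queryStart.length == 0) return splitStart;
          return compare(splitStart, queryStart) >= 0 ? splitStart : queryStart;
        }

        // Earlier (smaller) of the two stop rows; empty means unbounded above.
        static byte[] intersectStop(byte[] queryStop, byte[] splitStop) {
          if (splitStop.length == 0) return queryStop;
          if (queryStop.length == 0) return splitStop;
          return compare(splitStop, queryStop) <= 0 ? splitStop : queryStop;
        }

        public static void main(String[] args) {
          byte[] queryStart = "row100".getBytes();
          byte[] queryStop  = "row500".getBytes();
          byte[] splitStart = "row300".getBytes();  // this split starts here
          byte[] splitStop  = new byte[0];          // last split: open-ended

          // The task owning this split now scans only [row300, row500)
          // instead of the full query range [row100, row500).
          System.out.println(new String(intersectStart(queryStart, splitStart)));
          System.out.println(new String(intersectStop(queryStop, splitStop)));
        }
      }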

TEST PLAN
  EMPTY

REVISION DETAIL
  https://reviews.facebook.net/D7311

AFFECTED FILES
  
hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java

MANAGE HERALD DIFFERENTIAL RULES
  https://reviews.facebook.net/herald/view/differential/

WHY DID I GET THIS EMAIL?
  https://reviews.facebook.net/herald/transcript/17415/

To: JIRA, navis

                
> Inefficiency in hbase handler when processing a query including a rowkey range scan
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-3420
>                 URL: https://issues.apache.org/jira/browse/HIVE-3420
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>         Environment: Hive-0.9.0 + HBase-0.94.1
>            Reporter: Gang Deng
>            Priority: Critical
>         Attachments: HIVE-3420.D7311.1.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
