[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

Navis (JIRA) Tue, 11 Dec 2012 21:53:26 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529641#comment-13529641
 ]


Navis commented on HIVE-3420:
-----------------------------

@Gang Deng 
This is pretty important issue. I'll make a patch for a review.
                
> Inefficiency in hbase handler when process query including rowkey range scan
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-3420
>                 URL: https://issues.apache.org/jira/browse/HIVE-3420
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>         Environment: Hive-0.9.0 + HBase-0.94.1
>            Reporter: Gang Deng
>            Priority: Critical
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When query hive with hbase rowkey range, hive map tasks do not leverage 
> startrow, endrow information in tablesplit. For example, if the rowkeys fit 
> into 5 hbase files, then where will be 5 map tasks. Ideally, each task will 
> process 1 file. But in current implementation, each task processes 5 files 
> repeatedly. The behavior not only waste network bandwidth, but also worse the 
> lock contention in HBase block cache as each task have to access the same 
> block. The problem code is in HiveHBaseTableInputFormat.convertFilte as below:
> ……
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         startRow,
>         stopRow,
>         tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(startRow);
>     scan.setStopRow(stopRow);
> ……
> As tableSplit already include startRow, endRow information of file, the 
> better implementation will be:
>         ……
>         byte[] splitStart = startRow;
>         byte[] splitStop = stopRow;
>     if (tableSplit != null) {
>                 
>            if(tableSplit.getStartRow() != null){
>                         splitStart = startRow.length == 0 ||
>           Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>             tableSplit.getStartRow() : startRow;
>                 }
>                 if(tableSplit.getEndRow() != null){
>                         splitStop = (stopRow.length == 0 ||
>           Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>           tableSplit.getEndRow().length > 0 ?
>             tableSplit.getEndRow() : stopRow;
>                 }                       
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         splitStart,
>         splitStop,
>         tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(splitStart);
>     scan.setStopRow(splitStop);
>         ……
> In my test, the changed code will improve performance more than 30%.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

Reply via email to