[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529641#comment-13529641 ]
Navis commented on HIVE-3420:
-----------------------------

@Gang Deng This is a pretty important issue. I'll make a patch for review.

> Inefficiency in hbase handler when process query including rowkey range scan
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-3420
>                 URL: https://issues.apache.org/jira/browse/HIVE-3420
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 0.9.0
>         Environment: Hive-0.9.0 + HBase-0.94.1
>            Reporter: Gang Deng
>            Priority: Critical
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When a Hive query filters on an HBase rowkey range, the Hive map tasks do not
> use the startRow and endRow information already present in each TableSplit.
> For example, if the matching rowkeys span 5 HBase files, there will be 5 map
> tasks. Ideally, each task would process 1 file. In the current implementation,
> however, every task scans all 5 files. This not only wastes network bandwidth
> but also worsens lock contention in the HBase block cache, because every task
> has to access the same blocks. The problem code is in
> HiveHBaseTableInputFormat.convertFilter, as below:
> ……
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         startRow,
>         stopRow,
>         tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(startRow);
>     scan.setStopRow(stopRow);
> ……
> As tableSplit already includes the startRow and endRow of its file, a better
> implementation would be:
> ……
>     byte[] splitStart = startRow;
>     byte[] splitStop = stopRow;
>     if (tableSplit != null) {
>
>       if (tableSplit.getStartRow() != null) {
>         splitStart = startRow.length == 0 ||
>           Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>             tableSplit.getStartRow() : startRow;
>       }
>       if (tableSplit.getEndRow() != null) {
>         splitStop = (stopRow.length == 0 ||
>           Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>           tableSplit.getEndRow().length > 0 ?
>             tableSplit.getEndRow() : stopRow;
>       }
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         splitStart,
>         splitStop,
>         tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(splitStart);
>     scan.setStopRow(splitStop);
> ……
> In my test, the changed code improved performance by more than 30%.
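
The idea behind the proposed change is to intersect the row range from the query filter with the row range of each split, so that a map task scans only its own part of the table. The following is a minimal, self-contained sketch of that intersection and is not the patch itself. The class `ScanRangeClamp` and its methods are hypothetical names introduced here for illustration. The `compare` method is a plain-Java stand-in for the `Bytes.compareTo` call used in the description. The sketch assumes ranges are half-open (`start` inclusive, `stop` exclusive) and that an empty key means "unbounded". Unlike the snippet in the description, it also reports when the two ranges do not overlap at all.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Hive or HBase. */
final class ScanRangeClamp {

    /** Unsigned lexicographic comparison (stand-in for HBase's Bytes.compareTo). */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /** Later of two start keys. An empty key sorts first, so it never wins over a real key. */
    static byte[] laterStart(byte[] a, byte[] b) {
        return compare(a, b) >= 0 ? a : b;
    }

    /** Earlier of two stop keys. An empty key means "to the last row", so the other key wins. */
    static byte[] earlierStop(byte[] a, byte[] b) {
        if (a.length == 0) {
            return b;
        }
        if (b.length == 0) {
            return a;
        }
        return compare(a, b) <= 0 ? a : b;
    }

    /**
     * Intersects the filter range with the split range.
     * Returns {start, stop}, or null when the two ranges do not overlap.
     */
    static byte[][] clamp(byte[] filterStart, byte[] filterStop,
                          byte[] splitStart, byte[] splitStop) {
        byte[] start = laterStart(filterStart, splitStart);
        byte[] stop = earlierStop(filterStop, splitStop);
        if (stop.length > 0 && compare(start, stop) >= 0) {
            return null; // empty intersection: this split has nothing to scan
        }
        return new byte[][] {start, stop};
    }

    private static byte[] b(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Filter asks for [c, x); this split covers [a, m).
        byte[][] r = clamp(b("c"), b("x"), b("a"), b("m"));
        System.out.println(new String(r[0], StandardCharsets.UTF_8)
                + " .. " + new String(r[1], StandardCharsets.UTF_8));
        // prints: c .. m
    }
}
```

With a filter range of `[c, x)` and a split range of `[a, m)`, the task would scan only `[c, m)`. A split covering `[a, m)` against a filter of `[n, z)` yields `null`, which a caller could use to skip that split entirely.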