[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan
[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774096#comment-13774096 ]

Hudson commented on HIVE-3420:
------------------------------

FAILURE: Integrated in Hive-trunk-hadoop2 #451 (See [https://builds.apache.org/job/Hive-trunk-hadoop2/451/])
HIVE-3420 : Inefficiency in hbase handler when process query including rowkey range scan (Navis via Ashutosh Chauhan)
(hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525329)
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java

> Inefficiency in hbase handler when process query including rowkey range scan
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-3420
>                 URL: https://issues.apache.org/jira/browse/HIVE-3420
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>         Environment: Hive-0.9.0 + HBase-0.94.1
>            Reporter: Gang Deng
>            Assignee: Navis
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: HIVE-3420.D7311.1.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When querying Hive with an HBase rowkey range, the Hive map tasks do not
> leverage the startRow/endRow information in the TableSplit. For example, if
> the rowkeys fit into 5 HBase regions, there will be 5 map tasks. Ideally,
> each task would process 1 region, but in the current implementation every
> task scans all 5 ranges repeatedly. This not only wastes network bandwidth,
> but also worsens lock contention in the HBase block cache, since each task
> has to access the same blocks.
> The problem code is in HiveHBaseTableInputFormat.convertFilter, as below:
>
>     ......
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>           tableSplit.getTableName(),
>           startRow,
>           stopRow,
>           tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(startRow);
>     scan.setStopRow(stopRow);
>     ......
>
> Since the TableSplit already includes the startRow/endRow of its region, a
> better implementation intersects the split's range with the predicate's range:
>
>     ......
>     byte[] splitStart = startRow;
>     byte[] splitStop = stopRow;
>     if (tableSplit != null) {
>       if (tableSplit.getStartRow() != null) {
>         splitStart = startRow.length == 0 ||
>             Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>                 tableSplit.getStartRow() : startRow;
>       }
>       if (tableSplit.getEndRow() != null) {
>         splitStop = (stopRow.length == 0 ||
>             Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>             tableSplit.getEndRow().length > 0 ?
>                 tableSplit.getEndRow() : stopRow;
>       }
>       tableSplit = new TableSplit(
>           tableSplit.getTableName(),
>           splitStart,
>           splitStop,
>           tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(splitStart);
>     scan.setStopRow(splitStop);
>     ......
>
> In my test, the changed code improved performance by more than 30%.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
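The range intersection in the fix quoted above can be exercised outside Hive/HBase. The sketch below is mine, not Hive code: the class and method names are hypothetical, the compare method mimics HBase's Bytes.compareTo (unsigned lexicographic order), and, as in the patch, an empty byte array stands for an unbounded start or stop key.

```java
/** Sketch of the split/predicate row-range intersection from HIVE-3420 (hypothetical names). */
public class RowRangeIntersect {

    /** Unsigned lexicographic compare, mimicking HBase's Bytes.compareTo. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    /** Larger of the two start keys; an empty predicate start means "scan from the beginning". */
    static byte[] intersectStart(byte[] splitStart, byte[] predStart) {
        if (splitStart == null) return predStart;
        return (predStart.length == 0 || compare(splitStart, predStart) >= 0)
            ? splitStart : predStart;
    }

    /** Smaller of the two stop keys; empty keys mean "scan to the end of the table". */
    static byte[] intersectStop(byte[] splitStop, byte[] predStop) {
        if (splitStop == null) return predStop;
        return ((predStop.length == 0 || compare(splitStop, predStop) <= 0)
                && splitStop.length > 0)
            ? splitStop : predStop;
    }

    public static void main(String[] args) {
        byte[] predStart = "row100".getBytes();
        byte[] predStop  = "row500".getBytes();
        // A region split covering row300..row900 only needs row300..row500 of the predicate range.
        System.out.println(new String(intersectStart("row300".getBytes(), predStart))); // row300
        System.out.println(new String(intersectStop("row900".getBytes(), predStop)));   // row500
    }
}
```

Note how the `splitStop.length > 0` guard matches the patch: the last region of a table has an empty end key, and without the guard that empty key would be mistaken for a stop bound smaller than the predicate's.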
[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan
[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774003#comment-13774003 ]

Hudson commented on HIVE-3420:
------------------------------

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #179 (See [https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/179/])
HIVE-3420 : Inefficiency in hbase handler when process query including rowkey range scan (Navis via Ashutosh Chauhan)
(hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525329)
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan
[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773917#comment-13773917 ]

Phabricator commented on HIVE-3420:
-----------------------------------

ashutoshc has accepted the revision "HIVE-3420 [jira] Inefficiency in hbase handler when process query including rowkey range scan".

  +1

REVISION DETAIL
  https://reviews.facebook.net/D7311

BRANCH
  DPAL-1943

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, navis
[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan
[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685109#comment-13685109 ]

Navis commented on HIVE-3420:
-----------------------------

For a table with two regions, I executed a simple query with a PPD-able predicate:

  input rows (without patch): 2,472 rows
  input rows (with patch):    1,236 rows

Without the patch, each of the two map tasks scans the full predicate range, so the input doubles (2 x 1,236 = 2,472). For a large HBase table, this can make a big difference.
[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan
[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529641#comment-13529641 ]

Navis commented on HIVE-3420:
-----------------------------

@Gang Deng This is a pretty important issue. I'll make a patch for review.
[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan
[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482075#comment-13482075 ]

Lianhui Wang commented on HIVE-3420:
------------------------------------

@Gang Deng Yes, I agree with you. In the InputFormat's getRecordReader() we call

  tableSplit = convertFilter(jobConf, scan, tableSplit, iKey,
      getStorageFormatOfKey(columnsMapping.get(iKey).mappingSpec,
          jobConf.get(HBaseSerDe.HBASE_TABLE_DEFAULT_STORAGE_TYPE, "string")));

which then does

  tableSplit = new TableSplit(
      tableSplit.getTableName(),
      startRow,
      stopRow,
      tableSplit.getRegionLocation(),
      tableSplit.getConf());

Also, in getSplits() each TableSplit leads to one task at its region location, yet right now the splits' ranges have no effect: the startRow/stopRow set on a split should stay inside that split's region row range, but they don't.

IMO, the convertFilter() logic is used in many places, for example:

  HBaseStorageHandler.decomposePredicate()
  HiveHBaseTableInputFormat.getSplits()
  HiveHBaseTableInputFormat.getRecordReader()

I think it should live in one place, namely HBaseStorageHandler.decomposePredicate(), which can compute and store the row key ranges. Then getSplits() and getRecordReader() can split the key ranges into tasks according to the table's region info. Does anyone else have ideas? Thanks.
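The consolidation idea above, namely compute the predicate's key range once and let split planning prune and clip regions against it, can be sketched with plain ranges. Everything here is illustrative, not Hive's actual API: class and method names are mine, and String keys stand in for row-key byte arrays (for ASCII keys, String order matches HBase's byte order).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch (not Hive code): plan one clipped scan per region that overlaps a predicate range. */
public class SplitPruning {

    /** An empty string stands for an unbounded start/stop key, as in HBase. */
    static boolean unbounded(String key) { return key.isEmpty(); }

    /** Does the region range [regStart, regStop) overlap the predicate range [predStart, predStop)? */
    static boolean overlaps(String regStart, String regStop, String predStart, String predStop) {
        boolean startsBeforePredEnd = unbounded(predStop) || regStart.compareTo(predStop) < 0;
        boolean endsAfterPredStart =
            unbounded(regStop) || unbounded(predStart) || regStop.compareTo(predStart) > 0;
        return startsBeforePredEnd && endsAfterPredStart;
    }

    /** One entry per surviving region: the clipped [start, stop) range its map task would scan. */
    static List<String[]> planScans(String[][] regions, String predStart, String predStop) {
        List<String[]> plan = new ArrayList<>();
        for (String[] r : regions) {
            if (!overlaps(r[0], r[1], predStart, predStop)) continue; // pruned at planning time
            // Clip: take the larger start and the smaller stop, treating "" as unbounded.
            String start = unbounded(predStart)
                || (!unbounded(r[0]) && r[0].compareTo(predStart) >= 0) ? r[0] : predStart;
            String stop = unbounded(predStop)
                || (!unbounded(r[1]) && r[1].compareTo(predStop) <= 0) ? r[1] : predStop;
            plan.add(new String[]{start, stop});
        }
        return plan;
    }

    public static void main(String[] args) {
        // Five regions; a predicate selecting row250..row650 keeps only three, each clipped.
        String[][] regions = {
            {"", "row200"}, {"row200", "row400"}, {"row400", "row600"},
            {"row600", "row800"}, {"row800", ""}};
        for (String[] s : planScans(regions, "row250", "row650")) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

With this layout, the first and last regions produce no task at all, which is the extra win over the committed per-split clipping: clipping alone still schedules a (near-empty) scan for every region.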