[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

2013-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774096#comment-13774096
 ] 

Hudson commented on HIVE-3420:
--

FAILURE: Integrated in Hive-trunk-hadoop2 #451 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/451/])
HIVE-3420 : Inefficiency in hbase handler when process query including rowkey 
range scan (Navis via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525329)
* 
/hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java


> Inefficiency in hbase handler when process query including rowkey range scan
> 
>
> Key: HIVE-3420
> URL: https://issues.apache.org/jira/browse/HIVE-3420
> Project: Hive
>  Issue Type: Improvement
>  Components: HBase Handler
>Affects Versions: 0.9.0
> Environment: Hive-0.9.0 + HBase-0.94.1
>Reporter: Gang Deng
>Assignee: Navis
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: HIVE-3420.D7311.1.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When querying Hive with an HBase rowkey range, the Hive map tasks do not leverage 
> the startrow/endrow information in the TableSplit. For example, if the rowkeys fit 
> into 5 HBase files, then there will be 5 map tasks. Ideally, each task would 
> process 1 file. But in the current implementation, each task processes all 5 files 
> repeatedly. This behavior not only wastes network bandwidth, but also worsens 
> lock contention in the HBase block cache, as every task has to access the same 
> blocks. The problem code is in HiveHBaseTableInputFormat.convertFilter, as below:
> ……
> if (tableSplit != null) {
>   tableSplit = new TableSplit(
>       tableSplit.getTableName(),
>       startRow,
>       stopRow,
>       tableSplit.getRegionLocation());
> }
> scan.setStartRow(startRow);
> scan.setStopRow(stopRow);
> ……
> Since the tableSplit already includes the startRow/endRow information of the file, 
> a better implementation would be:
> ……
> byte[] splitStart = startRow;
> byte[] splitStop = stopRow;
> if (tableSplit != null) {
>   if (tableSplit.getStartRow() != null) {
>     splitStart = startRow.length == 0 ||
>         Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>             tableSplit.getStartRow() : startRow;
>   }
>   if (tableSplit.getEndRow() != null) {
>     splitStop = (stopRow.length == 0 ||
>         Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>         tableSplit.getEndRow().length > 0 ?
>             tableSplit.getEndRow() : stopRow;
>   }
>   tableSplit = new TableSplit(
>       tableSplit.getTableName(),
>       splitStart,
>       splitStop,
>       tableSplit.getRegionLocation());
> }
> scan.setStartRow(splitStart);
> scan.setStopRow(splitStop);
> ……
> In my test, the changed code improved performance by more than 30%.
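To make the intersection logic in the proposed fix concrete, here is a small self-contained sketch of the same idea, detached from the Hive/HBase classes: the effective scan range of a task is the overlap of the split's range and the predicate's range. The class and method names below are illustrative only (not the Hive/HBase API), `compareRows` mimics `Bytes.compareTo`, and an empty `byte[]` stands for "unbounded", following the HBase convention.

```java
// Sketch of the split/predicate range intersection from the patch.
public class RowRangeIntersect {

  // Unsigned lexicographic comparison, like org.apache.hadoop.hbase.util.Bytes.compareTo.
  static int compareRows(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }

  // Effective scan start: the larger of the split's start and the predicate's start.
  static byte[] intersectStart(byte[] splitStart, byte[] predStart) {
    if (splitStart.length == 0) return predStart;
    if (predStart.length == 0) return splitStart;
    return compareRows(splitStart, predStart) >= 0 ? splitStart : predStart;
  }

  // Effective scan stop: the smaller of the split's stop and the predicate's stop.
  static byte[] intersectStop(byte[] splitStop, byte[] predStop) {
    if (splitStop.length == 0) return predStop;
    if (predStop.length == 0) return splitStop;
    return compareRows(splitStop, predStop) <= 0 ? splitStop : predStop;
  }

  public static void main(String[] args) {
    byte[] predStart = "row100".getBytes();   // predicate range [row100, row500)
    byte[] predStop  = "row500".getBytes();
    byte[] splitStart = "row300".getBytes();  // this split's region covers [row300, row900)
    byte[] splitStop  = "row900".getBytes();

    byte[] s = intersectStart(splitStart, predStart);
    byte[] e = intersectStop(splitStop, predStop);
    // The task scans only the overlap of split and predicate.
    System.out.println(new String(s) + " .. " + new String(e)); // row300 .. row500
  }
}
```

With the unpatched code, the same task would scan the whole predicate range [row100, row500) regardless of its region bounds, which is exactly the repeated work the issue describes.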

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

2013-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774003#comment-13774003
 ] 

Hudson commented on HIVE-3420:
--

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #179 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/179/])
HIVE-3420 : Inefficiency in hbase handler when process query including rowkey 
range scan (Navis via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525329)
* 
/hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java




[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

2013-09-21 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773917#comment-13773917
 ] 

Phabricator commented on HIVE-3420:
---

ashutoshc has accepted the revision "HIVE-3420 [jira] Inefficiency in hbase 
handler when process query including rowkey range scan".

  +1

REVISION DETAIL
  https://reviews.facebook.net/D7311

BRANCH
  DPAL-1943

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, navis




[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

2013-06-17 Thread Navis (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685109#comment-13685109
 ] 

Navis commented on HIVE-3420:
-

For a table with two regions, I executed a simple query with PPD-able (predicate-pushdown-able) predicates:
input rows (without patch): 2,472 rows
input rows (with patch): 1,236 rows

For a large HBase table, this can make a big difference.
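As a sanity check on these numbers: assuming the table holds 1,236 matching rows spread over the two regions, each unpatched map task re-scans the full predicate range, so the total input rows scale with the region count, while the patched tasks' ranges partition the rows. A minimal sketch of that arithmetic:

```java
// Back-of-the-envelope model of the measurement above (1,236 matching rows, 2 regions).
public class InputRowsEstimate {
  static long inputRows(int regions, long matchingRows, boolean patched) {
    // Unpatched: every task re-scans the whole predicate range.
    // Patched: the per-region ranges partition the matching rows.
    return patched ? matchingRows : (long) regions * matchingRows;
  }

  public static void main(String[] args) {
    System.out.println(inputRows(2, 1236, false)); // 2472, matching the "without patch" figure
    System.out.println(inputRows(2, 1236, true));  // 1236, matching the "with patch" figure
  }
}
```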



[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

2012-12-11 Thread Navis (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529641#comment-13529641
 ] 

Navis commented on HIVE-3420:
-

@Gang Deng 
This is a pretty important issue. I'll make a patch for review.



[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

2012-10-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482075#comment-13482075
 ] 

Lianhui Wang commented on HIVE-3420:


@Gang Deng
Yes, I agree with you. In the InputFormat's getRecordReader(),

tableSplit = convertFilter(jobConf, scan, tableSplit, iKey,
    getStorageFormatOfKey(columnsMapping.get(iKey).mappingSpec,
        jobConf.get(HBaseSerDe.HBASE_TABLE_DEFAULT_STORAGE_TYPE, "string")));

it has already done

tableSplit = new TableSplit(
    tableSplit.getTableName(),
    startRow,
    stopRow,
    tableSplit.getRegionLocation(),
    tableSplit.getConf());

Also, in getSplits(), each tableSplit leads to one task per region location, yet at the moment the splits have no real effect, so the startRow/stopRow in a tableSplit should be clipped to the region row range that the tableSplit covers.

IMO, the convertFilter() logic is used in many places, for example:
HBaseStorageHandler.decomposePredicate()
HiveHBaseTableInputFormat.getSplits()
HiveHBaseTableInputFormat.getRecordReader()

I think it should be used in one place, HBaseStorageHandler.decomposePredicate(), which can store the row key ranges. Then HiveHBaseTableInputFormat.getSplits() and HiveHBaseTableInputFormat.getRecordReader() can split the key ranges into tasks according to the table's region info.

Does anyone else have ideas? Thanks.
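The proposal above, computing the predicate's key range once and then splitting it per region, can be sketched as follows. Everything here is hypothetical: the class and method names are not the actual Hive/HBase API, `cmp` mimics `Bytes.compareTo`, and an empty `byte[]` means "unbounded" on that side, following the HBase convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: clip one query key range against each region's
// [start, end) so that split generation emits one scan range per region.
public class RegionRangeSplitter {

  static int cmp(byte[] a, byte[] b) { // unsigned lexicographic, like Bytes.compareTo
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }

  static byte[] maxStart(byte[] a, byte[] b) { // larger start; empty = unbounded
    if (a.length == 0) return b;
    if (b.length == 0) return a;
    return cmp(a, b) >= 0 ? a : b;
  }

  static byte[] minStop(byte[] a, byte[] b) { // smaller stop; empty = unbounded
    if (a.length == 0) return b;
    if (b.length == 0) return a;
    return cmp(a, b) <= 0 ? a : b;
  }

  // regions: array of {regionStart, regionEnd}; returns one clipped
  // {start, stop} per region that overlaps the query range [qStart, qStop).
  static List<byte[][]> splitRange(byte[][][] regions, byte[] qStart, byte[] qStop) {
    List<byte[][]> out = new ArrayList<>();
    for (byte[][] r : regions) {
      byte[] s = maxStart(r[0], qStart);
      byte[] e = minStop(r[1], qStop);
      boolean empty = s.length != 0 && e.length != 0 && cmp(s, e) >= 0;
      if (!empty) out.add(new byte[][] { s, e });
    }
    return out;
  }

  public static void main(String[] args) {
    byte[][][] regions = {
        { new byte[0], "m".getBytes() },   // region 1: (-inf, m)
        { "m".getBytes(), new byte[0] },   // region 2: [m, +inf)
    };
    for (byte[][] r : splitRange(regions, "f".getBytes(), "t".getBytes())) {
      System.out.println(new String(r[0]) + " .. " + new String(r[1]));
    }
    // Region 1's task scans [f, m); region 2's task scans [m, t).
  }
}
```

Regions whose range misses the query entirely produce no scan range at all, so they could be skipped when generating splits, one step beyond the patch, which only clips the ranges of the splits that are emitted.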


