[jira] [Commented] (HBASE-7474) Endpoint Implementation to support Scans with Sorting of Rows based on column values(similar to "order by" clause of RDBMS)

Anil Gupta (JIRA) Wed, 02 Jan 2013 16:28:13 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542597#comment-13542597
 ]


Anil Gupta commented on HBASE-7474:
-----------------------------------

[~tlipcon]
Hi Todd,
Let's walk through an example. I hope you have gone through the doc attached 
to the Jira.

Example: A table has 20 million rows divided among 10 regions (2 million rows 
per region). 
We want to sort on a column that stores a Long value and get the 20 max values. 
500K rows satisfy the scan filters.

Case 1: If the scan doesn't span multiple regions, then
Case 1.1: No Coprocessor
RegionServer needs to transfer 500K rows across the network to the client.
Case 1.2: With Coprocessor
RegionServer will sort the 500K rows and return only the top 20 rows.

Case 2: If the scan spans multiple regions, let's assume that 250K rows in 
region1 and 250K rows in region2 satisfy the scan filters.
Case 2.1: No Coprocessor
Region1 will transfer 250K rows to the client.
Region2 will transfer 250K rows to the client.
The client will sort the 500K rows and pick the top 20.
Case 2.2: With Coprocessor
Region1 will sort its 250K rows and return only the top 20 rows to the client.
Region2 will sort its 250K rows and return only the top 20 rows to the client.
The client will perform the merge sort on the results from region1 and region2.
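
The two-step flow in the coprocessor case can be sketched in plain Java. This is 
only an illustration under stated assumptions, not the code from the patch: it 
uses no HBase APIs, the class and method names (TopNSketch, topN, mergeTopN) are 
hypothetical, and each row is reduced to the Long value being sorted on. The 
region-side step keeps a bounded min-heap of size N; the client-side step simply 
re-runs the same selection over the per-region candidates instead of doing a 
true merge sort.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopNSketch {

    // Region-side step (hypothetical): keep only the n largest values seen.
    // The min-heap never holds more than n entries; its head is the smallest
    // of the current top n, so a new value only enters if it beats the head.
    static List<Long> topN(Iterable<Long> values, int n) {
        List<Long> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (long v : values) {
            if (heap.size() < n) {
                heap.add(v);
            } else if (v > heap.peek()) {
                heap.poll();
                heap.add(v);
            }
        }
        result.addAll(heap);
        result.sort(Collections.reverseOrder()); // largest first
        return result;
    }

    // Client-side step (hypothetical): combine the per-region top-n lists.
    // With 2 regions and n = 20 this looks at only 40 candidates.
    static List<Long> mergeTopN(List<List<Long>> perRegion, int n) {
        List<Long> candidates = new ArrayList<>();
        for (List<Long> regionResult : perRegion) {
            candidates.addAll(regionResult);
        }
        return topN(candidates, n);
    }

    public static void main(String[] args) {
        // Stand-ins for the 250K matching rows in each of the two regions.
        List<Long> region1 = new ArrayList<>();
        List<Long> region2 = new ArrayList<>();
        for (long i = 0; i < 250_000; i++) {
            region1.add(i);
            region2.add(250_000 + i);
        }
        List<List<Long>> perRegion = new ArrayList<>();
        perRegion.add(topN(region1, 20));
        perRegion.add(topN(region2, 20));
        System.out.println(mergeTopN(perRegion, 20));
    }
}
```

The bounded heap means each region holds at most N values in memory for the 
selection, regardless of how many rows match the scan filters.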

The network I/O difference is huge (500K rows versus 20 to 40 rows in the 
example above). IMO, it is not possible to implement sorting in HBase without 
a coprocessor. The client will keep on dying due to network I/O and extreme 
memory load if we don't do server-side processing.
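
The row counts from the example can be checked with a few lines of Java. This is 
just the arithmetic from the cases above; the class and method names are 
hypothetical, and it counts rows only, not bytes.

```java
class TransferSketch {

    // Without a coprocessor, every matching row travels to the client.
    static long rowsWithoutCoprocessor(long[] matchingPerRegion) {
        long total = 0;
        for (long m : matchingPerRegion) {
            total += m;
        }
        return total;
    }

    // With a coprocessor, each region returns at most n rows.
    static long rowsWithCoprocessor(long[] matchingPerRegion, int n) {
        long total = 0;
        for (long m : matchingPerRegion) {
            total += Math.min(m, (long) n);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] case2 = {250_000L, 250_000L}; // region1 and region2
        System.out.println(rowsWithoutCoprocessor(case2));  // prints 500000
        System.out.println(rowsWithCoprocessor(case2, 20)); // prints 40
    }
}
```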

I understand your concern that it's an extra load on the server side. But 
currently there is no better way to achieve it.

If you have a better idea for implementing this in HBase, I would be glad 
to have a look at it.

Lastly, it's a coprocessor, so it won't be enabled by default. Users who need it 
will enable it and do their due diligence in tuning the cluster for 
their use case.

Thanks,
Anil Gupta
Software Engineer II, Intuit, inc

                
> Endpoint Implementation to support Scans with Sorting of Rows based on column 
> values(similar to "order by" clause of RDBMS)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7474
>                 URL: https://issues.apache.org/jira/browse/HBASE-7474
>             Project: HBase
>          Issue Type: New Feature
>          Components: Coprocessors, Scanners
>    Affects Versions: 0.94.3
>            Reporter: Anil Gupta
>            Priority: Minor
>              Labels: coprocessors, scan, sort
>             Fix For: 0.94.5
>
>         Attachments: SortingEndpoint_high_level_flowchart.pdf
>
>
> Recently, I have developed an Endpoint which can sort the Results (rows) on 
> the basis of column values. This functionality is similar to the "order by" 
> clause of an RDBMS. I will be submitting this patch for HBase 0.94.3.
> I am almost done with the initial development and testing of the feature. But I 
> need to write the JUnits for this. I will also try to make a design doc.
> Thanks,
> Anil Gupta
> Software Engineer II, Intuit, inc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
