[ 
https://issues.apache.org/jira/browse/PHOENIX-7684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved PHOENIX-7684.
-----------------------------------
    Resolution: Fixed

> Introduce Segment Scan
> ----------------------
>
>                 Key: PHOENIX-7684
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7684
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>             Fix For: 5.3.0
>
>
> HBase stores rows of data in tables. Tables are split into groups of 
> lexicographically adjacent rows. These groups are called regions. By 
> lexicographically adjacent, all rows in the table that sort between the 
> region’s start row key and end row key are stored in the same region. Large 
> table with multi tera bytes or peta bytes of data can have their data spread 
> across hundreds of thousands of regions.
> Depending on the size of the table, full table scan can take from few seconds 
> to several hours. When client executes "SELECT * FROM <table>" query using 
> Phoenix JDBC client, single client is responsible for retrieving all the rows 
> from all the regions of the given table. Although, single Phoenix client does 
> divide the scan range into the table region ranges and submit the scans in 
> parallel. Single client application can still become bottleneck for the end 
> user when the table size is very large, as the single client application 
> always have limited memory, CPU and IO capabilities.
> The purpose of this Jira is to introduce segment scan. The segment is logical 
> chunk of the given table, similar to HBase table region. While scanning the 
> full table, the client application does not have any insight into how the 
> table data are distributed among the regions. The concept of segment scan 
> allows application to define the number of segments into which the table data 
> can be divided. A new function TOTAL_SEGMENTS() can be used to retrieve the 
> scan boundary of each segment. The retrieved scan boundaries can be provided 
> to individual client worker (thread or VM) so that a given worker only 
> performs scan of a segment of the table. Additional functions 
> SCAN_START_KEY() and SCAN_END_KEY() are used to retrieve the segment 
> boundaries and then later on use the boundaries to submit the segment scan 
> request.
> {code:java}
> SELECT SCAN_START_KEY(), SCAN_END_KEY() FROM T1 WHERE TOTAL_SEGMENTS() = 10 
> {code}
> The above SELECT query is meant to retrieve total 10 segment boundaries from 
> the table T1 by bucketing the table region boundaries into 10 ranges.
> e.g.
>  * If the table has 12 regions and client needs 4 segments: each segment will 
> contain 3 regions
>  * If the table has 10 regions and client needs 3 segments: you get segments 
> of sizes 4, 3, and 3 regions
>  * If the table has 3 regions and client needs 10 segments: you get 3 
> segments (one per region)
> One of the advantages of the segment scan: External frameworks like Spark, 
> Trino or MapReduce can now ask for "give me exactly 100 work units" instead 
> of dealing with an unknown number of HBase regions. The segment scan approach 
> provides horizontal scaling of the client workers.
> After retrieving the above segment boundaries, any of the segment can be 
> scanned using
> {code:java}
> SELECT * FROM <table> WHERE SCAN_START_KEY() = ? AND SCAN_END_KEY() = ?{code}
> The above query takes param values of SCAN_START_KEY() and SCAN_END_KEY() as 
> the VARBINARY values earlier retrieved from the first query. With the scan 
> boundaries as the given segment boundary, only the specific segment worth of 
> data will be scanned by the query.
> Overall, this feature provides deterministic data partitioning for parallel 
> processing, making Phoenix more suitable for integration with modern big data 
> processing frameworks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to