[ https://issues.apache.org/jira/browse/PHOENIX-7684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani resolved PHOENIX-7684.
-----------------------------------
    Resolution: Fixed

> Introduce Segment Scan
> ----------------------
>
>                 Key: PHOENIX-7684
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7684
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>             Fix For: 5.3.0
>
>
> HBase stores rows of data in tables. Tables are split into groups of
> lexicographically adjacent rows, called regions. Lexicographically adjacent
> means that all rows that sort between a region’s start row key and end row
> key are stored in the same region. A large table holding multiple terabytes
> or petabytes of data can have its data spread across hundreds of thousands
> of regions.
> Depending on the size of the table, a full table scan can take from a few
> seconds to several hours. When a client executes a "SELECT * FROM <table>"
> query using the Phoenix JDBC client, that single client is responsible for
> retrieving all the rows from all the regions of the table. Although a single
> Phoenix client divides the scan range into the table's region ranges and
> submits the scans in parallel, it can still become a bottleneck for the end
> user when the table is very large, because a single client application
> always has limited memory, CPU, and I/O capacity.
> The purpose of this Jira is to introduce the segment scan. A segment is a
> logical chunk of a table, similar to an HBase table region. While scanning
> the full table, the client application has no insight into how the table
> data is distributed among the regions. The segment scan lets the application
> define the number of segments into which the table data is divided. A new
> function, TOTAL_SEGMENTS(), is used to request that number of segments.
> The retrieved scan boundaries can be handed to individual client workers
> (threads or VMs) so that each worker scans only one segment of the table.
> Two additional functions, SCAN_START_KEY() and SCAN_END_KEY(), are used
> first to retrieve the segment boundaries and later to submit a segment scan
> request with those boundaries.
> {code:java}
> SELECT SCAN_START_KEY(), SCAN_END_KEY() FROM T1 WHERE TOTAL_SEGMENTS() = 10
> {code}
> The above SELECT query retrieves a total of 10 segment boundaries from the
> table T1 by bucketing the table's region boundaries into 10 ranges.
> For example:
> * If the table has 12 regions and the client needs 4 segments: each segment
> contains 3 regions
> * If the table has 10 regions and the client needs 3 segments: the segments
> contain 4, 3, and 3 regions
> * If the table has 3 regions and the client needs 10 segments: the client
> gets 3 segments (one per region)
> One advantage of the segment scan is that external frameworks such as Spark,
> Trino, or MapReduce can now ask for "give me exactly 100 work units" instead
> of dealing with an unknown number of HBase regions. The segment scan approach
> allows the client workers to scale horizontally.
> After retrieving the above segment boundaries, any of the segments can be
> scanned using:
> {code:java}
> SELECT * FROM <table> WHERE SCAN_START_KEY() = ? AND SCAN_END_KEY() = ?{code}
> The parameter values for SCAN_START_KEY() and SCAN_END_KEY() in the above
> query are the VARBINARY values retrieved by the first query. With the scan
> boundaries set to a segment's boundaries, the query scans only that
> segment's data.
> Overall, this feature provides deterministic data partitioning for parallel
> processing, making Phoenix more suitable for integration with modern big data
> processing frameworks.
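
The bucketing examples in the issue description (12 regions into 4 segments, 10 into 3, and 3 into 10) can be reproduced with a small sketch. This is not Phoenix code: the class `SegmentSketch` and the method `segmentSizes` are hypothetical names, and both the even-split rule and the choice to give leftover regions to the leading segments are assumptions chosen only because they match the three examples. The actual implementation may distribute regions differently.

```java
import java.util.Arrays;

// Hypothetical illustration, not part of Phoenix: one way to bucket
// regions into segments that is consistent with the examples in the issue.
class SegmentSketch {

    // Returns how many regions each segment would contain.
    static int[] segmentSizes(int regionCount, int requestedSegments) {
        if (regionCount <= 0 || requestedSegments <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        // Assumption: a segment never splits a region, so the number of
        // segments is capped at the number of regions.
        int segments = Math.min(regionCount, requestedSegments);
        int base = regionCount / segments;
        int extra = regionCount % segments;
        int[] sizes = new int[segments];
        for (int i = 0; i < segments; i++) {
            // Assumption: leftover regions go to the leading segments.
            sizes[i] = base + (i < extra ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(segmentSizes(12, 4)));
        System.out.println(Arrays.toString(segmentSizes(10, 3)));
        System.out.println(Arrays.toString(segmentSizes(3, 10)));
    }
}
```

Under these assumptions, a client that requests more segments than there are regions should size its worker pool from the number of boundary rows actually returned, not from the number it asked for.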