Viraj Jasani created PHOENIX-7684:
-------------------------------------
Summary: Introduce Segment Scan
Key: PHOENIX-7684
URL: https://issues.apache.org/jira/browse/PHOENIX-7684
Project: Phoenix
Issue Type: New Feature
Reporter: Viraj Jasani

HBase stores rows of data in tables. Tables are split into groups of
lexicographically adjacent rows, called regions. Lexicographically adjacent
means that all rows in the table that sort between the region’s start row key
and end row key are stored in the same region. Large tables with multiple
terabytes or petabytes of data can have their data spread across hundreds of
thousands of regions.
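
To make the region boundary rule concrete, here is a minimal Java sketch of
the containment check. It assumes row keys compare as unsigned bytes, that the
start key is inclusive and the end key exclusive, and that an empty key means
"unbounded". The class and method names are hypothetical and are not part of
HBase or Phoenix.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not an HBase or Phoenix class.
class RegionRangeSketch {

    // Converts a string to the bytes used as a row key in this sketch.
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Returns true if rowKey sorts within [startKey, endKey).
    // Assumption: an empty start or end key means that side is unbounded.
    static boolean contains(byte[] startKey, byte[] endKey, byte[] rowKey) {
        boolean atOrAfterStart =
            startKey.length == 0 || Arrays.compareUnsigned(rowKey, startKey) >= 0;
        boolean beforeEnd =
            endKey.length == 0 || Arrays.compareUnsigned(rowKey, endKey) < 0;
        return atOrAfterStart && beforeEnd;
    }

    public static void main(String[] args) {
        System.out.println(contains(key("b"), key("m"), key("cat")));  // true
        System.out.println(contains(key("b"), key("m"), key("m")));    // false
        System.out.println(contains(key(""), key("b"), key("apple"))); // true
    }
}
```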

Depending on the size of the table, a full table scan can take anywhere from a
few seconds to several hours. When a client executes a "SELECT * FROM <table>"
query using the Phoenix JDBC client, that single client is responsible for
retrieving all the rows from all the regions of the given table. Although a
single Phoenix client does divide the scan range into the table's region
ranges and submits the scans in parallel, the client application can still
become a bottleneck for the end user when the table is very large, because a
single client application always has limited memory, CPU and IO capacity.

The purpose of this Jira is to introduce segment scan. A segment is a logical
chunk of the given table, similar to an HBase table region. While scanning the
full table, the client application does not have any insight into how the
table data is distributed among the regions. The concept of segment scan
allows the application to define the number of segments into which the table
data is divided. A new function, TOTAL_SEGMENTS(), can be used to retrieve the
scan boundaries of each segment. The retrieved scan boundaries can be handed
to individual client workers (threads or VMs) so that a given worker scans
only one segment of the table. Additional functions, SCAN_START_KEY() and
SCAN_END_KEY(), are used first to retrieve the segment boundaries and later to
submit the segment scan request with those boundaries.
{code:java}
SELECT SCAN_START_KEY(), SCAN_END_KEY() FROM T1 WHERE TOTAL_SEGMENTS() = 10
{code}
The above SELECT query retrieves up to 10 segment boundaries from the table T1
by bucketing the table's region boundaries into 10 ranges. Fewer segments are
returned when the table has fewer regions than the requested number of
segments. For example:
* If the table has 12 regions and the client needs 4 segments: each segment
contains 3 regions
* If the table has 10 regions and the client needs 3 segments: the segments
contain 4, 3, and 3 regions
* If the table has 3 regions and the client needs 10 segments: only 3 segments
are returned (one per region)
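
The arithmetic behind these three examples can be sketched in Java as follows.
This is only one distribution rule that reproduces the numbers above (extra
regions go to the earliest segments); the issue does not spell out the exact
rule Phoenix uses, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Phoenix implementation.
class SegmentBucketingSketch {

    // Returns how many regions each segment would cover.
    // Assumption: the segment count is capped at the region count, and any
    // remainder is spread one region at a time over the first segments.
    static List<Integer> segmentSizes(int regionCount, int requestedSegments) {
        if (regionCount <= 0 || requestedSegments <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        int segments = Math.min(regionCount, requestedSegments);
        int base = regionCount / segments;
        int remainder = regionCount % segments;
        List<Integer> sizes = new ArrayList<>(segments);
        for (int i = 0; i < segments; i++) {
            sizes.add(base + (i < remainder ? 1 : 0));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(segmentSizes(12, 4)); // [3, 3, 3, 3]
        System.out.println(segmentSizes(10, 3)); // [4, 3, 3]
        System.out.println(segmentSizes(3, 10)); // [1, 1, 1]
    }
}
```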

One advantage of segment scan is that external frameworks such as Spark, Trino
or MapReduce can now ask for "give me exactly 100 work units" instead of
dealing with an unknown number of HBase regions. The segment scan approach
lets the client workers scale horizontally.

After retrieving the above segment boundaries, any of the segments can be
scanned using:
{code:java}
SELECT * FROM <table> WHERE SCAN_START_KEY() = ? AND SCAN_END_KEY() = ?
{code}
The above query takes the parameter values of SCAN_START_KEY() and
SCAN_END_KEY() as the VARBINARY values retrieved earlier by the first query.
With the scan boundaries set to the given segment's boundaries, the query
scans only that segment's worth of data.
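
Put together, the two queries could be driven from JDBC roughly as in the
sketch below. It uses only standard java.sql calls plus the functions proposed
in this Jira. The class, the helper names and the Segment holder are
hypothetical, the caller is assumed to supply an open Phoenix connection, and
how an unbounded first or last boundary is represented is left to the driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Phoenix.
class SegmentScanSketch {

    // Holds one segment's boundaries as returned by the first query.
    static final class Segment {
        final byte[] startKey;
        final byte[] endKey;

        Segment(byte[] startKey, byte[] endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Builds the boundary query from this Jira for a trusted table name.
    static String boundaryQuery(String table, int totalSegments) {
        return "SELECT SCAN_START_KEY(), SCAN_END_KEY() FROM " + table
            + " WHERE TOTAL_SEGMENTS() = " + totalSegments;
    }

    // Builds the per-segment scan query from this Jira.
    static String segmentQuery(String table) {
        return "SELECT * FROM " + table
            + " WHERE SCAN_START_KEY() = ? AND SCAN_END_KEY() = ?";
    }

    // Step 1: fetch the segment boundaries once, e.g. on a coordinator.
    static List<Segment> fetchSegments(Connection conn, String table,
                                       int totalSegments) throws SQLException {
        List<Segment> segments = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(boundaryQuery(table, totalSegments))) {
            while (rs.next()) {
                segments.add(new Segment(rs.getBytes(1), rs.getBytes(2)));
            }
        }
        return segments;
    }

    // Step 2: each worker scans exactly one segment; here it just counts rows.
    static long scanSegment(Connection conn, String table, Segment segment)
            throws SQLException {
        long rows = 0;
        try (PreparedStatement ps = conn.prepareStatement(segmentQuery(table))) {
            ps.setBytes(1, segment.startKey);
            ps.setBytes(2, segment.endKey);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows++;
                }
            }
        }
        return rows;
    }
}
```

In this sketch a coordinator would call fetchSegments once and hand each
Segment to a separate worker (thread or VM), which opens its own connection
and calls scanSegment.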

Overall, this feature provides deterministic data partitioning for parallel
processing, making Phoenix more suitable for integration with modern big data
processing frameworks.