[
https://issues.apache.org/jira/browse/PHOENIX-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karan Mehta updated PHOENIX-4912:
---------------------------------
Affects Version/s: 4.15.0
> Make Table Sampling algorithm to accommodate to the imbalance row
> distribution across guide posts
> -------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-4912
> URL: https://issues.apache.org/jira/browse/PHOENIX-4912
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 5.0.0, 4.15.0
> Reporter: Bin Shi
> Assignee: Bin Shi
> Priority: Major
>
> The current implementation of table sampling is based on the assumption
> "Every two consecutive guide posts contains the equal number of rows" which
> isn't accurate in practice, and once we collect multiple versions of cells
> and the deleted rows, the thing will become worse.
> In details, the current implementation of table sampling is (see
> BaseResultIterators.getParallelScan() which calls sampleScans(...) at the end
> of function) as described below:
> # Iterate all parallel scans generated;
> # For each scan, if getHashHode(start row key of the scan) MOD 100 <
> tableSamplingRate (See TableSamplerPredicate.java) then pick this scan;
> otherwise discard this scan.
> The problem can be formalized as: We have a group of scans and each scan is
> defined as <the start row key denoted as Ki, the count of rows denoted as
> Ci>. Now we want to randomly pick X groups so that the sum of count of rows
> in the selected groups is close to Y, where Y = the total count of rows of
> all scans T * table sampling rate R.
> To resolve the above problem, one of algorithms that we can consider are
> described below:
> ArrayList<Scan> TableSampling(ArrayList<Scan> scans, T, R)
> {
> ArrayList<Scan> pickedScans = new ArrayList<Scan>();
> Y = T * R;
> for (scan<Ki, Ci> in scans) {
> if (Y <= 0) break;
> if (getHashCode(Ki) MOD 100 < R) {
> // then pick this scan, and adjust T, R, Y accordingly
> pickedScans.Add(scan);
> T -= Ci;
> Y -= Ci;
> if (T != 0 && Y > 0) {
> R = Y / T;
> }
> }
> }
> return pickedScans;
> }
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)