[ 
https://issues.apache.org/jira/browse/PHOENIX-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629258#comment-16629258
 ] 

Bin Shi edited comment on PHOENIX-4912 at 9/26/18 8:52 PM:
-----------------------------------------------------------

There are the following downsides of using the server side approach and 
reservoir sampling algorithm:
 # The time complexity will O(n) instead of O(m), where n is the number of rows 
satisfying the given predicate and m is n * table sampling rate x, so n >> m as 
x is always small. A table sampling query could be much heavier with the client 
side approach.
 # This approach needs to coordinate the sampling across multiple scans, so it 
might need to be serial plan otherwise extra complexity will be introduced for 
parallel plan. 

Microsoft ADLA (Azure Data Lake Analysis) Cloud uses the client approach – it 
uses Reservoir Sampling Algorithm on extent level (the same level of Block on 
HDFS, 128 or 256 MB by default). I believe the sampling query from Pig and Hive 
based on HDFS is also on HDFS Block level.

I second the current table sampling algorithm based on Scans/GuidePosts. 


was (Author: bin shi):
By using the server side approach, the time complexity will O(n) instead of 
O(m), where n is the number of rows to scan and m is the number of scans, so n 
>> m. A table sampling query could be much heavier than the client side 
approach.

Microsoft Cloud (ADLA) uses the client approach – it uses Reservoir Sampling 
Algorithm on extent level (the same level of Block on HDFS, 128 or 256 MB by 
default). I believe the sampling query from Pig and Hive based on HDFS is also 
on HDFS Block level.

I second the current table sampling algorithm based on Scans/GuidePosts. 

> Make Table Sampling algorithm to accommodate to the imbalance row 
> distribution across guide posts
> -------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-4912
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4912
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 5.0.0, 4.15.0
>            Reporter: Bin Shi
>            Assignee: Bin Shi
>            Priority: Major
>
> The current implementation of table sampling is based on the assumption 
> "Every two consecutive guide posts contains the equal number of rows" which 
> isn't accurate in practice, and once we collect multiple versions of cells 
> and the deleted rows, the thing will become worse.
> In details, the current implementation of table sampling is (see 
> BaseResultIterators.getParallelScan() which calls sampleScans(...) at the end 
> of function) as described below:
>  # Iterate all parallel scans generated;
>  # For each scan, if getHashHode(start row key of the scan) MOD 100 < 
> tableSamplingRate (See TableSamplerPredicate.java) then pick this scan; 
> otherwise discard this scan.
> The problem can be formalized as: We have a group of scans and each scan is 
> defined as <the start row key denoted as Ki, the count of rows denoted as 
> Ci>. Now we want to randomly pick X groups so that the sum of count of rows 
> in the selected groups is close to Y, where Y = the total count of rows of 
> all scans denoted as T * table sampling rate denoted as R (0 <= R <= 100) / 
> 100.00.
> To resolve the above problem, one of algorithms that we can consider are 
> described below. The core idea is to adjust T, R, Y after each pick, so the 
> new problem is a child problem of the original problem.
> {code:java}
> ArrayList<Scan> TableSampling(ArrayList<Scan> scans, T, R) {  
>     ArrayList<Scan> pickedScans = new ArrayList<Scan>();
>     Y = T * R / 100.00;
>     for (scan<Ki, Ci> in scans) {
>         if (Y <= 0) break;
>         if (getHashCode(Ki) MOD 100 < R) {
>             // then pick this scan, and adjust T, R, Y accordingly
>             pickedScans.Add(scan);
>             T -= Ci;
>             Y -= Ci;
>             if (T != 0 && Y > 0) { 
>                 R = 100.00 * Y / T;
>             }
>         }
>     }
>     return pickedScans;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to