[jira] [Commented] (PHOENIX-4594) Perform binary search on guideposts during query compilation

Bin Shi (JIRA) Thu, 25 Oct 2018 21:43:38 -0700


    [ 
https://issues.apache.org/jira/browse/PHOENIX-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664645#comment-16664645
 ]


Bin Shi commented on PHOENIX-4594:
----------------------------------

[~lhofhansl], let me continue the discussion of the prefix encoding.

I evaluated the benefit of compressing guideposts with prefix encoding. You can 
find the details at 
[https://docs.google.com/document/d/1NIS65g-CKY5HEmUkQZvCbHFzUykLb8KPHX-9i65XVwk/edit?ts=5bd23e7c#.]
 

Below is a summary.
h1. *Summary of Evaluation*
h2. *Summary*

I evaluated the benefit mainly on the following typical types of data:
 # Case 1: Primary Key is Sequence in INT (4 Bytes)
 ## When GUIDEPOSTS_WIDTH is 100MB, even in the ideal case, the data size 
actually increased by 7.14% after compression.
 ## When GUIDEPOSTS_WIDTH is 10MB, even in the ideal case, the data size 
actually increased by 3.6% after compression.


 # Case 2: Primary Key is Sequence in BIGINT (8 Bytes)
 ## When GUIDEPOSTS_WIDTH is 100MB, in the ideal case, the data size was 
reduced by 6.25% after compression.
 ## When GUIDEPOSTS_WIDTH is 10MB, in the ideal case, the data size was 
reduced by 9.4% after compression.


 # Case 3: Real Data From Platform Team

With the data known so far, the lower bound of the size reduction from prefix 
encoding is roughly in the range of 10% to 45%. I’ll keep refining this part 
of the calculation as I learn more about the real data.
 # Case 4: Primary Key is Reverse URL

This is a typical use case of BigTable/HBase, although Salesforce might not 
have it. I don’t have real data for this case, but intuitively, this is likely 
one of the cases where prefix encoding achieves the most benefit.
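
To make the contrast between the cases above concrete, here is a minimal sketch 
of how prefix encoding behaves on two kinds of sorted keys. Everything in it is 
an assumption for illustration: the per-key layout (shared-prefix length, 
suffix length, suffix), the one-byte length fields, the baseline of one length 
field per raw key, and the sample keys. It is not the layout Phoenix uses and 
it is not meant to reproduce the percentages above.

```python
import struct

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def raw_size(keys, len_bytes=1):
    """Assumed baseline: every key stored whole behind one length field."""
    return sum(len_bytes + len(k) for k in keys)

def prefix_encoded_size(keys, len_bytes=1):
    """Assumed layout per key: shared-prefix length, suffix length, suffix."""
    total, prev = 0, b""
    for key in keys:
        shared = common_prefix_len(prev, key)
        total += 2 * len_bytes + (len(key) - shared)
        prev = key
    return total

# Hypothetical guideposts: widely spaced INT keys vs. reverse-URL keys.
int_keys = [struct.pack(">i", i * 1_000_000) for i in range(1000)]
url_keys = [f"com.example.www/page/{i:06d}".encode() for i in range(1000)]

for name, keys in (("INT", int_keys), ("reverse URL", url_keys)):
    print(name, raw_size(keys), prefix_encoded_size(keys))
```

Under these assumptions the short INT keys come out larger after encoding, 
because the extra length field costs more than the shared prefix saves, while 
the long reverse-URL keys shrink considerably.
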
h2. *Takeaway*
 # We should allow customers to choose among different compression algorithms 
or encoding schemes, and make the choice configurable.

Obviously, Case 1 and Case 2 are negative cases for prefix encoding. As Jacob 
pointed out, double-delta encoding should be used here. Even for Case 3 and 
Case 4, prefix encoding might not offer the best tradeoff between performance 
and compression ratio.
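
For reference, a minimal sketch of double-delta (delta-of-delta) encoding on 
integer keys. The function names and the tuple returned are hypothetical, and 
a real encoder would also bit-pack the second-order deltas; this only shows 
why evenly spaced sequence keys suit the scheme.

```python
def double_delta_encode(values):
    """Return (first value, first delta, list of second-order deltas)."""
    if not values:
        return None, None, []
    if len(values) == 1:
        return values[0], None, []
    first_delta = values[1] - values[0]
    second_order = []
    prev_delta = first_delta
    for prev, cur in zip(values[1:], values[2:]):
        delta = cur - prev
        second_order.append(delta - prev_delta)
        prev_delta = delta
    return values[0], first_delta, second_order

def double_delta_decode(first, first_delta, second_order):
    """Inverse of double_delta_encode."""
    if first is None:
        return []
    out = [first]
    if first_delta is None:
        return out
    out.append(first + first_delta)
    delta = first_delta
    for d in second_order:
        delta += d
        out.append(out[-1] + delta)
    return out

# Hypothetical guideposts on a sequence key, almost evenly spaced.
guideposts = [100, 200, 300, 400, 510]
encoded = double_delta_encode(guideposts)
assert double_delta_decode(*encoded) == guideposts
```

When the guideposts are evenly spaced, the second-order deltas are mostly 
zero, so each of them can be stored in very few bits.
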
 # We should split guideposts into chunks and always encode/decode a chunk as 
a whole, while allowing random access across chunks. In this way, we can 
cache/fetch only part of the guideposts of a table, which facilitates 
tenant/view-specific queries.
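
A minimal sketch of the chunking idea, combined with the binary search this 
issue asks for. The class name, method name, chunk size and in-memory layout 
are all hypothetical. Each chunk is prefix-encoded as a whole, the first key 
of every chunk is kept unencoded, and a lookup binary-searches those first 
keys so that it usually decodes a single chunk.

```python
from bisect import bisect_left, bisect_right

def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def encode_chunk(keys):
    """Prefix-encode one chunk as a list of (shared_len, suffix) pairs."""
    out, prev = [], b""
    for key in keys:
        shared = common_prefix_len(prev, key)
        out.append((shared, key[shared:]))
        prev = key
    return out

def decode_chunk(chunk):
    """Decode a whole chunk back into its sorted keys."""
    keys, prev = [], b""
    for shared, suffix in chunk:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

class ChunkedGuideposts:
    def __init__(self, sorted_keys, chunk_size=4):
        self.first_keys = []   # unencoded index, one entry per chunk
        self.chunks = []       # encoded chunks, decoded only on demand
        for i in range(0, len(sorted_keys), chunk_size):
            part = sorted_keys[i:i + chunk_size]
            self.first_keys.append(part[0])
            self.chunks.append(encode_chunk(part))

    def first_at_or_after(self, start_key):
        """Smallest guidepost >= start_key, or None if there is none."""
        # Last chunk whose first key is <= start_key (or chunk 0).
        ci = max(bisect_right(self.first_keys, start_key) - 1, 0)
        while ci < len(self.chunks):
            keys = decode_chunk(self.chunks[ci])
            j = bisect_left(keys, start_key)
            if j < len(keys):
                return keys[j]
            ci += 1  # start_key is past this chunk; try the next one
        return None

gps = ChunkedGuideposts([b"a1", b"a2", b"a3", b"b1", b"b2", b"c1"], chunk_size=2)
assert gps.first_at_or_after(b"b1") == b"b1"
```

A lookup decodes at most two chunks, so a tenant- or view-specific query 
would only need the chunks that overlap its key range.
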

> Perform binary search on guideposts during query compilation
> ------------------------------------------------------------
>
>                 Key: PHOENIX-4594
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4594
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: James Taylor
>            Assignee: Bin Shi
>            Priority: Major
>         Attachments: PHOENIX-4594-0913.patch, PHOENIX-4594_0917.patch, 
> PHOENIX-4594_0918.patch
>
>
> If there are many guideposts, performance will suffer during query 
> compilation because we do a linear search of the guideposts to find the 
> intersection with the scan ranges. Instead, in 
> BaseResultIterators.getParallelScans() we should populate an array of 
> guideposts and perform a binary search. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
