[
https://issues.apache.org/jira/browse/PHOENIX-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664645#comment-16664645
]
Bin Shi commented on PHOENIX-4594:
----------------------------------
[~lhofhansl], let me continue the discussion of prefix encoding.
I evaluated the benefit of compressing guideposts with prefix encoding. You can
find details at
[https://docs.google.com/document/d/1NIS65g-CKY5HEmUkQZvCbHFzUykLb8KPHX-9i65XVwk/edit?ts=5bd23e7c#.]
Below is a summary.
h1. *Summary of Evaluation*
h2. *Summary*
I evaluated the benefit mainly on the following typical types of data:
# Case 1: Primary Key is Sequence in INT (4 Bytes)
## When GUIDEPOSTS_WIDTH is 100MB, even in the ideal case, the data size
actually increased by 7.14% after compression.
## When GUIDEPOSTS_WIDTH is 10MB, even in the ideal case, the data size
actually increased by 3.6% after compression.
# Case 2: Primary Key is Sequence in BIGINT (8 Bytes)
## When GUIDEPOSTS_WIDTH is 100MB, in the ideal case, the data size was reduced
by 6.25% after compression.
## When GUIDEPOSTS_WIDTH is 10MB, in the ideal case, the data size was reduced
by 9.4% after compression.
# Case 3: Real Data From Platform Team
With the data known so far, the lower bound of the size reduction from prefix
encoding is roughly in the range of 10% to 45%. I’ll keep refining this
calculation as I learn more about the real data.
# Case 4: Primary Key is Reverse URL
This is a typical use case for BigTable/HBase, though Salesforce might not have
it. I don’t have real data for this case, but intuitively, this might be one of
the typical cases where prefix encoding achieves the most benefit.
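To make the contrast between the cases concrete, here is a minimal sketch of prefix encoding over sorted keys. The entry layout (one byte for the shared-prefix length, one byte for the suffix length, then the suffix) and the baseline (one length byte plus the key) are assumptions chosen for illustration; they are not Phoenix's actual guidepost format, and the helper names are hypothetical. The sketch only shows the mechanism by which short, widely spaced integer keys can grow while long keys with a common prefix shrink; it does not reproduce the percentages above.

```python
def prefix_encode(keys):
    """Prefix-encode sorted byte-string keys (each shorter than 256 bytes).

    Assumed entry layout (illustrative only): 1 byte shared-prefix length,
    1 byte suffix length, then the suffix bytes.
    """
    out = bytearray()
    prev = b""
    for key in keys:
        # Count leading bytes shared with the previous key.
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        suffix = key[shared:]
        out.append(shared)
        out.append(len(suffix))
        out += suffix
        prev = key
    return bytes(out)


def prefix_decode(data):
    """Inverse of prefix_encode: rebuild each key from the previous one."""
    keys = []
    prev = b""
    pos = 0
    while pos < len(data):
        shared, suffix_len = data[pos], data[pos + 1]
        pos += 2
        key = prev[:shared] + data[pos:pos + suffix_len]
        pos += suffix_len
        keys.append(key)
        prev = key
    return keys


def size_change(keys):
    """Relative size change versus an assumed baseline of 1 length byte + key.

    Positive means the encoded form is larger than the baseline.
    """
    raw = sum(1 + len(k) for k in keys)
    encoded = len(prefix_encode(keys))
    return (encoded - raw) / raw


if __name__ == "__main__":
    # Widely spaced 4-byte big-endian integers share few leading bytes.
    int_keys = [(i * 100000).to_bytes(4, "big") for i in range(1000)]
    # Reverse-URL style keys share a long common prefix.
    url_keys = [f"com.example.www/page/{i:05d}".encode() for i in range(1000)]
    print(f"INT keys: {size_change(int_keys):+.1%}")
    print(f"URL keys: {size_change(url_keys):+.1%}")
```

With these assumptions, the two bytes of per-entry bookkeeping outweigh the few shared bytes of a 4-byte key, while reverse-URL keys save most of their length.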
h2. *Takeaway*
# We should allow customers to choose among different compression algorithms or
encoding schemes, and make the choice configurable.
Obviously, Case 1 and Case 2 are negative cases. As Jacob pointed out,
double-delta encoding should be used here. Even for Case 3 and Case 4, prefix
encoding might not be the best tradeoff between performance and
compression ratio.
# We should split guideposts into chunks and always encode/decode a chunk as
a whole while allowing random access across chunks. In this way, we can
cache/fetch only part of a table's guideposts and facilitate tenant/view-specific
queries.
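The two takeaways can be sketched together as follows. This is an illustration under stated assumptions, not the patch's design: `ChunkedGuideposts`, `dod_encode`, and `dod_decode` are hypothetical names, keys are modeled as plain integers, and double-delta is taken to mean delta-of-delta encoding. The codec is passed in as a pair of functions to stand in for a configurable encoding scheme, and a lookup binary-searches the uncompressed first keys and then decodes only the one chunk that can contain the key.

```python
from bisect import bisect_right


def dod_encode(values):
    """Delta-of-delta: store the change in the gap between successive values.

    Evenly spaced keys encode to zeros after the first entry. A real codec
    would then pack these small integers with a variable-length encoding.
    """
    out = []
    prev, prev_delta = 0, 0
    for v in values:
        delta = v - prev
        out.append(delta - prev_delta)
        prev, prev_delta = v, delta
    return out


def dod_decode(encoded):
    """Inverse of dod_encode."""
    values = []
    prev, prev_delta = 0, 0
    for dd in encoded:
        prev_delta += dd
        prev += prev_delta
        values.append(prev)
    return values


class ChunkedGuideposts:
    """Sorted keys stored in chunks that are each encoded as a whole."""

    def __init__(self, keys, chunk_size, encode, decode):
        self._decode = decode
        self._first_keys = []  # uncompressed first key of each chunk
        self._chunks = []      # encoded chunks, decoded only on demand
        for start in range(0, len(keys), chunk_size):
            chunk = keys[start:start + chunk_size]
            self._first_keys.append(chunk[0])
            self._chunks.append(encode(chunk))

    def chunk_index(self, key):
        """Index of the only chunk that can hold `key`; -1 if key sorts first."""
        return bisect_right(self._first_keys, key) - 1

    def floor(self, key):
        """Largest stored key <= `key`, decoding a single chunk; None if absent."""
        idx = self.chunk_index(key)
        if idx < 0:
            return None
        chunk = self._decode(self._chunks[idx])
        return chunk[bisect_right(chunk, key) - 1]


if __name__ == "__main__":
    keys = list(range(0, 1000, 10))
    guideposts = ChunkedGuideposts(keys, 16, dod_encode, dod_decode)
    print(guideposts.chunk_index(165), guideposts.floor(165))
```

Because only the first key of each chunk stays uncompressed, a cache could hold or fetch individual chunks for the key range a tenant or view touches, at the cost of decoding one whole chunk per lookup.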
> Perform binary search on guideposts during query compilation
> ------------------------------------------------------------
>
> Key: PHOENIX-4594
> URL: https://issues.apache.org/jira/browse/PHOENIX-4594
> Project: Phoenix
> Issue Type: Improvement
> Reporter: James Taylor
> Assignee: Bin Shi
> Priority: Major
> Attachments: PHOENIX-4594-0913.patch, PHOENIX-4594_0917.patch,
> PHOENIX-4594_0918.patch
>
>
> If there are many guideposts, performance will suffer during query
> compilation because we do a linear search of the guideposts to find the
> intersection with the scan ranges. Instead, in
> BaseResultIterators.getParallelScans() we should populate an array of
> guideposts and perform a binary search.