[
https://issues.apache.org/jira/browse/PHOENIX-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879767#comment-16879767
]
Lars Hofhansl commented on PHOENIX-4925:
----------------------------------------
I was trying to re-read the design document, but somehow it does not open for
me.
Let me reiterate my main concern with stats: beyond how we encode them on the
client, it is *very* important that we can decouple the granularity stored
in SYSTEM.STATS from both the granularity cached on the client and the
granularity of the scanning.
For example, we might have 10 million guideposts for a table on the server; on
the client we group them into 10000 condensed guideposts (each one being the
result of grouping 1000 server guideposts together); then we might decide to
execute the query with 1000 scan tasks (each grouping 10 of the client
guideposts into a scan).
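Condensing like that is just a fold over consecutive guideposts. A minimal sketch (illustrative Python, not Phoenix code; the `Guidepost` shape and `condense` helper are assumptions for this example):

```python
# Hypothetical sketch: condensing fine-grained server guideposts into coarser
# client guideposts (and, with another pass, client guideposts into scans).
from dataclasses import dataclass

@dataclass
class Guidepost:
    key: bytes    # end row key of the key-space slice this guidepost covers
    rows: int     # estimated row count since the previous guidepost
    bytes_: int   # estimated byte size since the previous guidepost

def condense(guideposts, group_size):
    """Group every `group_size` consecutive guideposts into one coarser
    guidepost: keep the last key, sum the size estimates."""
    out = []
    for i in range(0, len(guideposts), group_size):
        chunk = guideposts[i:i + group_size]
        out.append(Guidepost(
            key=chunk[-1].key,
            rows=sum(g.rows for g in chunk),
            bytes_=sum(g.bytes_ for g in chunk),
        ))
    return out

# 10 server guideposts condensed into 5 client guideposts (group_size=2)
server = [Guidepost(b"%03d" % i, rows=100, bytes_=1000) for i in range(10)]
client = condense(server, 2)
print(len(client), client[0].rows)   # 5 200
```

The same `condense` applied again to `client` would yield scan boundaries, which is exactly the decoupling of the three granularities.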
My main point is that these are different:
* The guideposts stored in SYSTEM.STATS represent information about the
distribution of the keys. Nothing more.
* The guideposts on the client are a lower-resolution cache of that
information that can be used for quick size estimations.
* The number of scans represents the actual work and depends on the number of
regionservers involved in a query and how many resources we want to use.
That way we could have a configuration with guideposts as small as 10MB, (say)
no more than 1000 guideposts per table cached on the client, and (say) no
more than 10 scans per regionserver for a query. That would still work well up
to maybe 100TB tables.
Or we could have 100MB guideposts at the server, etc. That would work well up
to perhaps 1PB tables.
Then we can scan the server data for very precise size estimations and use the
cached client guideposts for a quick, coarse estimation.
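To put rough numbers on the 10MB configuration above (a back-of-the-envelope sketch; decimal units are an assumption for simplicity):

```python
# Back-of-the-envelope for the 10MB-guidepost, 100TB-table configuration.
# Decimal units (1 TB = 10**12 bytes) assumed for simplicity.
MB, GB, TB = 10**6, 10**9, 10**12

table_size = 100 * TB
guidepost_width = 10 * MB

server_guideposts = table_size // guidepost_width   # guideposts on the server
client_guideposts = 1000                            # cap cached on the client
data_per_client_gp = table_size // client_guideposts

print(server_guideposts)           # 10000000
print(data_per_client_gp // GB)    # each cached guidepost covers ~100 GB
```

So the server keeps 10 million fine guideposts while the client caches only 1000 coarse ones, each summarizing about 100GB of the table.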
But my main point remains: more important than how we store them is how we
can group them.
Storing them in a tree structure helps with that... and maybe that's already
part of the design (as I said, I can't open the doc).
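A minimal sketch of such a tree (illustrative Python, not Phoenix code; leaves here hold plain lists where Phoenix would hold prefix-encoded chunks, and the node layout is an assumption):

```python
# Hypothetical segment tree over guidepost chunks: leaves hold chunks of k
# guideposts, inner nodes hold summary info, so size estimates come from
# summaries without decoding leaves, and the chunk containing a given start
# key is found in O(log(n/k)).

class Node:
    def __init__(self, left=None, right=None, chunk=None):
        self.left, self.right, self.chunk = left, right, chunk
        if chunk is not None:              # leaf: summarize its chunk
            self.max_key = chunk[-1][0]
            self.rows = sum(r for _, r in chunk)
        else:                              # inner: combine child summaries
            self.max_key = right.max_key
            self.rows = left.rows + right.rows

def build(chunks):
    """Build the tree bottom-up by pairing nodes level by level."""
    nodes = [Node(chunk=c) for c in chunks]
    while len(nodes) > 1:
        nodes = [Node(nodes[i], nodes[i + 1]) if i + 1 < len(nodes)
                 else nodes[i]
                 for i in range(0, len(nodes), 2)]
    return nodes[0]

def find_chunk(node, start_key):
    """Descend to the leaf chunk that may contain start_key: O(log(n/k))."""
    while node.chunk is None:
        node = node.left if start_key <= node.left.max_key else node.right
    return node.chunk

# 16 guideposts as (key, row_count), partitioned into 8 chunks of k=2
gps = [("%02d" % i, 10) for i in range(16)]
chunks = [gps[i:i + 2] for i in range(0, 16, 2)]
root = build(chunks)
print(root.rows)                 # 160: total rows straight from the summary
print(find_chunk(root, "05"))    # [('04', 10), ('05', 10)]
```

Grouping then falls out naturally: any subtree's summary is already the condensed guidepost for that key range.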
> Use a Variant Segment tree to organize Guide Post Info
> ------------------------------------------------------
>
> Key: PHOENIX-4925
> URL: https://issues.apache.org/jira/browse/PHOENIX-4925
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Bin Shi
> Assignee: Bin Shi
> Priority: Major
> Attachments: PHOENIX-4925.phoenix-stats.0502.patch,
> PHOENIX-4925.phoenix-stats.0510.patch
>
> Time Spent: 13h 40m
> Remaining Estimate: 0h
>
> As reported, Query compilation (for the sample queries showed below),
> especially deriving estimation and generating parallel scans from guide
> posts, becomes much slower after we introduced Phoenix Stats.
> a. SELECT f1__c FROM MyCustomBigObject__b ORDER BY Pk1__c
> b. SELECT f1__c FROM MyCustomBigObject__b WHERE nonpk1__c = 'x' ORDER BY
> Pk1__c
> c. SELECT f1__c FROM MyCustomBigObject__b WHERE pk2__c = 'x' ORDER BY
> pk1__c,pk2__c
> d. SELECT f1__c FROM MyCustomBigObject__b WHERE pk1__c = 'x' AND nonpk1__c
> ORDER BY pk1__c,pk2__c
> e. SELECT f1__c FROM MyCustomBigObject__b WHERE pk__c >= 'd' AND pk__c <=
> 'm' OR pk__c >= 'o' AND pk__c <= 'x' ORDER BY pk__c // pk__c is the only
> column to make the primary key.
>
> By using prefix encoding for guide post info, we have to decode and traverse
> guide posts sequentially, which causes the time complexity of
> BaseResultIterators.getParallelScan(...) to be O(n), where n is the total
> count of guide posts.
> According to PHOENIX-2417, to reduce the footprint in the client cache and
> over transmission, prefix encoding is used as the in-memory and over-the-wire
> encoding for guide post info.
> We can use a Segment Tree to address both memory and performance concerns.
> The guide posts are partitioned into chunks of k guide posts each (k=1024?),
> each chunk is encoded by prefix encoding, and the encoded data forms a leaf
> node of the tree. Each inner node contains summary info (the count of rows,
> the data size) of the subtree rooted at that node.
> With this tree-like data structure, compared to the current data structure,
> the increased size (mainly coming from the n/k-1 inner nodes) is negligible.
> The time complexity for queries a, b, c can be reduced to O(m), where m is
> the total count of regions; the time complexity for "EXPLAIN" of queries a,
> b, c can be reduced to O(m) too, and if we support "EXPLAIN (ESTIMATE ONLY)",
> it can even be reduced to O(1). For queries d and e, the time complexity to
> find the start of the target scan ranges can be reduced to O(log(n/k)).
> The tree can also integrate AVL and B+ tree characteristics to support
> partial load/unload when interacting with the stats client cache.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)