[jira] [Commented] (PHOENIX-1453) Collect row counts per region in stats table

James Taylor (JIRA) Mon, 01 Dec 2014 13:35:16 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230500#comment-14230500
 ]


James Taylor commented on PHOENIX-1453:
---------------------------------------

That's true, [~lhofhansl]. When we do a raw scan like this, don't we know that 
each List<Cell> will be a new row?
{code}
innerScanner.nextRaw(results);
{code}
and then the other entry point is in our StatisticsScanner, for these two methods:
{code}
    public boolean next(List<Cell> result) throws IOException {
    public boolean next(List<Cell> result, int limit) throws IOException {
{code}

If we changed StatisticsCollector.updateStatistic(KeyValue kv) to pass in a 
List<Cell>, would we no longer need to do the key comparison, but could just 
increment a counter?
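
To make the two options concrete, here is a minimal sketch of both counting strategies. It uses plain byte[] row keys in place of HBase Cell/KeyValue, and the class and method names are hypothetical stand-ins, not the actual StatisticsCollector code. The per-row variant is only correct under the assumption being asked about above, namely that each non-empty List handed to us holds the cells of exactly one new row.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the stats collector; not the Phoenix class.
class RowCountSketch {
    private long rowCount = 0;
    private byte[] lastRowKey = null;

    // Per-cell entry point: each cell's row key has to be compared with
    // the previous one to detect a row boundary.
    void updatePerCell(byte[] rowKey) {
        if (lastRowKey == null || !Arrays.equals(lastRowKey, rowKey)) {
            rowCount++;
            lastRowKey = rowKey;
        }
    }

    // Per-row entry point: ASSUMES every non-empty list is one complete,
    // new row, so no key comparison is needed, only a counter increment.
    void updatePerRow(List<byte[]> cellsOfOneRow) {
        if (!cellsOfOneRow.isEmpty()) {
            rowCount++;
        }
    }

    long getRowCount() {
        return rowCount;
    }
}
```

If a row could ever be split across two calls (for example when a limit is passed), the per-row variant would overcount, which is the case the key comparison protects against.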

Having the row count is useful, but not essential. More important are the 
equal-width guideposts, since they indicate how much data will be scanned. The row 
count would be used to control the optimization we do for a LIMIT query. We 
currently estimate the row size based on the schema, multiply by the LIMIT and 
if it's estimated to be less than one region's worth then we run the query 
serially (see ScanPlan.isSerial()). Having the row count would let us estimate 
the average row size more accurately. Maybe there's a better way to do that? Or 
maybe it's fine as-is, since it's kind of squishy already. I'd guess that we'd 
use the row count for other optimizations down the road, but I'm not positive.
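
As a rough illustration of the LIMIT heuristic described above, and of how a row count could feed into it, here is a minimal sketch. It is not the actual ScanPlan.isSerial() code; the class, method, and parameter names are hypothetical, and the fallback to a schema-based estimate when no row count is available is an assumption.

```java
// Hypothetical sketch of the heuristic; not the actual ScanPlan.isSerial() code.
class SerialLimitSketch {

    // Average row size from collected stats (bytes in region / rows in region),
    // falling back to the schema-based estimate when no row count is available.
    static long avgRowSize(long regionBytes, long rowCount, long schemaEstimate) {
        return rowCount > 0 ? Math.max(1, regionBytes / rowCount) : schemaEstimate;
    }

    // Run serially when LIMIT rows are estimated to amount to less than
    // one region's worth of data. Uses double to avoid long overflow.
    static boolean isSerial(long limit, long estimatedRowSize, long regionSizeBytes) {
        return (double) limit * estimatedRowSize < regionSizeBytes;
    }
}
```

With a row count, the estimatedRowSize argument would come from avgRowSize(...) rather than from the schema alone; the comparison itself stays the same.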

If we think it's important to have the row count and need to do the key 
comparison, it'd be good to get a realistic measure of the overhead for doing 
the key comparison. 


> Collect row counts per region in stats table
> --------------------------------------------
>
>                 Key: PHOENIX-1453
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1453
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: Phoenix-1453.patch, Phoenix-1453_1.patch, 
> Phoenix-1453_2.patch, Phoenix-1453_3.patch
>
>
> We currently collect guideposts per equal chunk, but we should also capture 
> row counts. Should we have a parallel array with the guideposts that count 
> rows per guidepost, or is it enough to have a per region count?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
