[ 
https://issues.apache.org/jira/browse/PHOENIX-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258950#comment-14258950
 ] 

ramkrishna.s.vasudevan commented on PHOENIX-1453:
-------------------------------------------------

[~giacomotaylor]
Thinking about this more deeply.
Since we have decided to go with rowCount as BIGINT instead of BIGINT[], for 
every guidepost entry for the region (per CF) we are going to persist the 
total number of rows. So in the single-CF case this rowCount is going to 
represent the total number of rows in that region, right? Correct me if I am 
wrong here. If we went with BIGINT[] we would capture exactly how many rows 
lie between guideposts, but that would not be the case with BIGINT.

So once we capture the total number of rows in the rowCount column and then 
combine the guideposts for a table while reading the STATS table entries, do 
you want the GuidePostsInfo arrays (the array of rowCount and the array of 
byteCount) to be equal in size to the list of guideposts?
Consider the example that you stated above:
(a, b, c) byteCount = 40, rowCount = 10 (region 1)
(g, h, i) byteCount = 80, rowCount = 20 (region 2)

Now when the PTableStats is formed by iterating over all the entries for all 
the regions, we would combine them like
(a, b, c, g, h, i), byteCountArr(40, 80), rowCountArr(10, 20). I would say this 
is fine, because adding up rowCountArr gives the total number of rows in that 
table.
In the above example you had stated that we could create a rowCountArr(10, 
10, 10, 20, 20, 20). But is this really correct? It would mean that between 
the guideposts a -> b we had 10 rows and between b -> c we had 10 rows.
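
To make the difference concrete, here is a minimal Java sketch using the 
example values above. The class and method names (RegionTotalsSketch, 
mergeGuidePosts, sum) are made up for illustration and are not part of the 
patch. It only shows that one total per region adds up to the table's row 
count, while repeating the region total for every guidepost does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Phoenix's actual stats classes.
final class RegionTotalsSketch {

    // Concatenate the guideposts of all regions in order.
    static List<String> mergeGuidePosts(List<List<String>> perRegion) {
        List<String> merged = new ArrayList<>();
        for (List<String> guidePosts : perRegion) {
            merged.addAll(guidePosts);
        }
        return merged;
    }

    // Add up all entries of a count array.
    static long sum(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> perRegion = Arrays.asList(
                Arrays.asList("a", "b", "c"),   // region 1
                Arrays.asList("g", "h", "i"));  // region 2

        long[] perRegionRowCount = {10, 20};                 // one entry per region
        long[] repeatedRowCount = {10, 10, 10, 20, 20, 20};  // region total repeated per guidepost

        System.out.println(mergeGuidePosts(perRegion)); // [a, b, c, g, h, i]
        System.out.println(sum(perRegionRowCount));     // 30, the real table total
        System.out.println(sum(repeatedRowCount));      // 90, overcounts by 3x
    }
}
```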

The other thing we could do, if we really need to flatten rowCountArr and 
byteCountArr, is to say that 
(a, b, c, g, h, i), byteCountArr(13, 13, 13, 26, 26, 26) and 
rowCountArr(3, 3, 3, 6, 6, 6). This would mean that we approximate the 
byte count and the row count per guidepost by the region's average, i.e. the 
region total divided by the number of guidepost entries in that region.
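
A minimal sketch of that averaging, again with made-up names 
(AveragedCountsSketch, flatten) and assuming plain integer division of each 
region total by the number of guideposts in that region:

```java
import java.util.Arrays;

// Hypothetical illustration only; not Phoenix's actual stats classes.
final class AveragedCountsSketch {

    // Spread each region's total evenly over its guideposts.
    // Integer division is used, so any remainder is dropped.
    static long[] flatten(long[] regionTotals, int[] guidePostsPerRegion) {
        int size = 0;
        for (int g : guidePostsPerRegion) {
            size += g;
        }
        long[] flattened = new long[size];
        int pos = 0;
        for (int r = 0; r < regionTotals.length; r++) {
            int guidePosts = guidePostsPerRegion[r];
            if (guidePosts == 0) {
                continue; // a region without guideposts contributes no entries
            }
            long average = regionTotals[r] / guidePosts;
            for (int i = 0; i < guidePosts; i++) {
                flattened[pos++] = average;
            }
        }
        return flattened;
    }

    public static void main(String[] args) {
        int[] guidePostsPerRegion = {3, 3};
        long[] rowCounts = flatten(new long[] {10, 20}, guidePostsPerRegion);
        long[] byteCounts = flatten(new long[] {40, 80}, guidePostsPerRegion);
        System.out.println(Arrays.toString(rowCounts));  // [3, 3, 3, 6, 6, 6]
        System.out.println(Arrays.toString(byteCounts)); // [13, 13, 13, 26, 26, 26]
    }
}
```

Note that with integer division the flattened row counts add up to 27 rather 
than 30, so the averaged arrays are only an approximation of the totals too.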

But if we really want to know the exact number of rows between the guideposts, 
then we have to track it as a BIGINT[] array.
Also, as discussed, in order to make this implementation better I created 
GuidePostsInfo and GuidePostsRegionInfo. GuidePostsRegionInfo would be used 
while actually writing the stats entries per region to the stats table. 
When we iterate over the stats table to create PTableStats, we create a 
GuidePostsInfo object whose combine() API combines the different 
GuidePostsRegionInfo objects into one GuidePostsInfo, which will have 
List<byte[]> gps, long[] rowCount, long[] byteCount. 
What do you think, [~giacomotaylor]?
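
For discussion, a rough sketch of the combine() idea. This is not the code in 
the patch: RegionInfo and TableInfo are simplified, hypothetical stand-ins 
for GuidePostsRegionInfo and GuidePostsInfo, and the sketch assumes the first 
layout described above (one rowCount and one byteCount entry per region).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the classes from the patch.
final class CombineSketch {

    // Stand-in for the per-region stats written to the stats table.
    static final class RegionInfo {
        final List<byte[]> guidePosts;
        final long rowCount;
        final long byteCount;

        RegionInfo(List<byte[]> guidePosts, long rowCount, long byteCount) {
            this.guidePosts = guidePosts;
            this.rowCount = rowCount;
            this.byteCount = byteCount;
        }
    }

    // Stand-in for the table-level view built while reading the stats table.
    static final class TableInfo {
        final List<byte[]> guidePosts = new ArrayList<>();
        long[] rowCount = new long[0];
        long[] byteCount = new long[0];

        // Append one region: all its guideposts, plus one total per count array.
        void combine(RegionInfo region) {
            guidePosts.addAll(region.guidePosts);
            rowCount = append(rowCount, region.rowCount);
            byteCount = append(byteCount, region.byteCount);
        }

        private static long[] append(long[] values, long value) {
            long[] grown = Arrays.copyOf(values, values.length + 1);
            grown[values.length] = value;
            return grown;
        }
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Builds the two-region example from this comment.
    static TableInfo example() {
        TableInfo table = new TableInfo();
        table.combine(new RegionInfo(Arrays.asList(key("a"), key("b"), key("c")), 10, 40));
        table.combine(new RegionInfo(Arrays.asList(key("g"), key("h"), key("i")), 20, 80));
        return table;
    }

    public static void main(String[] args) {
        TableInfo table = example();
        System.out.println(table.guidePosts.size());           // 6
        System.out.println(Arrays.toString(table.rowCount));   // [10, 20]
        System.out.println(Arrays.toString(table.byteCount));  // [40, 80]
    }
}
```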



> Collect row counts per region in stats table
> --------------------------------------------
>
>                 Key: PHOENIX-1453
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1453
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: Phoenix-1453.patch, Phoenix-1453_1.patch, 
> Phoenix-1453_2.patch, Phoenix-1453_3.patch, Phoenix-1453_7.patch, 
> Phoenix-1453_8.patch
>
>
> We currently collect guideposts per equal chunk, but we should also capture 
> row counts. Should we have a parallel array with the guideposts that count 
> rows per guidepost, or is it enough to have a per region count?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
