[ https://issues.apache.org/jira/browse/PHOENIX-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258950#comment-14258950 ]
ramkrishna.s.vasudevan commented on PHOENIX-1453:
-------------------------------------------------

[~giacomotaylor]
Thinking about this more deeply. Since we have decided to go with rowCount as BIGINT instead of BIGINT[], every guidepost entry for a region (per CF) will persist the total number of rows. So for the single-CF case this rowCount represents the total number of rows in that region, right? Correct me if I am wrong here. With BIGINT[] we would capture exactly how many rows lie between guideposts, but that is not the case with BIGINT.

So once we capture the total number of rows in the rowCount column, and we combine the guideposts for a table while reading the STATS table entries, how should we form the GuidePostsInfo, i.e. the array of rowCount and the array of byteCount? Do you want to create arrays that are equal in size to the list of guideposts?

Consider the example that you stated above:
(a, b, c) byteCount = 40, rowCount = 10 (region 1)
(g, h, i) byteCount = 80, rowCount = 20 (region 2)

Now when the PTableStats is formed by iterating over all the entries for all the regions, we would combine them as (a, b, c, g, h, i), byteCountArr(40, 80), rowCountArr(10, 20). I would say this is fine, because adding up rowCountArr gives the total number of rows in the table.

In the above example you had stated that we could create a rowCountArr(10, 10, 10, 20, 20, 20). But is this really correct? It would mean that between the guideposts a -> b we had 10 rows and between b -> c we had 10 rows.

The other thing we could do, if we really need to flatten rowCountArr and byteCountArr, is to say (a, b, c, g, h, i), byteCountArr(13, 13, 13, 26, 26, 26) and rowCountArr(3, 3, 3, 6, 6, 6). This would mean that we approximate the byte count and the row count per guidepost by dividing each region total by the number of guidepost entries in that region (integer division).
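The two ways of combining per-region entries discussed above can be sketched roughly as follows. This is only an illustration and not the code in the attached patches: the class and method names (GuidePostsCombineSketch, RegionStats, Combined, combinePerRegion, combineAveraged) are hypothetical, and guideposts are modelled as strings rather than byte[] to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from the Phoenix code base.
final class GuidePostsCombineSketch {

    /** One stats entry per region: its guideposts plus a single total per counter. */
    static final class RegionStats {
        final List<String> guidePosts;
        final long byteCount;
        final long rowCount;

        RegionStats(List<String> guidePosts, long byteCount, long rowCount) {
            this.guidePosts = guidePosts;
            this.byteCount = byteCount;
            this.rowCount = rowCount;
        }
    }

    /** Table-level view built from all region entries. */
    static final class Combined {
        final List<String> guidePosts = new ArrayList<>();
        final List<Long> byteCounts = new ArrayList<>();
        final List<Long> rowCounts = new ArrayList<>();
    }

    /** Keeps one count per region, so the counts sum to the exact table totals. */
    static Combined combinePerRegion(List<RegionStats> regions) {
        Combined out = new Combined();
        for (RegionStats r : regions) {
            out.guidePosts.addAll(r.guidePosts);
            out.byteCounts.add(r.byteCount);
            out.rowCounts.add(r.rowCount);
        }
        return out;
    }

    /** Emits one estimated count per guidepost: region total / number of guideposts. */
    static Combined combineAveraged(List<RegionStats> regions) {
        Combined out = new Combined();
        for (RegionStats r : regions) {
            int n = r.guidePosts.size();
            if (n == 0) {
                continue; // no guideposts to spread the region totals over
            }
            for (String gp : r.guidePosts) {
                out.guidePosts.add(gp);
                out.byteCounts.add(r.byteCount / n); // integer division, remainder is dropped
                out.rowCounts.add(r.rowCount / n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<RegionStats> regions = Arrays.asList(
                new RegionStats(Arrays.asList("a", "b", "c"), 40, 10),
                new RegionStats(Arrays.asList("g", "h", "i"), 80, 20));

        Combined perRegion = combinePerRegion(regions);
        Combined averaged = combineAveraged(regions);
        System.out.println(perRegion.guidePosts + " " + perRegion.rowCounts);
        System.out.println(averaged.guidePosts + " " + averaged.rowCounts);
    }
}
```

Note that the averaged form is lossy: because of integer division the per-guidepost row counts in this example sum to 27 rather than 30, whereas the per-region form preserves the totals exactly.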
But if we really want to know the exact number of rows between the guideposts, then we have to track it as a BIGINT[] array.

Also, as discussed, in order to improve this implementation I created GuidePostsInfo and GuidePostsRegionInfo. GuidePostsRegionInfo is used when actually writing the stats entries per region to the stats table. When we iterate over the stats table to create PTableStats, we create a GuidePostsInfo object whose combine() API merges the different GuidePostsRegionInfo instances into one GuidePostsInfo, which will have List<byte[]> gps, long[] rowCount, long[] byteCount.

What do you think, [~giacomotaylor]?

> Collect row counts per region in stats table
> --------------------------------------------
>
>                 Key: PHOENIX-1453
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1453
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: Phoenix-1453.patch, Phoenix-1453_1.patch, Phoenix-1453_2.patch, Phoenix-1453_3.patch, Phoenix-1453_7.patch, Phoenix-1453_8.patch
>
>
> We currently collect guideposts per equal chunk, but we should also capture
> row counts. Should we have a parallel array with the guideposts that count
> rows per guidepost, or is it enough to have a per region count?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)