[jira] [Commented] (PHOENIX-1453) Collect row counts per region in stats table

James Taylor (JIRA) Mon, 05 Jan 2015 00:15:07 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264326#comment-14264326
 ]


James Taylor commented on PHOENIX-1453:
---------------------------------------

Thanks for the patch, [~ramkrishna]. Here's some feedback:
- Thanks for adding this proto, as this is an improvement. Minor nit, but can 
you name these consistently as byteCount and rowCount?
{code}
+message PGuidePosts {
+  repeated bytes guidePostsList = 1;
+  optional int64 guidePostsByteCount = 2;
+  optional int64 rowCount = 3;
+}
{code}
- Why is this code passing through rowCount like this? It should only be adding 
a new row key guidepost and adjusting the byteCount. The row count is adjusted 
outside the loop.
{code}
+                if (gps.getSecond().addGuidePost(row, byteCount, 
gps.getSecond().getRowCount())) {
+                    gps.setFirst(0l);
+                }
{code}
- When would a merge sort for GuidePostsRegionInfo guideposts be required? 
Seems unnecessary and potentially expensive.
{code}
+    /**
+     * Combines the GuidePosts per region into one.
+     * @param oldInfo
+     */
+    public void combine(GuidePostsRegionInfo oldInfo) {
+        // FIXME: I don't think we need to do a merge sort here, as the keys 
won't be interleaved.
+        // We just need to concatenate them in the correct way.
+        this.guidePosts = ImmutableList.copyOf(Iterators.mergeSorted(
+                ImmutableList.of(this.getGuidePosts().iterator(), 
oldInfo.getGuidePosts().iterator()),
+                Bytes.BYTES_COMPARATOR));
+        this.byteCount += oldInfo.getByteCount();
+        this.keyByteSize += oldInfo.keyByteSize;
+        this.rowCount += oldInfo.getRowCount();
+    }
{code}
- Remove all System.out.println calls please (as I mentioned before)...
{code}
+            System.out.println("The bytecount is "+gp.getByteCount());
{code}
- This is kind of ugly. Can you create a static method on GuidePostsRegionInfo 
instead that takes in rowCount as well as the bytes ptr to the guideposts? 
Also, can you rename the toBytes() and fromBytes() methods, as we store more 
than just the guide post values now? How about serializeGuidePosts() and 
deserializeGuidePosts() and the deserialize would just return a List<byte[]> 
instead of instantiating a GuidePostsRegionInfo?
{code}
+                GuidePostsRegionInfo newGPRegionInfo = 
GuidePostsRegionInfo.fromBytes(valuePtr.get(),
+                        valuePtr.getOffset(), valuePtr.getLength());
+                newGPRegionInfo.addRowCount(rowCount);
{code}
- You need to get the value of GUIDE_POSTS_WIDTH for each row too and 
accumulate it just like row count. Add this to the static constructor for 
GuidePostsRegionInfo mentioned above as well. We should be summing up both 
rowCount and byteCount across all regions (per column family).
{code}
         s.addColumn(QueryConstants.DEFAULT_COLUMN_FAMILY_BYTES, 
PhoenixDatabaseMetaData.GUIDE_POSTS_BYTES);
+        s.addColumn(QueryConstants.DEFAULT_COLUMN_FAMILY_BYTES, 
PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT_BYTES);
{code}
- This code is more efficient if you do a put first and then only a combine on 
the old value returned from put. Otherwise, you're doing two gets on the map 
(one for the get and one for the put). Look at the way the code was written 
before.
{code}
+            if (cfName != null) {
+                GuidePostsRegionInfo newGPRegionInfo = 
GuidePostsRegionInfo.fromBytes(valuePtr.get(),
+                        valuePtr.getOffset(), valuePtr.getLength());
+                newGPRegionInfo.addRowCount(rowCount);
+                GuidePostsInfo guidePostsInfo = guidePostsPerCf.get(cfName);
+                if (guidePostsInfo == null) {
+                    guidePostsPerCf.put(cfName, new 
GuidePostsInfo(newGPRegionInfo.getByteCount() ,
+                            newGPRegionInfo.getGuidePosts(), 
newGPRegionInfo.getRowCount()));
+                } else {
+                    if (!newGPRegionInfo.getGuidePosts().isEmpty()) {
+                        guidePostsInfo.combine(newGPRegionInfo);
+                    }
+                }
+            }
{code}
- Please ensure there are tests to check the byteCount value passed over in 
PTableStats for each column family, as based on the above changes, I don't 
think it would have been working correctly. In particular, if there were 
multiple regions, I think the byteCount would be off (as we want it to be the 
sum of all of the regions now).
- Why is this test explicitly setting KEEP_DELETED_CELLS=false, as that's the 
default now?
{code}
+                "CREATE TABLE " + STATS_TEST_TABLE_NAME_NEW
+                        + "(k VARCHAR PRIMARY KEY, a.v INTEGER, b.v INTEGER, 
c.v INTEGER NULL, d.v INTEGER NULL) "
+                        + HColumnDescriptor.KEEP_DELETED_CELLS + "=" + 
Boolean.FALSE);
{code}
- Minor nit, but this comment is a bit confusing. How about this: CF A has 
double the byte count because it contains both column qualifier "V" and default 
qualifier "_0".
{code}
+        // CF A alone has double the bytecount because one is due to the 
default qualifier _0 and
+        // other due to the CF A
{code}


> Collect row counts per region in stats table
> --------------------------------------------
>
>                 Key: PHOENIX-1453
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1453
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: Phoenix-1453.patch, Phoenix-1453_1.patch, 
> Phoenix-1453_10.patch, Phoenix-1453_13.patch, Phoenix-1453_15.patch, 
> Phoenix-1453_2.patch, Phoenix-1453_3.patch, Phoenix-1453_7.patch, 
> Phoenix-1453_8.patch
>
>
> We currently collect guideposts per equal chunk, but we should also capture 
> row counts. Should we have a parallel array with the guideposts that count 
> rows per guidepost, or is it enough to have a per region count?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (PHOENIX-1453) Collect row counts per region in stats table

Reply via email to