[ https://issues.apache.org/jira/browse/PHOENIX-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264326#comment-14264326 ]
James Taylor commented on PHOENIX-1453: --------------------------------------- Thanks for the patch, [~ramkrishna]. Here's some feedback: - Thanks for adding this proto, as this is an improvement. Minor nit, but can you name these consistently as byteCount and rowCount? {code} +message PGuidePosts { + repeated bytes guidePostsList = 1; + optional int64 guidePostsByteCount = 2; + optional int64 rowCount = 3; +} {code} - Why is this code passing through rowCount like this? It should only be adding a new row key guidepost and adjusting the byteCount. The row count is adjusted outside the loop. {code} + if (gps.getSecond().addGuidePost(row, byteCount, gps.getSecond().getRowCount())) { + gps.setFirst(0l); + } {code} - When would a merge sort for GuidePostsRegionInfo guideposts be required? Seems unnecessary and potentially expensive. {code} + /** + * Combines the GuidePosts per region into one. + * @param oldInfo + */ + public void combine(GuidePostsRegionInfo oldInfo) { + // FIXME: I don't think we need to do a merge sort here, as the keys won't be interleaved. + // We just need to concatenate them in the correct way. + this.guidePosts = ImmutableList.copyOf(Iterators.mergeSorted( + ImmutableList.of(this.getGuidePosts().iterator(), oldInfo.getGuidePosts().iterator()), + Bytes.BYTES_COMPARATOR)); + this.byteCount += oldInfo.getByteCount(); + this.keyByteSize += oldInfo.keyByteSize; + this.rowCount += oldInfo.getRowCount(); + } {code} - Remove all System.out.println calls please (as I mentioned before)... {code} + System.out.println("The bytecount is "+gp.getByteCount()); {code} - This is kind of ugly. Can you create a static method on GuidePostsRegionInfo instead that takes in rowCount as well as the bytes ptr to the guideposts? Also, can you rename the toBytes() and fromBytes() methods, as we store more than just the guide post values now? How about serializeGuidePosts() and deserializeGuidePosts() and the deserialize would just return a List<byte[]> instead of instantiating a GuidePostsRegionInfo? {code} + GuidePostsRegionInfo newGPRegionInfo = GuidePostsRegionInfo.fromBytes(valuePtr.get(), + valuePtr.getOffset(), valuePtr.getLength()); + newGPRegionInfo.addRowCount(rowCount); {code} - You need to get the value of GUIDE_POSTS_WIDTH for each row too and accumulate it just like row count. Add this to the static constructor for GuidePostsRegionInfo mentioned above as well. We should be summing up both rowCount and byteCount across all regions (per column family). {code} s.addColumn(QueryConstants.DEFAULT_COLUMN_FAMILY_BYTES, PhoenixDatabaseMetaData.GUIDE_POSTS_BYTES); + s.addColumn(QueryConstants.DEFAULT_COLUMN_FAMILY_BYTES, PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT_BYTES); {code} - This code is more efficient if you do a put first and then only a combine on the old value returned from put. Otherwise, you're doing two gets on the map (one for the get and one for the put). Look at the way the code was written before. {code} + if (cfName != null) { + GuidePostsRegionInfo newGPRegionInfo = GuidePostsRegionInfo.fromBytes(valuePtr.get(), + valuePtr.getOffset(), valuePtr.getLength()); + newGPRegionInfo.addRowCount(rowCount); + GuidePostsInfo guidePostsInfo = guidePostsPerCf.get(cfName); + if (guidePostsInfo == null) { + guidePostsPerCf.put(cfName, new GuidePostsInfo(newGPRegionInfo.getByteCount() , + newGPRegionInfo.getGuidePosts(), newGPRegionInfo.getRowCount())); + } else { + if (!newGPRegionInfo.getGuidePosts().isEmpty()) { + guidePostsInfo.combine(newGPRegionInfo); + } + } + } {code} - Please ensure there are tests to check the byteCount value passed over in PTableStats for each column family, as based on the above changes, I don't think it would have been working correctly. In particular, if there were multiple regions, I think the byteCount would be off (as we want it to be the sum of all of the regions now). - Why is this test explicitly setting KEEP_DELETED_CELLS=false, as that's the default now? {code} + "CREATE TABLE " + STATS_TEST_TABLE_NAME_NEW + + "(k VARCHAR PRIMARY KEY, a.v INTEGER, b.v INTEGER, c.v INTEGER NULL, d.v INTEGER NULL) " + + HColumnDescriptor.KEEP_DELETED_CELLS + "=" + Boolean.FALSE); {code} - Minor nit, but this comment is a bit confusing. How about this: CF A has double the byte count because it contains both column qualifier "V" and default qualifier "_0". {code} + // CF A alone has double the bytecount because one is due to the default qualifier _0 and + // other due to the CF A {code} > Collect row counts per region in stats table > -------------------------------------------- > > Key: PHOENIX-1453 > URL: https://issues.apache.org/jira/browse/PHOENIX-1453 > Project: Phoenix > Issue Type: Sub-task > Reporter: James Taylor > Assignee: ramkrishna.s.vasudevan > Attachments: Phoenix-1453.patch, Phoenix-1453_1.patch, > Phoenix-1453_10.patch, Phoenix-1453_13.patch, Phoenix-1453_15.patch, > Phoenix-1453_2.patch, Phoenix-1453_3.patch, Phoenix-1453_7.patch, > Phoenix-1453_8.patch > > > We currently collect guideposts per equal chunk, but we should also capture > row counts. Should we have a parallel array with the guideposts that count > rows per guidepost, or is it enough to have a per region count? -- This message was sent by Atlassian JIRA (v6.3.4#6332)