[ https://issues.apache.org/jira/browse/PHOENIX-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229339#comment-14229339 ]
James Taylor commented on PHOENIX-1453: --------------------------------------- Thanks for the patch, [~ramkrishna]. Here's some feedback: {code} + // Adding both - Hope it maintains the order + oldInfo.getRowCounts().addAll(this.getRowCounts()); + this.rowCounts = ImmutableList.copyOf(oldInfo.getRowCounts()); Don't mutate the oldInfo.getRowCounts(), instead create a new list. + // TODO : what should be time that should be specified here? + addColumnsIfNotExists(metaConnection, + SYSTEM_CATALOG_SCHEMA + ".\"" + SYSTEM_STATS_TABLE, ignore.getTable().getTimeStamp(), + PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT); {code} Use MIN_SYSTEM_TABLE_TIMESTAMP for the timestamp, as then the SYSTEM.CATALOG table will have that as it's timestamp and other callers would get the NewerTableFoundException and not try the upgrade. {code} + public GuidePostsInfo(long byteCount, List<byte[]> guidePosts, List<Long> rowCounts) { + this.byteCount = byteCount; + this.guidePosts = ImmutableList.copyOf(guidePosts); + int size = 0; + for (byte[] key : guidePosts) { + size += key.length; + } + this.keyByteSize = size; + this.rowCounts = rowCounts; {code} Make immutable copy for rowCounts in this constructor. {code} public GuidePostsInfo(long byteCount, List<byte[]> guidePosts) { {code} Is this constructor used anymore? If not, please remove. If yes, then call the new three arg one here. {code} @@ -194,14 +196,26 @@ public class StatisticsCollector { @SuppressWarnings("deprecation") public void updateStatistic(KeyValue kv) { - ImmutableBytesPtr cfKey = new ImmutableBytesPtr(kv.getBuffer(), kv.getFamilyOffset(), kv.getFamilyLength()); + ImmutableBytesPtr cfKey = new ImmutableBytesPtr(kv.getFamilyArray(), kv.getFamilyOffset(), kv.getFamilyLength()); + Pair<ImmutableBytesPtr, Long> rowKey = famVsRow.get(cfKey); familyMap.put(cfKey, true); - + if (rowKey == null + || !Bytes.equals(kv.getRowArray(), kv.getRowOffset(), kv.getRowLength(), rowKey.getFirst().get(), + rowKey.getFirst().getOffset(), rowKey.getFirst().getLength())) { + if (rowKey != null) { + rowKey.getFirst().set(kv.getRowArray(), kv.getRowOffset(), kv.getRowLength()); + long rowCount = rowKey.getSecond(); + rowKey.setSecond(++rowCount); + } else { + rowKey = new Pair<ImmutableBytesPtr, Long>(new ImmutableBytesPtr(kv.getRowArray(), kv.getRowOffset(), + kv.getRowLength()), 1l); + } + famVsRow.put(cfKey, rowKey); + } {code} You've already got rowCount in the GuidePostsInfo class, so it seems a bit messy to track this in a different map. How about just tracking both the number of bytes here and the row count in this routine? We're currently only tracking number of bytes in the Long in Pair<Long,GuidePostsInfo>. Instead create a new class (GuidePostsState?) with both a long rowCount and a long byteCount. Then pass through both the rowCount and the byteCount here like this: gps.getSecond().addGuidePost(row, byteCount, rowCount). {code} public static GuidePostsInfo fromBytes(byte[] buf, int offset, int l) { try { @@ -99,7 +129,7 @@ public class GuidePostsInfo { } } } - return new GuidePostsInfo(byteCount, guidePosts); + return new GuidePostsInfo(byteCount, guidePosts, Collections.<Long>emptyList()); } catch (IOException e) { throw new RuntimeException(e); // not possible } finally { @@ -122,9 +152,8 @@ public class GuidePostsInfo { // Serialize the number of bytes traversed, number of key bytes in the region, // number of guideposts for that family, <<guidepostSize><guidePostsArray>,<guidePostsSize> <guidePostArray>> // We will lose precision here? - TrustedByteArrayOutputStream bs = new TrustedByteArrayOutputStream( - (int)(Bytes.SIZEOF_LONG + Bytes.SIZEOF_LONG + Bytes.SIZEOF_INT + this.keyByteSize + (WritableUtils - .getVIntSize(size) * size))); + TrustedByteArrayOutputStream bs = new TrustedByteArrayOutputStream((int)(Bytes.SIZEOF_LONG + Bytes.SIZEOF_LONG + + Bytes.SIZEOF_INT + this.keyByteSize + (WritableUtils.getVIntSize(size) * size))); {code} We'll need two toBytes methods here, one to serialize the guidepost keys and one to serialize the parallel row counts. How about getGuidePostKeysBytes and getRowCountsBytes? The the fromBytes can take both the rowCounts array/offset/length and the keys array/offset/length. Then the combine method can combine them both. {code} + if (!valuesSet) { + tableNameLength = tableNameBytes.length + 1; + cfOffset = current.getRowOffset() + tableNameLength; + cfLength = getVarCharLength(current.getRowArray(), cfOffset, current.getRowLength() + - tableNameLength); + ptr.set(current.getRowArray(), cfOffset, cfLength); + valuesSet = true; + } + cfName = ByteUtil.copyKeyBytesIfNecessary(ptr); + if (Bytes.equals(current.getQualifierArray(), current.getQualifierOffset(), + current.getQualifierLength(), PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT_BYTES, 0, + PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT_BYTES.length)) { + PhoenixArray arr = (PhoenixArray)PDataType.LONG_ARRAY.toObject(current.getValueArray(), + current.getValueOffset(), current.getValueLength()); + rowCountArr = new long[arr.getDimensions()]; + try { + rowCountArr = (long[])arr.getArray(); + } catch (SQLException e) { + //should not happen here + } + } else { {code} How about just an else if for GUIDE_POSTS_ROW_COUNT_BYTES? We'd want the timeStamp to be the biggest timestamp, including the cell for GUIDE_POSTS_ROW_COUNT_BYTES. > Collect row counts per region in stats table > -------------------------------------------- > > Key: PHOENIX-1453 > URL: https://issues.apache.org/jira/browse/PHOENIX-1453 > Project: Phoenix > Issue Type: Sub-task > Reporter: James Taylor > Assignee: ramkrishna.s.vasudevan > Attachments: Phoenix-1453.patch, Phoenix-1453_1.patch, > Phoenix-1453_2.patch, Phoenix-1453_3.patch > > > We currently collect guideposts per equal chunk, but we should also capture > row counts. Should we have a parallel array with the guideposts that count > rows per guidepost, or is it enough to have a per region count? -- This message was sent by Atlassian JIRA (v6.3.4#6332)