[jira] [Commented] (PHOENIX-1453) Collect row counts per region in stats table

James Taylor (JIRA) Sun, 30 Nov 2014 18:17:32 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229339#comment-14229339
 ]


James Taylor commented on PHOENIX-1453:
---------------------------------------

Thanks for the patch, [~ramkrishna]. Here's some feedback:
{code}
+        // Adding both - Hope it maintains the order
+        oldInfo.getRowCounts().addAll(this.getRowCounts());
+        this.rowCounts = ImmutableList.copyOf(oldInfo.getRowCounts());

Don't mutate the oldInfo.getRowCounts(), instead create a new list.

+                                // TODO : what should be time that should be 
specified here?
+                                addColumnsIfNotExists(metaConnection, 
+                                        SYSTEM_CATALOG_SCHEMA + ".\"" + 
SYSTEM_STATS_TABLE, ignore.getTable().getTimeStamp(), 
+                                        
PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT);
{code}

Use MIN_SYSTEM_TABLE_TIMESTAMP for the timestamp, as then the SYSTEM.CATALOG 
table will have that as it's timestamp and other callers would get the 
NewerTableFoundException and not try the upgrade.

{code}
+    public GuidePostsInfo(long byteCount, List<byte[]> guidePosts, List<Long> 
rowCounts) {
+        this.byteCount = byteCount;
+        this.guidePosts = ImmutableList.copyOf(guidePosts);
+        int size = 0;
+        for (byte[] key : guidePosts) {
+            size += key.length;
+        }
+        this.keyByteSize = size;
+        this.rowCounts = rowCounts;
{code}

Make immutable copy for rowCounts in this constructor.

{code}
    public GuidePostsInfo(long byteCount, List<byte[]> guidePosts) {
{code}

Is this constructor used anymore? If not, please remove. If yes, then call the 
new three arg one here.

{code}
@@ -194,14 +196,26 @@ public class StatisticsCollector {
     
     @SuppressWarnings("deprecation")
     public void updateStatistic(KeyValue kv) {
-        ImmutableBytesPtr cfKey = new ImmutableBytesPtr(kv.getBuffer(), 
kv.getFamilyOffset(), kv.getFamilyLength());
+        ImmutableBytesPtr cfKey = new ImmutableBytesPtr(kv.getFamilyArray(), 
kv.getFamilyOffset(), kv.getFamilyLength());
+        Pair<ImmutableBytesPtr, Long> rowKey = famVsRow.get(cfKey);
         familyMap.put(cfKey, true);
-        
+        if (rowKey == null
+                || !Bytes.equals(kv.getRowArray(), kv.getRowOffset(), 
kv.getRowLength(), rowKey.getFirst().get(),
+                        rowKey.getFirst().getOffset(), 
rowKey.getFirst().getLength())) {
+            if (rowKey != null) {
+                rowKey.getFirst().set(kv.getRowArray(), kv.getRowOffset(), 
kv.getRowLength());
+                long rowCount = rowKey.getSecond();
+                rowKey.setSecond(++rowCount);
+            } else {
+                rowKey = new Pair<ImmutableBytesPtr, Long>(new 
ImmutableBytesPtr(kv.getRowArray(), kv.getRowOffset(),
+                        kv.getRowLength()), 1l);
+            }
+            famVsRow.put(cfKey, rowKey);
+        }
{code}
You've already got rowCount in the GuidePostsInfo class, so it seems a bit 
messy to track this in a different map. How about just tracking both the number 
of bytes here and the row count in this routine? We're currently only tracking 
number of bytes in the Long in Pair<Long,GuidePostsInfo>. Instead create a new 
class (GuidePostsState?) with both a long rowCount and a long byteCount. Then 
pass through both the rowCount and the byteCount here like this: 
gps.getSecond().addGuidePost(row, byteCount, rowCount). 

{code}
     public static GuidePostsInfo fromBytes(byte[] buf, int offset, int l) {
         try {
@@ -99,7 +129,7 @@ public class GuidePostsInfo {
                             }
                         }
                     }
-                    return new GuidePostsInfo(byteCount, guidePosts);
+                    return new GuidePostsInfo(byteCount, guidePosts, 
Collections.<Long>emptyList());
                 } catch (IOException e) {
                     throw new RuntimeException(e); // not possible
                 } finally {
@@ -122,9 +152,8 @@ public class GuidePostsInfo {
         // Serialize the number of bytes traversed, number of key bytes in the 
region,
         // number of guideposts for that family, 
<<guidepostSize><guidePostsArray>,<guidePostsSize> <guidePostArray>>
         // We will lose precision here?
-        TrustedByteArrayOutputStream bs = new TrustedByteArrayOutputStream(
-                (int)(Bytes.SIZEOF_LONG + Bytes.SIZEOF_LONG + Bytes.SIZEOF_INT 
+ this.keyByteSize + (WritableUtils
-                        .getVIntSize(size) * size)));
+        TrustedByteArrayOutputStream bs = new 
TrustedByteArrayOutputStream((int)(Bytes.SIZEOF_LONG + Bytes.SIZEOF_LONG
+                + Bytes.SIZEOF_INT + this.keyByteSize + 
(WritableUtils.getVIntSize(size) * size)));
{code}

We'll need two toBytes methods here, one to serialize the guidepost keys and 
one to serialize the parallel row counts. How about getGuidePostKeysBytes and 
getRowCountsBytes? The the fromBytes can take both the rowCounts 
array/offset/length and the keys array/offset/length. Then the combine method 
can combine them both.

{code}
+                if (!valuesSet) {
+                    tableNameLength = tableNameBytes.length + 1;
+                    cfOffset = current.getRowOffset() + tableNameLength;
+                    cfLength = getVarCharLength(current.getRowArray(), 
cfOffset, current.getRowLength()
+                            - tableNameLength);
+                    ptr.set(current.getRowArray(), cfOffset, cfLength);
+                    valuesSet = true;
+                }
+                cfName = ByteUtil.copyKeyBytesIfNecessary(ptr);
+                if (Bytes.equals(current.getQualifierArray(), 
current.getQualifierOffset(),
+                        current.getQualifierLength(), 
PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT_BYTES, 0,
+                        
PhoenixDatabaseMetaData.GUIDE_POSTS_ROW_COUNT_BYTES.length)) {
+                    PhoenixArray arr = 
(PhoenixArray)PDataType.LONG_ARRAY.toObject(current.getValueArray(),
+                            current.getValueOffset(), 
current.getValueLength());
+                    rowCountArr = new long[arr.getDimensions()];
+                    try {
+                        rowCountArr = (long[])arr.getArray();
+                    } catch (SQLException e) {
+                        //should not happen here
+                    }
+                } else {
{code}
How about just an else if for GUIDE_POSTS_ROW_COUNT_BYTES? We'd want the 
timeStamp to be the biggest timestamp, including the cell for 
GUIDE_POSTS_ROW_COUNT_BYTES.


> Collect row counts per region in stats table
> --------------------------------------------
>
>                 Key: PHOENIX-1453
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1453
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: Phoenix-1453.patch, Phoenix-1453_1.patch, 
> Phoenix-1453_2.patch, Phoenix-1453_3.patch
>
>
> We currently collect guideposts per equal chunk, but we should also capture 
> row counts. Should we have a parallel array with the guideposts that count 
> rows per guidepost, or is it enough to have a per region count?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (PHOENIX-1453) Collect row counts per region in stats table

Reply via email to