[jira] [Comment Edited] (HBASE-12313) Redo the hfile index length optimization so cell-based rather than serialized KV key

Anoop Sam John (JIRA) Sun, 26 Oct 2014 03:18:02 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184438#comment-14184438
 ]


Anoop Sam John edited comment on HBASE-12313 at 10/26/14 10:16 AM:
-------------------------------------------------------------------

{code}
       for (Cell cell : rr.rawCells()) {
-        resultSize += CellUtil.estimatedLengthOf(cell);
+        resultSize += CellUtil.estimatedSerializedSizeOf(cell);
{code}
estimatedLengthOf was returning the total length. estimatedSerializedSizeOf() 
is having extra count 4 bytes +.   Do you want to change really Stack?

Replacing estimatedLengthOf with estimatedSerializedSizeOf is correct?

{code}
+  private static int getSumOfKeyElementLengths(final Cell cell) {
+    return cell.getRowLength() + cell.getFamilyLength() +
+    cell.getQualifierLength() +
+    cell.getValueLength() +
+    cell.getTagsLength() +
+    KeyValue.TIMESTAMP_TYPE_SIZE;
+  }
+
+  public static int estimatedSerializedSizeOfKey(final Cell cell) {
+    if (cell instanceof KeyValue) return ((KeyValue)cell).getKeyLength();
+    // This will be a low estimate.  Will do for now.
+    return getSumOfKeyElementLengths(cell);
+  }
{code}
getSumOfKeyElementLengths - including lengths of tags and value?

{code}
-    return cell.getRowLength() + cell.getFamilyLength() + 
cell.getQualifierLength()
-        + cell.getValueLength() + cell.getTagsLength() + 
KeyValue.TIMESTAMP_TYPE_SIZE;
+    // TODO: Add sizing of references that hold the row, family, etc., arrays.
+    return estimatedSerializedSizeOf(cell);
{code}
No need to add the extra 4 bytes for heapSize which will come in 
estimatedSerializedSizeOf (?)

{code}
+  public static String getCellKeyAsString(Cell cell) {
+    StringBuilder sb = new StringBuilder(Bytes.toStringBinary(
+      cell.getRowArray(), cell.getRowOffset(), cell.getRowLength()));
+    sb.append(cell.getFamilyLength() == 0? "":
+      Bytes.toStringBinary(cell.getFamilyArray(), cell.getFamilyOffset(), 
cell.getFamilyLength()));
+    sb.append(cell.getQualifierLength() == 0? "":
+      Bytes.toStringBinary(cell.getQualifierArray(), cell.getQualifierOffset(),
+        cell.getQualifierLength()));
{code}
Can we add a separator in between rk, f and q parts?

{code}
-    // h goes to the next block
-    assertEquals(-2, scanner.seekTo(toKV("h", tagUsage)));
+    // 'h' does not exist so we will get a '1' back for not found.
+    assertEquals(0, scanner.seekTo(toKV("i", tagUsage)));
assertEquals("i", toRowStr(scanner.getKeyValue()));
{code}
What if we do seekTo 'h' only ?


{code}
-    assertEquals(1, blockIndexReader.rootBlockContainingKey(
-        toKV("h", tagUsage)));
+    // 'h', being midpoint between 'g' and 'i', used to be the block index key 
because of the
+    // little optimization done creating block index keys where we try to get 
a midpoint and then
+    // make this midpoint short as possible so index blocks are kept tight. 
But now, we won't do
+    // the 'optimization' -- create new key -- if there is no gain to be had 
by way of making
+    // a shorter key; in this case we just use the start key in the index.  
This means the below
+    // test changes.  Looking for 'h', it'll be in the 0 block rather than 1 
block now (though 'h'
+    // does not exist in this file).
+    assertEquals(0, blockIndexReader.rootBlockContainingKey(toKV("h", 
tagUsage)));
{code}
Read your comment to see why is the change. Will this change in mid point calc 
make any issue in reads?


was (Author: anoop.hbase):
{code}
       for (Cell cell : rr.rawCells()) {
-        resultSize += CellUtil.estimatedLengthOf(cell);
+        resultSize += CellUtil.estimatedSerializedSizeOf(cell);
{code}
estimatedLengthOf was returning the total length. estimatedSerializedSizeOf() 
is having extra count 4 bytes +.   Do you want to change really Stack?

Replacing estimatedLengthOf with estimatedSerializedSizeOf is correct?

> Redo the hfile index length optimization so cell-based rather than serialized 
> KV key
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-12313
>                 URL: https://issues.apache.org/jira/browse/HBASE-12313
>             Project: HBase
>          Issue Type: Sub-task
>          Components: regionserver, Scanners
>            Reporter: stack
>            Assignee: stack
>         Attachments: 
> 0001-HBASE-12313-Redo-the-hfile-index-length-optimization.patch, 
> 0001-HBASE-12313-Redo-the-hfile-index-length-optimization.patch, 
> 0001-HBASE-12313-Redo-the-hfile-index-length-optimization.patch, 
> 0001-HBASE-12313-Redo-the-hfile-index-length-optimization.patch, 
> 0001-HBASE-12313-Redo-the-hfile-index-length-optimization.patch, 12313v5.txt
>
>
> Trying to remove API that returns the 'key' of a KV serialized into a byte 
> array is thorny.
> I tried to move over the first and last key serializations and the hfile 
> index entries to be cell but patch was turning massive.  Here is a smaller 
> patch that just redoes the optimization that tries to find 'short' midpoints 
> between last key of last block and first key of next block so it is 
> Cell-based rather than byte array based (presuming Keys serialized in a 
> certain way).  Adds unit tests which we didn't have before.
> Also remove CellKey.  Not needed... at least not yet.  Its just utility for 
> toString.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (HBASE-12313) Redo the hfile index length optimization so cell-based rather than serialized KV key

Reply via email to