ndimiduk commented on a change in pull request #3244:
URL: https://github.com/apache/hbase/pull/3244#discussion_r629712475



##########
File path: hbase-common/src/main/java/org/apache/hadoop/hbase/KeyValue.java
##########
@@ -230,6 +230,11 @@ public static long getKeyDataStructureSize(int rlength, int flength, int qlength
     DeleteColumn((byte)12),
     DeleteFamily((byte)14),
 
+    // Effective maximum is 127 (Byte.MAX_VALUE). We set the high order bit of the
+    // type byte in the WAL codecs to indicate, in a backwards compatible way, if the
+    // value is compressed there.
+    EffectiveMaximum((byte)Byte.MAX_VALUE),
+
     // Maximum is used when searching; you look from maximum on down.
     Maximum((byte)255);

Review comment:
       Since this compression bit mask effectively cuts the `Type` value space 
in half, why introduce a new `EffectiveMaximum`? It seems to me that you can 
instead replace the value of `Maximum` with the new reduced value of `127`.
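
   A minimal sketch of what that could look like (hypothetical, trimmed-down version of `KeyValue.Type`; the real enum carries more members):

```java
public class TypeSketch {
  // Hypothetical trimmed-down KeyValue.Type illustrating the suggestion:
  // since the high bit is reserved as a compression flag in the WAL codecs,
  // fold the reduced ceiling directly into Maximum instead of adding a
  // separate EffectiveMaximum member.
  enum Type {
    Minimum((byte)0),
    Put((byte)4),
    Delete((byte)8),
    DeleteColumn((byte)12),
    DeleteFamily((byte)14),
    // Maximum is used when searching; you look from maximum on down.
    // 127 (Byte.MAX_VALUE) is the largest code once the high bit is reserved.
    Maximum(Byte.MAX_VALUE);

    private final byte code;
    Type(byte code) { this.code = code; }
    byte getCode() { return code; }
  }

  public static void main(String[] args) {
    // None of the remaining codes collide with the reserved high bit.
    for (Type t : Type.values()) {
      assert (t.getCode() & 0x80) == 0 : t;
    }
    System.out.println(Type.Maximum.getCode()); // 127
  }
}
```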

##########
File path: hbase-protocol-shaded/src/main/protobuf/server/region/WAL.proto
##########
@@ -32,6 +32,7 @@ message WALHeader {
   optional bool has_tag_compression = 3;
   optional string writer_cls_name = 4;
   optional string cell_codec_cls_name = 5;
+  optional bool has_value_compression = 6;

Review comment:
       I'm surprised to see that we already have WAL compression, but don't 
explicitly track it in the header.
   
   Why is it necessary to add a flag to the header at all? Is this so that a 
decompressor can be initialized ahead of reading the first compressed `Type` 
code? Is it better to extend the header than to initialize the decompressor 
dynamically? Is it reasonable to assume that compression will be enabled by 
default, and so the decompressor should be created by default and is simply not 
used if no compressed values are read? (I assume that the `Type` code is 
written without compression, and thus a decompressor is not required in order 
to read it back.)
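
   For concreteness, the dynamic alternative the question gestures at might look something like the sketch below (hypothetical names throughout; the real codec may keep shared stream state across values that makes an up-front header flag the safer choice):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

public class LazyValueDecompressor {
  // Hypothetical: the high bit of the type byte marks a compressed value.
  static final int COMPRESSED_BIT = 0x80;

  private InflaterInputStream inflater; // created only when first needed

  // Return the stream to read the value from, creating the decompressor
  // lazily the first time a compressed type byte is observed -- no header
  // flag required.
  InputStream valueStream(byte typeByte, InputStream raw) {
    if ((typeByte & COMPRESSED_BIT) == 0) {
      return raw; // value was written uncompressed
    }
    if (inflater == null) {
      inflater = new InflaterInputStream(raw);
    }
    return inflater;
  }

  public static void main(String[] args) {
    LazyValueDecompressor d = new LazyValueDecompressor();
    InputStream raw = new ByteArrayInputStream(new byte[]{1, 2, 3});
    // An uncompressed Put (code 4) never triggers decompressor creation.
    System.out.println(d.valueStream((byte)4, raw) == raw); // true
  }
}
```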

##########
File path: hbase-common/src/main/java/org/apache/hadoop/hbase/KeyValue.java
##########
@@ -230,6 +230,11 @@ public static long getKeyDataStructureSize(int rlength, int flength, int qlength
     DeleteColumn((byte)12),
     DeleteFamily((byte)14),
 
+    // Effective maximum is 127 (Byte.MAX_VALUE). We set the high order bit of the
+    // type byte in the WAL codecs to indicate, in a backwards compatible way, if the
+    // value is compressed there.
+    EffectiveMaximum((byte)Byte.MAX_VALUE),

Review comment:
       I don't find this approach ugly as such. You're using the highest bit as 
a is-compressed flag, and otherwise retaining the meaning of all existing 
codes. If I understand you correctly, an uncompressed `Put` has a `Type.code` 
of `0b0000 0100` while a compressed `Put` has a `Type.code` of `0b1000 0100`. 
This is fine, so long as we're okay with cutting the size of the `Type` value 
space in half.
   
   For what it's worth, it might seem like less of an "ugly trick" and more of 
an "intentional, explicit design" if you were instead to implement it as one of 
the following.
   * Add explicit `Type` entries for `CompressedPut`, `CompressedDelete`, &c., 
with their associated `code` values also explicitly listed.
   * Fold the representation of compression into the `Type` implementation 
itself by having each `Type` carry two `code` values (one for compressed, one 
for uncompressed).
   
   These are simply suggestions; each carries implementation ramifications that 
I cannot estimate without significantly more code reading.
   
   ---
   
   Returning to this comment after reading the bit where the `Type` code is 
written. You might consider adding to this Enum an instance method like `int 
appendTo(OutputStream out)` that encapsulates the details of the `code` 
encoding.
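
   Roughly this shape (a hypothetical, trimmed-down sketch, not the actual `KeyValue$Type`; member set and names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class AppendToSketch {
  // Hypothetical trimmed-down Type showing the suggested appendTo method,
  // which keeps the code-byte encoding (including the compression flag)
  // inside the enum instead of scattered across the WAL codec.
  enum Type {
    Put((byte)4),
    Delete((byte)8);

    static final byte COMPRESSED_FLAG = (byte)0x80; // high bit of the code

    private final byte code;
    Type(byte code) { this.code = code; }

    // Writes this type's code byte, setting the high bit when the value
    // is stored compressed; returns the number of bytes written.
    int appendTo(OutputStream out, boolean valueCompressed) throws IOException {
      out.write(valueCompressed ? (code | COMPRESSED_FLAG) : code);
      return 1;
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Type.Put.appendTo(out, false);
    Type.Put.appendTo(out, true);
    byte[] bytes = out.toByteArray();
    System.out.println(bytes[0]);                 // 4
    System.out.println(bytes[1] == (byte)0x84);   // true
  }
}
```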

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java
##########
@@ -34,21 +36,49 @@
 @InterfaceAudience.LimitedPrivate({HBaseInterfaceAudience.COPROC, 
HBaseInterfaceAudience.PHOENIX})
 public class CompressionContext {
 
-  static final String ENABLE_WAL_TAGS_COMPRESSION =
+  public static final String ENABLE_WAL_TAGS_COMPRESSION =
       "hbase.regionserver.wal.tags.enablecompression";
 
+  public static final String ENABLE_WAL_VALUE_COMPRESSION =
+      "hbase.regionserver.wal.value.enablecompression";
+
   public enum DictionaryIndex {
     REGION, TABLE, FAMILY, QUALIFIER, ROW
   }
 
+  static class ValueCompressor {

Review comment:
       nit: it's nice for all classes to have a javadoc header.

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java
##########
@@ -34,21 +36,49 @@
 @InterfaceAudience.LimitedPrivate({HBaseInterfaceAudience.COPROC, 
HBaseInterfaceAudience.PHOENIX})
 public class CompressionContext {
 
-  static final String ENABLE_WAL_TAGS_COMPRESSION =
+  public static final String ENABLE_WAL_TAGS_COMPRESSION =

Review comment:
       nit: why make these symbols public? Can making them public be avoided?

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java
##########
@@ -241,10 +246,27 @@ public void write(Cell cell) throws IOException {
         compression.getDictionary(CompressionContext.DictionaryIndex.FAMILY));
      PrivateCellUtil.compressQualifier(out, cell,
        compression.getDictionary(CompressionContext.DictionaryIndex.QUALIFIER));
-      // Write timestamp, type and value as uncompressed.
+      // Write timestamp, type and value.
       StreamUtils.writeLong(out, cell.getTimestamp());
-      out.write(cell.getTypeByte());
-      PrivateCellUtil.writeValue(out, cell, cell.getValueLength());
+      byte type = cell.getTypeByte();
+      if (compression.getValueCompressor() != null &&
+          cell.getValueLength() > VALUE_COMPRESS_THRESHOLD) {
+        // Try compressing the cell's value
+        byte[] compressedBytes = compressValue(cell);
+        // Only write the compressed value if we have achieved some space savings.
+        if (compressedBytes.length < cell.getValueLength()) {
+          // Set the high bit of type to indicate the value is compressed
+          out.write((byte)(type|0x80));

Review comment:
       Please keep this value `0x80` in a named constant that can be documented 
back to the `KeyValue$Type` enum, or, better yet, encapsulate it in that object.
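
   As a sketch of the named-constant version (constant and helper names here are hypothetical):

```java
public class TypeByteFlag {
  /**
   * High bit of the WAL type byte; marks the cell value as compressed.
   * Hypothetical constant name; would be documented against KeyValue.Type.
   */
  static final byte VALUE_COMPRESSED_BIT = (byte)0x80;

  // Set the flag on a type byte when writing a compressed value.
  static byte markCompressed(byte type) {
    return (byte)(type | VALUE_COMPRESSED_BIT);
  }

  // Test the flag on a type byte read back from the WAL.
  static boolean isCompressed(byte type) {
    return (type & VALUE_COMPRESSED_BIT) != 0;
  }

  // Strip the flag to recover the original KeyValue.Type code.
  static byte clearFlag(byte type) {
    return (byte)(type & ~VALUE_COMPRESSED_BIT);
  }

  public static void main(String[] args) {
    byte put = 4; // KeyValue Put code
    byte flagged = markCompressed(put);
    System.out.println(isCompressed(flagged)); // true
    System.out.println(clearFlag(flagged));    // 4
  }
}
```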

##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java
##########
@@ -220,6 +223,8 @@ public ByteString compress(byte[] data, Enum dictIndex) {
     }
   }
 
+  static final int VALUE_COMPRESS_THRESHOLD = 100;

Review comment:
       In researching HBASE-25770, I found similar analysis performed by Jetty 
devs re: gzip compression. I do not know the relationship between gzip and 
Deflate, but if it's useful, IIRC, Jetty settled on a size threshold of 23 
bytes.
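
   A quick experiment with `java.util.zip` shows the effect a threshold guards against: Deflate's framing overhead can make a very small value come out larger than it went in. (Class and method names below are illustrative only.)

```java
import java.util.zip.Deflater;

public class TinyValueDeflate {
  // Illustrative: measure the deflated size of a payload to show why a
  // minimum-size threshold is worthwhile -- for tiny inputs, Deflate's
  // stream framing can exceed any savings from compression.
  static int deflatedSize(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    byte[] buf = new byte[64];
    int total = 0;
    while (!deflater.finished()) {
      total += deflater.deflate(buf);
    }
    deflater.end();
    return total;
  }

  public static void main(String[] args) {
    byte[] tiny = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    // A 10-byte value "compresses" to more than 10 bytes.
    System.out.println(deflatedSize(tiny) > tiny.length); // true
  }
}
```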




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

