apurtell commented on a change in pull request #3244:
URL: https://github.com/apache/hbase/pull/3244#discussion_r628787521
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java
##########
@@ -256,6 +270,28 @@ public void write(Cell cell) throws IOException {
}
}
}
+
+  public static void writeCompressedValue(OutputStream out, byte[] valueArray, int offset,
+      int vlength, Deflater deflater) throws IOException {
+    byte[] buffer = new byte[4096];
+    ByteArrayOutputStream baos = new ByteArrayOutputStream();
+    deflater.reset();
+    deflater.setInput(valueArray, offset, vlength);
+    boolean finished = false;
+    do {
+      int bytesOut = deflater.deflate(buffer);
+      if (bytesOut > 0) {
+        baos.write(buffer, 0, bytesOut);
+      } else {
+        bytesOut = deflater.deflate(buffer, 0, buffer.length, Deflater.FULL_FLUSH);
Review comment:
Currently we completely flush the encoder at the end of every value
(FULL_FLUSH). This is very conservative: it allows each value to be decompressed
individually, so it is resilient to corruption, but it hurts the compression
ratio. Meanwhile our custom dictionary scheme accumulates strings over the whole
file, so we are being conservative in value compression in a way our custom
scheme defeats anyway. We could instead let the Deflater build its dictionary
over all values in the WAL by using SYNC_FLUSH instead.
Note that we then can't create a fresh Inflater for each value in
`readCompressedValue`; we have to init one in the `CompressionContext` when we
set up the context and reuse it, so it can build its dictionary the same way
the Deflater did, over all values in the file.
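To make the tradeoff concrete, here is a minimal standalone sketch (not HBase
code; `compressValue` and `decompressValue` are hypothetical helpers) of the
suggested approach: a single long-lived Deflater flushed with SYNC_FLUSH after
each value, so back-references can reach into earlier values, paired with a
single long-lived Inflater on the read side whose window stays in step with
the Deflater's.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SyncFlushDemo {
  // Compress one value with a long-lived Deflater. SYNC_FLUSH pushes out all
  // pending output so the value is immediately decodable, but, unlike
  // FULL_FLUSH, it does NOT reset the dictionary, so later values can
  // back-reference earlier ones.
  static byte[] compressValue(Deflater deflater, byte[] value) {
    deflater.setInput(value);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int n;
    while ((n = deflater.deflate(buffer, 0, buffer.length, Deflater.SYNC_FLUSH)) > 0) {
      baos.write(buffer, 0, n);
    }
    return baos.toByteArray();
  }

  // The matching Inflater must likewise live across the whole stream; a fresh
  // Inflater per value would lack the accumulated window and fail.
  static byte[] decompressValue(Inflater inflater, byte[] compressed, int expectedLength)
      throws DataFormatException {
    inflater.setInput(compressed);
    byte[] out = new byte[expectedLength];
    int off = 0;
    while (off < expectedLength) {
      off += inflater.inflate(out, off, expectedLength - off);
    }
    return out;
  }

  public static void main(String[] args) throws Exception {
    Deflater deflater = new Deflater();
    Inflater inflater = new Inflater();
    byte[] v = "row/family:qualifier/value-payload".getBytes("UTF-8");
    byte[] c1 = compressValue(deflater, v);
    // Second, identical value: compresses to a handful of bytes because it is
    // one back-reference into the first value, still in the shared window.
    byte[] c2 = compressValue(deflater, v);
    byte[] d1 = decompressValue(inflater, c1, v.length);
    byte[] d2 = decompressValue(inflater, c2, v.length);
    boolean ok = Arrays.equals(d1, v) && Arrays.equals(d2, v) && c2.length < c1.length;
    System.out.println(ok);
    deflater.end();
    inflater.end();
  }
}
```

Running the sketch prints `true`: both values round-trip, and the second
compressed value is smaller than the first, which is exactly the ratio win
SYNC_FLUSH buys over per-value FULL_FLUSH, at the cost of each value no longer
being independently decompressible.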
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]