yihua commented on code in PR #13873:
URL: https://github.com/apache/hudi/pull/13873#discussion_r2334214626
##########
hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileDataBlock.java:
##########
@@ -203,17 +205,30 @@ protected ByteBuffer getUncompressedBlockDataToWrite() {
ByteBuffer dataBuf = ByteBuffer.allocate(context.getBlockSize());
for (KeyValueEntry kv : entriesToWrite) {
// Length of key + length of a short variable indicating length of key.
- dataBuf.putInt(kv.key.length + SIZEOF_INT16);
+ // Note that 10 extra bytes are required by the HBase reader.
+ // That is: 1 byte for column family length, 8 bytes for timestamp,
+ // and 1 byte for key type.
+ dataBuf.putInt(kv.key.length + SIZEOF_INT16 + SIZEOF_BYTE + SIZEOF_INT64 + SIZEOF_BYTE);
// Length of value.
dataBuf.putInt(kv.value.length);
// Key content length.
dataBuf.putShort((short)kv.key.length);
// Key.
dataBuf.put(kv.key);
+ // Column family length: constant 0.
+ dataBuf.put((byte)0);
+ // Column qualifier: assumed empty, so no bytes are written.
+ // Timestamp: constant 0.
+ dataBuf.putLong(0L);
+ // Key type: constant Put (4) in Hudi.
+ // Minimum((byte) 0), Put((byte) 4), Delete((byte) 8),
+ // DeleteFamilyVersion((byte) 10), DeleteColumn((byte) 12),
+ // DeleteFamily((byte) 14), Maximum((byte) 255).
+ dataBuf.put((byte)4);
Review Comment:
Why did tests not fail because of this missing logic?
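For context, the cell layout this hunk emits can be sketched as below. This is a standalone illustration, not Hudi's actual `HFileDataBlock` code; the class name `CellLayoutSketch` and the size constants are hypothetical stand-ins for the identifiers in the diff:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the HBase-compatible cell layout described above:
// keyPartLen(int) | valueLen(int) | rowLen(short) | row | famLen(byte)=0
// | (empty qualifier) | timestamp(long)=0 | keyType(byte)=Put(4) | value
public class CellLayoutSketch {
  static final int SIZEOF_INT16 = 2;
  static final int SIZEOF_BYTE = 1;
  static final int SIZEOF_INT64 = 8;
  static final byte KEY_TYPE_PUT = 4;

  static byte[] encode(byte[] key, byte[] value) {
    // The "10 extra bytes": 1 (family length) + 8 (timestamp) + 1 (key type),
    // plus the 2-byte row-key length prefix.
    int keyPartLen = SIZEOF_INT16 + key.length + SIZEOF_BYTE + SIZEOF_INT64 + SIZEOF_BYTE;
    ByteBuffer buf = ByteBuffer.allocate(4 + 4 + keyPartLen + value.length);
    buf.putInt(keyPartLen);           // length of the key part
    buf.putInt(value.length);         // length of the value part
    buf.putShort((short) key.length); // row key length
    buf.put(key);                     // row key bytes
    buf.put((byte) 0);                // column family length: constant 0
    // column qualifier: empty, contributes no bytes
    buf.putLong(0L);                  // timestamp: constant 0
    buf.put(KEY_TYPE_PUT);            // key type: Put (4)
    buf.put(value);                   // value bytes
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] cell = encode("k1".getBytes(), "v1".getBytes());
    // key part = 2 + 2 + 1 + 8 + 1 = 14; total = 4 + 4 + 14 + 2 = 24
    System.out.println(cell.length);
  }
}
```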
##########
hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileWriterImpl.java:
##########
@@ -219,6 +219,41 @@ private void initFileInfo() {
new byte[]{0});
}
+ protected void finishFileInfo() {
+ // Record last key.
+ fileInfoBlock.add(
+ new String(LAST_KEY.getBytes(), StandardCharsets.UTF_8),
+ addKeyLength(lastKey));
+ fileInfoBlock.setStartOffsetInBuffForWrite(currentOffset);
+
+ // Average key length.
+ int avgKeyLen = totalNumberOfRecords == 0
+ ? 0 : (int) (totalKeyLength / totalNumberOfRecords);
+ fileInfoBlock.add(
+ new String(HFileInfo.AVG_KEY_LEN.getBytes(), StandardCharsets.UTF_8),
+ toBytes(avgKeyLen));
+ fileInfoBlock.add(
+ new String(HFileInfo.FILE_CREATION_TIME_TS.getBytes(), StandardCharsets.UTF_8),
+ toBytes(context.getFileCreateTime()));
+
+ // Average value length.
+ int avgValueLen = totalNumberOfRecords == 0
+ ? 0 : (int) (totalValueLength / totalNumberOfRecords);
+ fileInfoBlock.add(
+ new String(HFileInfo.AVG_VALUE_LEN.getBytes(), StandardCharsets.UTF_8),
+ toBytes(avgValueLen));
+
+ // NOTE: To keep MVCC usage consistent across different table versions,
+ // the following properties should be set. After table versions <= 8 are
+ // deprecated, the MVCC byte can be removed from the key-value pair.
+ fileInfoBlock.add(
+ new String(HFileInfo.KEY_VALUE_VERSION.getBytes(), StandardCharsets.UTF_8),
+ toBytes(KEY_VALUE_VERSION_WITH_MVCC_TS));
+ fileInfoBlock.add(
+ new String(MAX_MVCC_TS_KEY.getBytes(), StandardCharsets.UTF_8),
+ toBytes(1L));
Review Comment:
Max MVCC timestamp should be set to 0, not 1: Hudi does not leverage the MVCC
timestamp in HFile, so the entries should all carry a zero MVCC ts. Setting it
to 0 keeps the file info consistent with the actual MVCC timestamps in the
entries, because the new logic writes 0 as the MVCC ts.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]