yihua opened a new pull request, #9315: URL: https://github.com/apache/hudi/pull/9315
### Change Logs

This PR changes the format of delete log blocks and upgrades the log format from version 2 to 3. After this PR, the delete records (record key, partition path, and ordering value) are serialized in Avro and then written into the delete log block, instead of using Kryo, which is a serialization framework for Java. This makes delete blocks deserializable in programming languages other than Java, as Avro is supported by common programming languages such as [Java](https://github.com/apache/avro/tree/master/lang/java), [C/C++](https://github.com/apache/avro/tree/master/lang/c), [Python](https://github.com/apache/avro/tree/master/lang/py), and [Rust](https://github.com/apache/avro/tree/master/lang/rust).

Changes are:

- New Avro schema for delete records: `HoodieDeleteRecordList` and `HoodieDeleteRecord`, reusing the Avro type wrappers used by column stats in metadata;
- Moves the Avro type wrappers and util methods to `HoodieAvroUtils` for reuse;
- Upgrades the log block format version from 2 to 3;
- Changes the logic in `HoodieDeleteBlock` to serialize delete records using Avro and to deserialize v3 content using Avro. Deserialization of v2 content still uses Kryo for backward compatibility;
- Adds unit tests of the Avro serde of the delete block in `TestHoodieDeleteBlock`.

The following shows the benchmark result comparing the serde of Avro (v3) vs Kryo (v2) for a delete block with a varying number of deletes. Avro serde is moderately slower than Kryo in most cases; the exceptions are serialization of 100,000 entries, where the two are on par, and of 1 million entries, where Avro is faster. Where Avro is slower, the gap is at most a few hundred milliseconds (about 190 ms when deserializing 1 million entries). For 100 or more deletes, the bytes generated by Avro are roughly 11% to 14% larger than those generated by Kryo; for 1 and 10 deletes, Avro produces fewer bytes.
```
// Measures serialization: builds a delete block from the records and produces its content bytes.
@Benchmark
public void serializeDeleteRecords(BenchState bs, Blackhole bh) throws IOException {
  HoodieDeleteBlock deleteBlock = new HoodieDeleteBlock(bs.deleteRecordArray, new HashMap<>());
  deleteBlock.getContentBytes();
}

// Measures deserialization: builds a delete block from serialized content bytes and reads the records back.
@Benchmark
public void deserializeDeleteRecords(BenchState bs, Blackhole bh) throws IOException {
  HoodieDeleteBlock deleteBlock = new HoodieDeleteBlock(
      Option.of(bs.deleteContentBytes), null, true, Option.empty(), new HashMap<>(), new HashMap<>());
  deleteBlock.getRecordsToDelete();
}
```

| numDeletes | Avro serialization | Kryo serialization | Avro deserialization | Kryo deserialization | Avro bytes | Kryo bytes |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 1 | 0.001 ± 0.001 ms/op | 0.001 ± 0.001 ms/op | 0.001 ± 0.001 ms/op | 0.001 ± 0.001 ms/op | 82 | 316 |
| 10 | 0.005 ± 0.001 ms/op | 0.003 ± 0.001 ms/op | 0.008 ± 0.008 ms/op | 0.003 ± 0.001 ms/op | 634 | 686 |
| 100 | 0.054 ± 0.027 ms/op | 0.025 ± 0.003 ms/op | 0.060 ± 0.009 ms/op | 0.022 ± 0.004 ms/op | 6,171 | 5,546 |
| 1000 | 0.514 ± 0.071 ms/op | 0.357 ± 0.048 ms/op | 0.624 ± 0.171 ms/op | 0.214 ± 0.044 ms/op | 61,523 | 54,146 |
| 10000 | 5.665 ± 1.370 ms/op | 4.003 ± 1.201 ms/op | 6.591 ± 1.152 ms/op | 2.051 ± 0.181 ms/op | 614,990 | 540,100 |
| 100000 | 84.675 ± 31.536 ms/op | 86.890 ± 152.199 ms/op | 54.503 ± 32.649 ms/op | 21.312 ± 3.302 ms/op | 6,149,725 | 5,399,763 |
| 1000000 | 613.506 ± 85.529 ms/op | 1231.790 ± 23.968 ms/op | 500.381 ± 50.868 ms/op | 312.019 ± 768.724 ms/op | 61,496,183 | 53,996,274 |

### Impact

Makes delete blocks deserializable in programming languages other than Java, as Avro is supported by common programming languages.
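
As a quick check of the size figures, the following minimal Java sketch recomputes the relative size of the Avro payload from the byte counts in the benchmark table. The class and method names (`DeleteBlockSizeOverhead`, `overheadPercent`) are made up for this illustration and are not part of Hudi; only the byte counts come from the table.

```java
final class DeleteBlockSizeOverhead {
    // Byte counts copied from the benchmark table: {numDeletes, avroBytes, kryoBytes}.
    // The class itself is hypothetical and exists only for this illustration.
    static final long[][] SIZES = {
        {1L, 82L, 316L},
        {10L, 634L, 686L},
        {100L, 6_171L, 5_546L},
        {1_000L, 61_523L, 54_146L},
        {10_000L, 614_990L, 540_100L},
        {100_000L, 6_149_725L, 5_399_763L},
        {1_000_000L, 61_496_183L, 53_996_274L}
    };

    // Size of the Avro payload relative to the Kryo payload, in percent.
    // Positive means Avro is larger; negative means Avro is smaller.
    static double overheadPercent(long avroBytes, long kryoBytes) {
        return 100.0 * (avroBytes - kryoBytes) / kryoBytes;
    }

    public static void main(String[] args) {
        for (long[] row : SIZES) {
            System.out.printf("%d deletes: %+.1f%%%n", row[0], overheadPercent(row[1], row[2]));
        }
    }
}
```

On the table's values, this gives roughly -74% and -8% for 1 and 10 deletes, and between about +11% and +14% from 100 deletes upward.
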
### Risk level

low

### Documentation Update

Update delete block format in Hudi Tech Spec and docs: HUDI-6616

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
