yihua opened a new pull request, #9315:
URL: https://github.com/apache/hudi/pull/9315

   ### Change Logs
   
   This PR changes the format of delete log blocks and upgrades the log format 
from version 2 to 3.  After this PR, the delete records (record key, partition 
path, and ordering value) are serialized with Avro before being written into 
the delete log block, instead of with Kryo, a Java-specific serialization 
framework.  This makes delete blocks deserializable in programming languages 
other than Java, as Avro is supported by common programming languages such as 
[Java](https://github.com/apache/avro/tree/master/lang/java), 
[C/C++](https://github.com/apache/avro/tree/master/lang/c), 
[Python](https://github.com/apache/avro/tree/master/lang/py), and 
[Rust](https://github.com/apache/avro/tree/master/lang/rust).
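
To illustrate why an Avro payload is readable outside the JVM, the sketch below hand-encodes a simplified delete record using the binary encoding rules from the Avro specification: a `long` is zigzag-mapped and written as a variable-length integer, a `string` is its UTF-8 byte length followed by the bytes, and record fields are concatenated in schema order. The two-field record and the class name `AvroEncodingSketch` are hypothetical. The actual `HoodieDeleteRecord` schema in this PR also carries an ordering value and may wrap fields in unions, which this sketch leaves out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AvroEncodingSketch {
  // Avro long: zigzag-map the sign bit, then emit 7 bits per byte, low bits first.
  public static void writeLong(ByteArrayOutputStream out, long n) {
    long z = (n << 1) ^ (n >> 63);
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80)); // continuation bit set
      z >>>= 7;
    }
    out.write((int) z);
  }

  // Avro string: UTF-8 byte length as a long, followed by the UTF-8 bytes.
  public static void writeString(ByteArrayOutputStream out, String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeLong(out, utf8.length);
    out.write(utf8, 0, utf8.length);
  }

  // Hypothetical simplified record: two non-null string fields in schema order.
  public static byte[] encodeSimplifiedDeleteRecord(String recordKey, String partitionPath) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeString(out, recordKey);
    writeString(out, partitionPath);
    return out.toByteArray();
  }

  public static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(toHex(encodeSimplifiedDeleteRecord("foo", "2023/07")));
  }
}
```

Because the layout depends only on the schema and these rules, any language with an Avro library (or a few dozen lines like the above) can decode the same bytes.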
   
   Changes are:
   - Adds new Avro schemas for delete records, `HoodieDeleteRecordList` and 
`HoodieDeleteRecord`, reusing the Avro type wrappers used by column stats in 
metadata;
   - Moves the Avro type wrappers and util methods to `HoodieAvroUtils` for reuse;
   - Upgrades the log block format version from 2 to 3;
   - Changes `HoodieDeleteBlock` to serialize delete records using Avro and to 
deserialize v3 content using Avro.  Deserialization of v2 content still uses 
Kryo for backward compatibility;
   - Adds unit tests for the Avro serde of the delete block in 
`TestHoodieDeleteBlock`.
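
The backward-compatible read path described in the list above can be sketched as a dispatch on the log block format version. This is a minimal illustration and not the code in `HoodieDeleteBlock`: the class `DeleteBlockReaderSketch`, the `Decoder` interface, and the stand-in decoders are all hypothetical, and versions other than 2 and 3 are simply rejected because they are out of scope here.

```java
import java.util.Collections;
import java.util.List;

public class DeleteBlockReaderSketch {
  // Hypothetical decoder abstraction standing in for the Kryo and Avro read paths.
  public interface Decoder {
    List<String> decode(byte[] content);
  }

  private final Decoder kryoDecoder;
  private final Decoder avroDecoder;

  public DeleteBlockReaderSketch(Decoder kryoDecoder, Decoder avroDecoder) {
    this.kryoDecoder = kryoDecoder;
    this.avroDecoder = avroDecoder;
  }

  // v2 content was written with Kryo, v3 content with Avro (per this PR).
  public List<String> readDeleteKeys(int formatVersion, byte[] content) {
    switch (formatVersion) {
      case 2:
        return kryoDecoder.decode(content);
      case 3:
        return avroDecoder.decode(content);
      default:
        // Other versions are not covered by this sketch.
        throw new IllegalArgumentException("Unsupported version in sketch: " + formatVersion);
    }
  }

  public static void main(String[] args) {
    // Stand-in decoders that only report which path was taken.
    DeleteBlockReaderSketch reader = new DeleteBlockReaderSketch(
        content -> Collections.singletonList("decoded-with-kryo"),
        content -> Collections.singletonList("decoded-with-avro"));
    System.out.println(reader.readDeleteKeys(2, new byte[0]));
    System.out.println(reader.readDeleteKeys(3, new byte[0]));
  }
}
```

Keeping the old decoder behind the version check means log files written before the upgrade stay readable, while every newly written block uses the Avro path.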
   
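   The following benchmark compares the serde of Avro (v3) and Kryo (v2) for a 
delete block with a varying number of deletes.  Avro serde is moderately slower 
than Kryo in most cases.  The exceptions are serialization of 100,000 entries, 
where the two are roughly on par, and of 1 million entries, where Avro is about 
twice as fast.  Even at 1 million entries, Avro deserialization is slower by 
only a few hundred milliseconds.  For 100 or more deletes, the bytes generated 
by Avro are about 11% to 14% larger than those generated by Kryo; for 1 and 10 
deletes, the Avro output is smaller.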
   
   ```
   @Benchmark
   public void serializeDeleteRecords(BenchState bs, Blackhole bh) throws IOException {
     HoodieDeleteBlock deleteBlock = new HoodieDeleteBlock(bs.deleteRecordArray, new HashMap<>());
     deleteBlock.getContentBytes();
   }

   @Benchmark
   public void deserializeDeleteRecords(BenchState bs, Blackhole bh) throws IOException {
     HoodieDeleteBlock deleteBlock = new HoodieDeleteBlock(
         Option.of(bs.deleteContentBytes), null, true, Option.empty(), new HashMap<>(), new HashMap<>());
     deleteBlock.getRecordsToDelete();
   }
   ```
   
   | numDeletes | Avro serialization (ms/op) | Kryo serialization (ms/op) | Avro deserialization (ms/op) | Kryo deserialization (ms/op) | Avro bytes | Kryo bytes |
   | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
   | 1 | 0.001 ± 0.001 | 0.001 ± 0.001 | 0.001 ± 0.001 | 0.001 ± 0.001 | 82 | 316 |
   | 10 | 0.005 ± 0.001 | 0.003 ± 0.001 | 0.008 ± 0.008 | 0.003 ± 0.001 | 634 | 686 |
   | 100 | 0.054 ± 0.027 | 0.025 ± 0.003 | 0.060 ± 0.009 | 0.022 ± 0.004 | 6,171 | 5,546 |
   | 1000 | 0.514 ± 0.071 | 0.357 ± 0.048 | 0.624 ± 0.171 | 0.214 ± 0.044 | 61,523 | 54,146 |
   | 10000 | 5.665 ± 1.370 | 4.003 ± 1.201 | 6.591 ± 1.152 | 2.051 ± 0.181 | 614,990 | 540,100 |
   | 100000 | 84.675 ± 31.536 | 86.890 ± 152.199 | 54.503 ± 32.649 | 21.312 ± 3.302 | 6,149,725 | 5,399,763 |
   | 1000000 | 613.506 ± 85.529 | 1231.790 ± 23.968 | 500.381 ± 50.868 | 312.019 ± 768.724 | 61,496,183 | 53,996,274 |
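
The size columns in the table above can be turned into a relative overhead with a few lines of arithmetic. The sketch below copies the byte counts from the table and computes how much larger (or smaller) the Avro payload is than the Kryo payload. The class and method names are illustrative only and are not part of this PR.

```java
public class DeleteBlockSizeOverhead {
  // Percentage by which avroBytes exceeds kryoBytes; negative means Avro is smaller.
  public static double overheadPercent(long avroBytes, long kryoBytes) {
    return 100.0 * (avroBytes - kryoBytes) / kryoBytes;
  }

  public static void main(String[] args) {
    // Columns: numDeletes, Avro bytes, Kryo bytes (values taken from the table).
    long[][] rows = {
        {1, 82, 316},
        {10, 634, 686},
        {100, 6_171, 5_546},
        {1_000, 61_523, 54_146},
        {10_000, 614_990, 540_100},
        {100_000, 6_149_725, 5_399_763},
        {1_000_000, 61_496_183, 53_996_274},
    };
    for (long[] row : rows) {
      System.out.printf("%d deletes: %+.1f%%%n", row[0], overheadPercent(row[1], row[2]));
    }
  }
}
```

With these inputs the overhead is negative for 1 and 10 deletes (Avro is smaller) and lies between roughly 11% and 14% from 100 deletes upward.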
   
   ### Impact
   
   Makes delete blocks deserializable in programming languages other than Java, 
as Avro is supported by common programming languages.
   
   ### Risk level
   
   low
   
   ### Documentation Update
   
   Update delete block format in Hudi Tech Spec and docs: HUDI-6616
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

