gowa commented on PR #3121:
URL: https://github.com/apache/parquet-java/pull/3121#issuecomment-2608733828
Hi @gszadovszky , @wgtmac . Thank you for your feedback.
Yes, I see that it is a big feature and the implementation is far from being
a simple fix. And, maybe, it should be a pluggable thing instead of being a
first-class resident in the code. However, if you feel the changes can be
incorporated into the main codebase, I could try to find someone to review
the ByteBuddy part and implement the reader part as well.
As for benchmarks: I've implemented and committed some. I attempted to
replicate the original org.apache.parquet.benchmarks.WriteBenchmarks with some
proto stuff in org.apache.parquet.benchmarks.ProtoWriteBenchmarks.
The results are as follows: the more fields (especially primitives), the
bigger the gain.
E.g. for 100 int32 fields:
```
Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed               OFF            Test100Int32  ss      5  13.171 ± 1.206  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed               REQUIRED_ALL   Test100Int32  ss      5   6.075 ± 1.258  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed               OFF            Test100Int32  ss      5  13.304 ± 1.497  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed               REQUIRED_ALL   Test100Int32  ss      5   6.235 ± 0.617  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed               OFF            Test100Int32  ss      5  13.450 ± 3.429  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed               REQUIRED_ALL   Test100Int32  ss      5   5.947 ± 0.430  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed               OFF            Test100Int32  ss      5  13.433 ± 3.879  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed               REQUIRED_ALL   Test100Int32  ss      5   6.523 ± 2.831  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP          OFF            Test100Int32  ss      5  13.288 ± 0.429  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP          REQUIRED_ALL   Test100Int32  ss      5   6.333 ± 0.444  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY        OFF            Test100Int32  ss      5  13.197 ± 1.396  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY        REQUIRED_ALL   Test100Int32  ss      5   6.855 ± 2.689  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed  OFF            Test100Int32  ss      5  13.473 ± 1.930  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed  REQUIRED_ALL   Test100Int32  ss      5   6.006 ± 0.285  s/op
```
For 30 int32 fields:
```
Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed               OFF            Test30Int32   ss      5   3.421 ± 1.303  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed               REQUIRED_ALL   Test30Int32   ss      5   2.410 ± 0.357  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed               OFF            Test30Int32   ss      5   3.396 ± 0.708  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed               REQUIRED_ALL   Test30Int32   ss      5   2.362 ± 0.174  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed               OFF            Test30Int32   ss      5   3.250 ± 0.721  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed               REQUIRED_ALL   Test30Int32   ss      5   2.310 ± 0.168  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed               OFF            Test30Int32   ss      5   3.447 ± 0.884  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed               REQUIRED_ALL   Test30Int32   ss      5   2.416 ± 0.387  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP          OFF            Test30Int32   ss      5   3.156 ± 0.276  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP          REQUIRED_ALL   Test30Int32   ss      5   2.514 ± 0.687  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY        OFF            Test30Int32   ss      5   3.398 ± 0.853  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY        REQUIRED_ALL   Test30Int32   ss      5   2.501 ± 0.323  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed  OFF            Test30Int32   ss      5   3.644 ± 3.423  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed  REQUIRED_ALL   Test30Int32   ss      5   2.384 ± 0.203  s/op
```
For 30 strings ("fieldXX:XX"):
```
Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed               OFF            Test30String  ss      5   9.426 ± 3.621  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed               REQUIRED_ALL   Test30String  ss      5   8.257 ± 1.113  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed               OFF            Test30String  ss      5   9.848 ± 1.141  s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed               REQUIRED_ALL   Test30String  ss      5   8.302 ± 1.910  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed               OFF            Test30String  ss      5  10.216 ± 1.843  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed               REQUIRED_ALL   Test30String  ss      5   8.173 ± 1.419  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed               OFF            Test30String  ss      5   9.940 ± 1.680  s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed               REQUIRED_ALL   Test30String  ss      5   8.242 ± 1.270  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP          OFF            Test30String  ss      5   9.833 ± 1.010  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP          REQUIRED_ALL   Test30String  ss      5   8.247 ± 1.284  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY        OFF            Test30String  ss      5   9.638 ± 0.502  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY        REQUIRED_ALL   Test30String  ss      5   7.935 ± 0.889  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed  OFF            Test30String  ss      5   9.968 ± 1.651  s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed  REQUIRED_ALL   Test30String  ss      5   8.356 ± 1.319  s/op
```
For 5-7 fields the gain is negligible.
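As a rough aid for reading the tables above, the sketch below turns the
reported s/op scores into speedup factors (OFF time divided by REQUIRED_ALL
time). It only uses the write1MRowsDefaultBlockAndPageSizeUncompressed rows;
the class and method names are made up for illustration and are not part of
the PR.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of parquet-java or the PR.
class SpeedupSummary {

    // Both inputs are single-shot times in s/op (lower is better),
    // so a result above 1.0 means codegen was faster.
    static double speedup(double offSeconds, double codegenSeconds) {
        return offSeconds / codegenSeconds;
    }

    public static void main(String[] args) {
        // {OFF, REQUIRED_ALL} scores copied from the
        // write1MRowsDefaultBlockAndPageSizeUncompressed rows.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("Test100Int32", new double[] {13.473, 6.006});
        scores.put("Test30Int32", new double[] {3.644, 2.384});
        scores.put("Test30String", new double[] {9.968, 8.356});

        for (Map.Entry<String, double[]> e : scores.entrySet()) {
            double[] s = e.getValue();
            System.out.printf(Locale.ROOT, "%-13s %.2fx%n",
                    e.getKey(), speedup(s[0], s[1]));
        }
    }
}
```

With the scores as reported this comes out to roughly 2.2x, 1.5x and 1.2x
respectively. The error margins are wide in places (e.g. ± 3.423 on the
Test30Int32 OFF row), so these ratios are indicative only.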