pgaref commented on pull request #651: URL: https://github.com/apache/orc/pull/651#issuecomment-828353933
> Here is the new jmh result:
>
> ```
> Benchmark                                    (dictImpl)  (upperBound)  Mode  Cnt       Score      Error  Units
> ORCWriterBenchMark.dictBench                     RBTREE         10000  avgt    5   28939.068 ± 3080.947  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE         10000  avgt    5      49.963                 #
> ORCWriterBenchMark.dictBench:ops                 RBTREE         10000  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord           RBTREE         10000  avgt    5       0.883 ±    0.094  us/op
> ORCWriterBenchMark.dictBench:records             RBTREE         10000  avgt    5  163840.000                 #
> ORCWriterBenchMark.dictBench                     RBTREE          2500  avgt    5   21998.781 ± 1448.300  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE          2500  avgt    5      23.532                 #
> ORCWriterBenchMark.dictBench:ops                 RBTREE          2500  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord           RBTREE          2500  avgt    5       0.671 ±    0.044  us/op
> ORCWriterBenchMark.dictBench:records             RBTREE          2500  avgt    5  163840.000                 #
> ORCWriterBenchMark.dictBench                     RBTREE           500  avgt    5   17730.281 ± 4574.132  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE           500  avgt    5      13.156                 #
> ORCWriterBenchMark.dictBench:ops                 RBTREE           500  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord           RBTREE           500  avgt    5       0.541 ±    0.140  us/op
> ORCWriterBenchMark.dictBench:records             RBTREE           500  avgt    5  163840.000                 #
> ORCWriterBenchMark.dictBench                       HASH         10000  avgt    5   21269.613 ± 4137.763  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord        HASH         10000  avgt    5      42.268                 #
> ORCWriterBenchMark.dictBench:ops                   HASH         10000  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord             HASH         10000  avgt    5       0.649 ±    0.126  us/op
> ORCWriterBenchMark.dictBench:records               HASH         10000  avgt    5  163840.000                 #
> ORCWriterBenchMark.dictBench                       HASH          2500  avgt    5   11586.898 ± 4075.783  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord        HASH          2500  avgt    5      17.692                 #
> ORCWriterBenchMark.dictBench:ops                   HASH          2500  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord             HASH          2500  avgt    5       0.354 ±    0.124  us/op
> ORCWriterBenchMark.dictBench:records               HASH          2500  avgt    5  163840.000                 #
> ORCWriterBenchMark.dictBench                       HASH           500  avgt    5    9646.080 ± 2279.530  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord        HASH           500  avgt    5      11.613                 #
> ORCWriterBenchMark.dictBench:ops                   HASH           500  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord             HASH           500  avgt    5       0.294 ±    0.070  us/op
> ORCWriterBenchMark.dictBench:records               HASH           500  avgt    5  163840.000                 #
> ORCWriterBenchMark.dictBench                       NONE         10000  avgt    5    4077.675 ±  117.606  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord        NONE         10000  avgt    5      50.146                 #
> ORCWriterBenchMark.dictBench:ops                   NONE         10000  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord             NONE         10000  avgt    5       0.124 ±    0.004  us/op
> ORCWriterBenchMark.dictBench:records               NONE         10000  avgt    5  163840.000                 #
> ORCWriterBenchMark.dictBench                       NONE          2500  avgt    5    4607.634 ± 1163.084  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord        NONE          2500  avgt    5      50.146                 #
> ORCWriterBenchMark.dictBench:ops                   NONE          2500  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord             NONE          2500  avgt    5       0.141 ±    0.035  us/op
> ORCWriterBenchMark.dictBench:records               NONE          2500  avgt    5  163840.000                 #
> ORCWriterBenchMark.dictBench                       NONE           500  avgt    5    3783.059 ±  367.511  us/op
> ORCWriterBenchMark.dictBench:bytesPerRecord        NONE           500  avgt    5      50.146                 #
> ORCWriterBenchMark.dictBench:ops                   NONE           500  avgt    5         ≈ 0                 #
> ORCWriterBenchMark.dictBench:perRecord             NONE           500  avgt    5       0.115 ±    0.011  us/op
> ORCWriterBenchMark.dictBench:records               NONE           500  avgt    5  163840.000                 #
> ```
>
> Unfortunately the previous implementation had a bug which ended up with great locality (but was incorrect). HASH is still much better than RB-Tree, but obviously we need to iterate further to improve it.

Sounds like this can be improved by reducing collisions, right?
Since we have an upper bound on the number of entries per batch, would it make sense to experiment a bit more with hash table sizes to reduce collisions as much as possible?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
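
To make the sizing question above concrete, here is a minimal sketch of the kind of experiment I have in mind. It is not the dictionary code from this PR: `HashSizingSketch`, `capacityFor`, `countCollisions`, the 0.75 load factor, the fixed 4096-bucket baseline, and the random integer keys are all assumptions for illustration. It picks a power-of-two capacity from the known upper bound on entries and counts how many keys land in an already occupied bucket.

```java
import java.util.Random;

public class HashSizingSketch {

    // Smallest power-of-two capacity that keeps upperBound entries
    // at or below the given load factor (assumed policy, not ORC's).
    static int capacityFor(int upperBound, double loadFactor) {
        int needed = (int) Math.ceil(upperBound / loadFactor);
        int capacity = 1;
        while (capacity < needed) {
            capacity <<= 1;
        }
        return capacity;
    }

    // Mixes the high bits into the low bits so that masking with
    // (capacity - 1) does not ignore the upper half of the hash.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Counts keys that hash to a bucket an earlier key already occupies.
    // capacity must be a power of two.
    static int countCollisions(int[] keys, int capacity) {
        boolean[] occupied = new boolean[capacity];
        int collisions = 0;
        for (int key : keys) {
            int bucket = spread(key) & (capacity - 1);
            if (occupied[bucket]) {
                collisions++;
            } else {
                occupied[bucket] = true;
            }
        }
        return collisions;
    }

    // Hypothetical stand-in for the hash codes of n distinct dictionary keys.
    static int[] distinctKeys(int n, long seed) {
        return new Random(seed).ints().distinct().limit(n).toArray();
    }

    public static void main(String[] args) {
        int fixedCapacity = 4096; // assumed fixed table size, for comparison only
        for (int upperBound : new int[] {500, 2500, 10000}) {
            int[] keys = distinctKeys(upperBound, 42L);
            int sized = capacityFor(upperBound, 0.75);
            System.out.printf(
                "upperBound=%d: fixed(%d) -> %d collisions, sized(%d) -> %d collisions%n",
                upperBound, fixedCapacity, countCollisions(keys, fixedCapacity),
                sized, countCollisions(keys, sized));
        }
    }
}
```

The trade-off is memory: a larger table reduces collisions but costs more per writer, so it would be worth re-running the jmh benchmark with a few capacities rather than relying on a count like this alone.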
