autumnust commented on pull request #651: URL: https://github.com/apache/orc/pull/651#issuecomment-833094987
I also tried making the initial dictionary size a parameter in the benchmark:

```
Benchmark                     (dictImpl)  (initSize)  (upperBound)  Mode  Cnt      Score       Error  Units
ORCWriterBenchMark.dictBench      RBTREE        4096         10000  avgt    5  40363.137 ± 28516.301  us/op
ORCWriterBenchMark.dictBench      RBTREE        4096          2500  avgt    5  24392.270 ±  8096.477  us/op
ORCWriterBenchMark.dictBench      RBTREE        4096           500  avgt    5  20316.419 ±  2917.176  us/op
ORCWriterBenchMark.dictBench      RBTREE        8192         10000  avgt    5  33112.917 ±  4040.524  us/op
ORCWriterBenchMark.dictBench      RBTREE        8192          2500  avgt    5  27269.018 ±  1593.083  us/op
ORCWriterBenchMark.dictBench      RBTREE        8192           500  avgt    5  21326.469 ±  1373.140  us/op
ORCWriterBenchMark.dictBench      RBTREE       10240         10000  avgt    5  33538.635 ±  5781.321  us/op
ORCWriterBenchMark.dictBench      RBTREE       10240          2500  avgt    5  27548.238 ±  3307.047  us/op
ORCWriterBenchMark.dictBench      RBTREE       10240           500  avgt    5  21053.961 ±  3118.891  us/op
**ORCWriterBenchMark.dictBench        HASH        4096         10000  avgt    5  19614.326 ±  1304.425  us/op**
ORCWriterBenchMark.dictBench        HASH        4096          2500  avgt    5  11529.653 ±  1851.182  us/op
ORCWriterBenchMark.dictBench        HASH        4096           500  avgt    5   9108.816 ±   837.123  us/op
**ORCWriterBenchMark.dictBench        HASH        8192         10000  avgt    5  15611.967 ±   583.613  us/op**
ORCWriterBenchMark.dictBench        HASH        8192          2500  avgt    5  13396.318 ±  3114.460  us/op
ORCWriterBenchMark.dictBench        HASH        8192           500  avgt    5  10742.425 ±  1031.070  us/op
**ORCWriterBenchMark.dictBench        HASH       10240         10000  avgt    5  17044.101 ±  1671.182  us/op**
ORCWriterBenchMark.dictBench        HASH       10240          2500  avgt    5  13767.572 ±   196.728  us/op
ORCWriterBenchMark.dictBench        HASH       10240           500  avgt    5  11120.604 ±   305.075  us/op
ORCWriterBenchMark.dictBench        NONE        4096         10000  avgt    5   4327.766 ±  1747.765  us/op
ORCWriterBenchMark.dictBench        NONE        4096          2500  avgt    5   4390.480 ±  2545.236  us/op
ORCWriterBenchMark.dictBench        NONE        4096           500  avgt    5  12315.912 ±  8071.684  us/op
ORCWriterBenchMark.dictBench        NONE        8192         10000  avgt    5   5529.683 ±  4187.802  us/op
ORCWriterBenchMark.dictBench        NONE        8192          2500  avgt    5   5461.490 ±   914.698  us/op
ORCWriterBenchMark.dictBench        NONE        8192           500  avgt    5   4745.401 ±  1097.454  us/op
ORCWriterBenchMark.dictBench        NONE       10240         10000  avgt    5   4734.983 ±   257.299  us/op
ORCWriterBenchMark.dictBench        NONE       10240          2500  avgt    5   4776.043 ±   690.286  us/op
ORCWriterBenchMark.dictBench        NONE       10240           500  avgt    5   4750.625 ±   440.191  us/op
```

Clearly, a larger initial size does not always give better performance (considering the cost of resizing and the metadata overhead that a larger size brings). Again, a good size depends on the typical record size seen by the writer (which, together with the stripe size, determines the number of entries within a stripe).

From the benchmark above, enlarging the initial size appears to help only when collisions are common (e.g. in the `HASH` case with the larger upper bound of 10k), and even there the benefit isn't consistent across different values of `upperBound`.

I am therefore exposing `initSize` as a config in `OrcConf` and keeping the original value as the default.
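
To make the resizing trade-off concrete, here is a minimal sketch. It assumes a dictionary whose capacity doubles whenever the number of distinct entries exceeds it; that growth policy and the helper names `estimateEntriesPerStripe` and `resizesNeeded` are assumptions for illustration only, not the actual ORC implementation.

```java
// Illustrative sketch only: the doubling policy and both helpers are
// hypothetical and do not mirror ORC's dictionary code.
class InitSizeSketch {

    // Rough number of records that fit in one stripe, given an average record size.
    static long estimateEntriesPerStripe(long stripeSizeBytes, long avgRecordBytes) {
        if (avgRecordBytes <= 0) {
            throw new IllegalArgumentException("avgRecordBytes must be positive");
        }
        return stripeSizeBytes / avgRecordBytes;
    }

    // Number of capacity doublings needed, starting from initSize,
    // until the capacity can hold distinctEntries.
    static int resizesNeeded(int initSize, long distinctEntries) {
        if (initSize <= 0) {
            throw new IllegalArgumentException("initSize must be positive");
        }
        long capacity = initSize;
        int resizes = 0;
        while (capacity < distinctEntries) {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        // Same initSize values as the benchmark, with upperBound = 10000 distinct keys.
        for (int initSize : new int[] {4096, 8192, 10240}) {
            System.out.println("initSize=" + initSize
                    + " resizes=" + resizesNeeded(initSize, 10_000));
        }
    }
}
```

Under this assumed policy, 10,000 distinct keys need two doublings from 4096, one from 8192, and none from 10240, while 500 or 2,500 keys never trigger a resize at any of the three sizes. That matches the general shape of the argument above: a larger initial size can only pay off when the number of distinct entries per stripe actually exceeds the smaller one.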
