[GitHub] [orc] autumnust commented on pull request #651: ORC-757: HashTable dictionary

GitBox Wed, 05 May 2021 15:39:51 -0700


autumnust commented on pull request #651:
URL: https://github.com/apache/orc/pull/651#issuecomment-833094987



   Also tried putting the init dict size as a parameter in the benchmark: 
   ```
   Benchmark                     (dictImpl)  (initSize)  (upperBound)  Mode  Cnt      Score       Error  Units
   ORCWriterBenchMark.dictBench      RBTREE        4096         10000  avgt    5  40363.137 ± 28516.301  us/op
   ORCWriterBenchMark.dictBench      RBTREE        4096          2500  avgt    5  24392.270 ±  8096.477  us/op
   ORCWriterBenchMark.dictBench      RBTREE        4096           500  avgt    5  20316.419 ±  2917.176  us/op
   ORCWriterBenchMark.dictBench      RBTREE        8192         10000  avgt    5  33112.917 ±  4040.524  us/op
   ORCWriterBenchMark.dictBench      RBTREE        8192          2500  avgt    5  27269.018 ±  1593.083  us/op
   ORCWriterBenchMark.dictBench      RBTREE        8192           500  avgt    5  21326.469 ±  1373.140  us/op
   ORCWriterBenchMark.dictBench      RBTREE       10240         10000  avgt    5  33538.635 ±  5781.321  us/op
   ORCWriterBenchMark.dictBench      RBTREE       10240          2500  avgt    5  27548.238 ±  3307.047  us/op
   ORCWriterBenchMark.dictBench      RBTREE       10240           500  avgt    5  21053.961 ±  3118.891  us/op
 **ORCWriterBenchMark.dictBench        HASH        4096         10000  avgt    5  19614.326 ±  1304.425  us/op**
   ORCWriterBenchMark.dictBench        HASH        4096          2500  avgt    5  11529.653 ±  1851.182  us/op
   ORCWriterBenchMark.dictBench        HASH        4096           500  avgt    5   9108.816 ±   837.123  us/op
 **ORCWriterBenchMark.dictBench        HASH        8192         10000  avgt    5  15611.967 ±   583.613  us/op**
   ORCWriterBenchMark.dictBench        HASH        8192          2500  avgt    5  13396.318 ±  3114.460  us/op
   ORCWriterBenchMark.dictBench        HASH        8192           500  avgt    5  10742.425 ±  1031.070  us/op
 **ORCWriterBenchMark.dictBench        HASH       10240         10000  avgt    5  17044.101 ±  1671.182  us/op**
   ORCWriterBenchMark.dictBench        HASH       10240          2500  avgt    5  13767.572 ±   196.728  us/op
   ORCWriterBenchMark.dictBench        HASH       10240           500  avgt    5  11120.604 ±   305.075  us/op
   ORCWriterBenchMark.dictBench        NONE        4096         10000  avgt    5   4327.766 ±  1747.765  us/op
   ORCWriterBenchMark.dictBench        NONE        4096          2500  avgt    5   4390.480 ±  2545.236  us/op
   ORCWriterBenchMark.dictBench        NONE        4096           500  avgt    5  12315.912 ±  8071.684  us/op
   ORCWriterBenchMark.dictBench        NONE        8192         10000  avgt    5   5529.683 ±  4187.802  us/op
   ORCWriterBenchMark.dictBench        NONE        8192          2500  avgt    5   5461.490 ±   914.698  us/op
   ORCWriterBenchMark.dictBench        NONE        8192           500  avgt    5   4745.401 ±  1097.454  us/op
   ORCWriterBenchMark.dictBench        NONE       10240         10000  avgt    5   4734.983 ±   257.299  us/op
   ORCWriterBenchMark.dictBench        NONE       10240          2500  avgt    5   4776.043 ±   690.286  us/op
   ORCWriterBenchMark.dictBench        NONE       10240           500  avgt    5   4750.625 ±   440.191  us/op
   ```
   
   Clearly, a larger init size does not always mean better performance, once
   the cost of resizing and the resulting metadata-size overhead are taken into
   account. Again, the right size depends on the typical record size seen by
   the writer (which, together with the stripe size, determines the number of
   entries within a stripe). From the benchmark above, enlarging the init size
   only helps when collisions are common (e.g. the `HASH` case with the larger
   upper bound of 10k), and even then the benefit isn't consistent across
   different values of `upperBound`.
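
To make the comparison concrete, here is a small sketch that recomputes the change in the `HASH` scores relative to the `initSize=4096` baseline for each `upperBound`. The numbers are copied from the table above; the class and method names are only illustrative and are not part of the PR.

```java
public class InitSizeEffect {
    // Percentage change of `score` relative to `baseline`.
    // Negative means faster, since the scores are average times in us/op.
    static double relativeChangePercent(double baseline, double score) {
        return (score - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        int[] upperBounds = {10000, 2500, 500};
        int[] initSizes = {4096, 8192, 10240};
        // HASH scores (us/op) from the benchmark table above.
        // Rows follow upperBounds, columns follow initSizes.
        double[][] hashScores = {
            {19614.326, 15611.967, 17044.101},
            {11529.653, 13396.318, 13767.572},
            { 9108.816, 10742.425, 11120.604},
        };
        for (int i = 0; i < upperBounds.length; i++) {
            double baseline = hashScores[i][0];
            for (int j = 1; j < initSizes.length; j++) {
                double change = relativeChangePercent(baseline, hashScores[i][j]);
                System.out.printf("upperBound=%d initSize=%d: %+.1f%% vs initSize=4096%n",
                        upperBounds[i], initSizes[j], change);
            }
        }
    }
}
```

With these inputs, only the `upperBound=10000` row gets faster with a larger init size (roughly 20% at 8192 and 13% at 10240); the 2500 and 500 rows get roughly 16% to 22% slower. The reported error bars are wide in places, so these ratios are indicative rather than precise.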
   
   I am therefore exposing `initSize` as a config in `OrcConf` and keeping the
   original default.
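
As a rough illustration of what "a config with the original default" means, the sketch below reads an init size from a `java.util.Properties` object and falls back to a default when the key is absent. This is not the `OrcConf` API: the key name is hypothetical, and the default of 4096 is an assumption (it is simply the smallest `initSize` in the benchmark above).

```java
import java.util.Properties;

public class DictInitSizeConf {
    // Hypothetical key name, for illustration only; not taken from OrcConf.
    static final String INIT_SIZE_KEY = "example.dictionary.init.size";
    // Assumed default; the comment above does not state the original value.
    static final int DEFAULT_INIT_SIZE = 4096;

    // Returns the configured init size, or the default when the key is unset.
    static int getInitSize(Properties props) {
        String raw = props.getProperty(INIT_SIZE_KEY);
        if (raw == null) {
            return DEFAULT_INIT_SIZE;
        }
        int value = Integer.parseInt(raw.trim());
        if (value <= 0) {
            throw new IllegalArgumentException(
                    INIT_SIZE_KEY + " must be positive, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(getInitSize(props)); // key unset: falls back to the default
        props.setProperty(INIT_SIZE_KEY, "8192");
        System.out.println(getInitSize(props)); // explicit override
    }
}
```

Keeping the default unchanged means existing writers behave as before, while workloads that resemble the `upperBound=10000` case can opt into a larger table.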


