ggershinsky commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836396016
> I wonder if we should make the row group size a parameter for the benchmark too.
> Shall we change the grouping in order to see the trend according to the block size? For example,
>
> ```
> ...
> Without bloom filter, blocksize: 8388608    1005    1011     8    99.5    10.0    1.0X
> Without bloom filter, blocksize: 9437184     992    1002    14   100.8     9.9    1.0X
> With bloom filter, blocksize: 8388608        385     404    20   259.6     3.9    2.6X
> With bloom filter, blocksize: 9437184        521     538    16   191.9     5.2    1.9X
> ...
> ```
+1 to these proposals. Bloom filtering is done per row group (per column chunk), so I guess having smaller row groups makes it more effective. The default parquet.block.size is 128MB. It would be interesting to see a table with block size values ranging from 1MB to 128MB (or maybe higher), e.g. in powers of 2: 1, 2, 4, 8, 16, 32, 64, 128 MB.
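
A rough sketch of such a sweep is below. This is not the benchmark in this PR; the object name, output paths, the `id` column, the 100M-row data set, and the lookup value are all placeholders. It assumes the parquet-mr 1.12 writer properties `parquet.block.size`, `parquet.bloom.filter.enabled#<column>`, and `parquet.bloom.filter.expected.ndv#<column>`, passed through DataFrameWriter options into the Hadoop configuration:

```scala
import org.apache.spark.sql.SparkSession

object BloomFilterBlockSizeSweep {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetBloomFilterBlockSizeSweep")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // ~100M distinct values; adjust to taste.
    val df = spark.range(0L, 100L * 1000 * 1000).toDF("id")

    // Row group sizes from 1 MB to 128 MB in powers of two.
    val blockSizes = (0 to 7).map(i => (1L << i) * 1024 * 1024)

    for (blockSize <- blockSizes; bloom <- Seq(false, true)) {
      val path = s"/tmp/parquet_bloom_${bloom}_blocksize_${blockSize}"  // placeholder path
      df.write
        .option("parquet.block.size", blockSize.toString)            // row group size
        .option("parquet.bloom.filter.enabled#id", bloom.toString)   // per-column bloom filter
        .option("parquet.bloom.filter.expected.ndv#id", "100000000") // expected distinct values
        .mode("overwrite")
        .parquet(path)

      // Point lookup that can match at most one row group; with bloom filters
      // and filter pushdown the reader should be able to skip the rest.
      val start = System.nanoTime()
      val n = spark.read.parquet(path).where($"id" === 7654321L).count()
      val elapsedMs = (System.nanoTime() - start) / 1e6
      println(f"bloom=$bloom%-5s blocksize=$blockSize%9d rows=$n%d time=$elapsedMs%.1f ms")
    }

    spark.stop()
  }
}
```

This just times a single run per (block size, bloom filter) combination; the benchmark in this PR would presumably keep using the existing Benchmark harness and add the block size as an extra dimension of the output table, as quoted above.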
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]