ggershinsky commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-836396016
> I wonder if we should make the row group size a parameter for the benchmark too.
> Shall we change the grouping in order to see the trend according to the block size? For example,
>
> ```
> ...
> Without bloom filter, blocksize: 8388608    1005    1011     8    99.5    10.0    1.0X
> Without bloom filter, blocksize: 9437184     992    1002    14   100.8     9.9    1.0X
> With bloom filter, blocksize: 8388608        385     404    20   259.6     3.9    2.6X
> With bloom filter, blocksize: 9437184        521     538    16   191.9     5.2    1.9X
> ...
> ```
+1 to these proposals. Bloom filtering is done per row group (per column chunk), so I guess having smaller row groups makes it more effective. The default parquet.block.size is 128MB. It would be interesting to see a table with block size values ranging from 1MB to 128MB (or maybe higher), e.g. in powers of 2: 1, 2, 4, 8, 16, 32, 64, 128 MB.
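
A rough sketch of such a sweep is below. This is not the benchmark in this PR; the object name, output paths, the `id` column, the 100M-row data set, and the lookup value are all placeholders. It assumes the parquet-mr 1.12 writer properties `parquet.block.size`, `parquet.bloom.filter.enabled#<column>`, and `parquet.bloom.filter.expected.ndv#<column>`, passed through DataFrameWriter options into the Hadoop configuration:

```scala
import org.apache.spark.sql.SparkSession

object BloomFilterBlockSizeSweep {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetBloomFilterBlockSizeSweep")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // ~100M distinct values; adjust to taste.
    val df = spark.range(0L, 100L * 1000 * 1000).toDF("id")

    // Row group sizes from 1 MB to 128 MB in powers of two.
    val blockSizes = (0 to 7).map(i => (1L << i) * 1024 * 1024)

    for (blockSize <- blockSizes; bloom <- Seq(false, true)) {
      val path = s"/tmp/parquet_bloom_${bloom}_blocksize_${blockSize}"  // placeholder path
      df.write
        .option("parquet.block.size", blockSize.toString)            // row group size
        .option("parquet.bloom.filter.enabled#id", bloom.toString)   // per-column bloom filter
        .option("parquet.bloom.filter.expected.ndv#id", "100000000") // expected distinct values
        .mode("overwrite")
        .parquet(path)

      // Point lookup that can match at most one row group; with bloom filters
      // and filter pushdown the reader should be able to skip the rest.
      val start = System.nanoTime()
      val n = spark.read.parquet(path).where($"id" === 7654321L).count()
      val elapsedMs = (System.nanoTime() - start) / 1e6
      println(f"bloom=$bloom%-5s blocksize=$blockSize%9d rows=$n%d time=$elapsedMs%.1f ms")
    }

    spark.stop()
  }
}
```

This just times a single run per (block size, bloom filter) combination; the benchmark in this PR would presumably keep using the existing Benchmark harness and add the block size as an extra dimension of the output table, as quoted above.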
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]