Re: [PR] Write Bloom filters between row groups instead of the end [arrow-rs]

via GitHub Mon, 10 Jun 2024 13:05:45 -0700


alamb commented on PR #5860:
URL: https://github.com/apache/arrow-rs/pull/5860#issuecomment-2159190734


   Thank you @progval 
   
   cc @Ted-Jiang  and @jimexist
   
   I think there is a tradeoff:
   * Writing all the bloom filters at the end requires them to be buffered 
in memory until the file is finished (which you point out)
   * Writing all the bloom filters at the end means they are contiguous, so 
the reader can fetch multiple bloom filters in a single IO (which is 
important when reading from something like `S3`)
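
   To make the tradeoff concrete, here is a toy sketch. It is only a model of
   file layout, not the parquet writer: `BloomFilterPosition` and the helper
   functions are hypothetical names, and it assumes fixed-size row groups and
   one fixed-size filter per row group.

```rust
/// Hypothetical config option (an assumption for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BloomFilterPosition {
    /// Flush each row group's filter right after that row group.
    AfterRowGroup,
    /// Buffer every filter and write them together after the last row group.
    End,
}

/// (offset, length) of each filter in a toy file with `row_groups` row groups
/// of `rg_len` bytes and one `bf_len`-byte filter per row group.
pub fn bloom_filter_ranges(
    pos: BloomFilterPosition,
    row_groups: u64,
    rg_len: u64,
    bf_len: u64,
) -> Vec<(u64, u64)> {
    (0..row_groups)
        .map(|i| match pos {
            // Filter i follows row groups 0..=i and filters 0..i.
            BloomFilterPosition::AfterRowGroup => ((i + 1) * rg_len + i * bf_len, bf_len),
            // All filters sit back to back after the last row group.
            BloomFilterPosition::End => (row_groups * rg_len + i * bf_len, bf_len),
        })
        .collect()
}

/// Filter bytes the writer holds at once (assumes at least one row group).
pub fn peak_buffered_bytes(pos: BloomFilterPosition, row_groups: u64, bf_len: u64) -> u64 {
    match pos {
        BloomFilterPosition::AfterRowGroup => bf_len,
        BloomFilterPosition::End => row_groups * bf_len,
    }
}

/// Number of read requests needed when touching or overlapping ranges are
/// merged into a single request.
pub fn coalesced_reads(ranges: &[(u64, u64)]) -> usize {
    let mut sorted = ranges.to_vec();
    sorted.sort();
    let mut reads = 0;
    let mut end: Option<u64> = None;
    for (off, len) in sorted {
        match end {
            // Range starts at or before the current end: same request.
            Some(e) if off <= e => {}
            // Gap before this range: a new request is needed.
            _ => reads += 1,
        }
        end = Some(end.map_or(off + len, |e| e.max(off + len)));
    }
    reads
}
```

   In this model, 3 row groups of 1000 bytes with 100-byte filters need 1 read
   and 300 buffered bytes for `End`, versus 3 reads and 100 buffered bytes for
   `AfterRowGroup`.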
   
   Given this tradeoff, it seems like we should at least offer a config 
setting for where to write the bloom filters. 
   
   I don't know if the parquet bloom filter spec dictates where the bloom 
filters should be written, or if the ecosystem (aka parquet-java) implicitly 
requires them in a particular location.


