adriangb commented on PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#issuecomment-4158477486

   > Hey @adriangb, cool idea. What motivated this if you don't mind me asking? 
Are any other Parquet implementations doing this?
   
   My motivation was that, looking at our data, this is a consistent problem: we
have high-cardinality data (trace ids) whose bloom filters saturate (becoming
useless) when packed into 1M-row row groups, yet also waste a ton of space in
small files. While looking for a solution I came across this neat trick.
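   To make the saturation point concrete (my own illustrative numbers, not from the PR): the expected false-positive probability of a Bloom filter with `m` bits, `n` inserted items, and `k` hash functions is roughly `(1 - e^(-kn/m))^k`. A filter sized for one cardinality becomes useless when the actual number of distinct values is an order of magnitude larger:

   ```python
   import math

   def bloom_fpp(m_bits: int, n_items: int, k_hashes: int) -> float:
       """Expected false-positive probability of a Bloom filter with
       m_bits bits, n_items inserted items, and k_hashes hash functions."""
       return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

   # Hypothetical sizing: a filter built for ~100k distinct values at ~1% FPP
   # (about 9.6 bits per value, k = 7 hash functions).
   m = 958_506  # bits
   k = 7

   # As designed: ~1% false positives.
   print(f"100k distinct values: fpp = {bloom_fpp(m, 100_000, k):.4f}")

   # Same filter fed a 1M-row row group of nearly all-distinct trace ids:
   # the filter saturates and matches almost everything.
   print(f"1M distinct values:   fpp = {bloom_fpp(m, 1_000_000, k):.4f}")
   ```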
   
   I don't know if other Parquet implementations use this, but TimescaleDB does 
(linked above).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
