[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #1652: ARROW2: Performance benchmark

GitBox Thu, 27 Jan 2022 13:37:31 -0800


jorgecarleitao commented on issue #1652:
URL: 
https://github.com/apache/arrow-datafusion/issues/1652#issuecomment-1023661427



   * arrow-rs: row group filter pushdown
   * arrow2: row group filter pushdown, page filter pushdown
   
   AFAIK both support reading dictionary-encoded arrays from, and writing them 
to, dictionary-encoded column chunks (but neither currently supports pushdown 
based on dictionary values).
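
To make the pushdown levels listed above concrete, here is a minimal sketch of statistics-based pruning. It is not the arrow-rs or arrow2 API: `MinMax`, `GreaterThan` and `select_chunks` are hypothetical names, and the only assumption is that each row group (or page) may carry min/max statistics for the filtered column. The same check applies at either granularity; page-level pushdown just runs it on smaller chunks.

```rust
/// Hypothetical min/max statistics for one column within a chunk,
/// where a "chunk" is either a row group or a page.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Illustrative predicate of the form `column > threshold`.
#[derive(Debug, Clone, Copy)]
pub struct GreaterThan {
    pub threshold: i64,
}

impl GreaterThan {
    /// Returns `false` only when the statistics prove that no value in the
    /// chunk can satisfy the predicate.
    pub fn may_match(&self, stats: &MinMax) -> bool {
        stats.max > self.threshold
    }
}

/// Returns the indices of the chunks that still have to be read and decoded.
/// Chunks without statistics are always kept, since nothing can be proven.
pub fn select_chunks(stats: &[Option<MinMax>], predicate: &GreaterThan) -> Vec<usize> {
    let mut selected = Vec::new();
    for (index, chunk_stats) in stats.iter().enumerate() {
        let keep = match chunk_stats {
            Some(s) => predicate.may_match(s),
            None => true,
        };
        if keep {
            selected.push(index);
        }
    }
    selected
}
```

Calling `select_chunks` with the per-row-group statistics skips whole row groups; calling it again with per-page statistics inside the surviving row groups is the extra step that page filter pushdown adds. The surviving rows still need the actual filter applied, because statistics can only rule chunks out.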
   
   TBH, IMO the biggest limiting factor in implementing anything in Parquet is 
its lack of documentation: it takes a lot of work to decipher what the format 
requires, and the situation is not improving. For example, I spent a lot of 
time understanding the encodings and opened [a 
PR](https://github.com/apache/parquet-format/pull/170) to help future 
implementers; it has been lingering for ~9 months now. I wish 
Parquet were a bit more inspired by e.g. Arrow or ORC in this respect.
   
   Note that datafusion's benchmarks only use "required" (non-nullable) data, 
so most optimizations for null values are not visible from datafusion's 
perspective. Last time I benchmarked, [arrow2/parquet2 was much 
faster](https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045)
 on nullable data; I am a bit surprised to see so many differences on 
non-nullable data.
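
As a rough illustration of why the nullable path is where such optimizations live, the sketch below contrasts decoding a required column with decoding a flat optional one. It is a simplification and not how arrow2 or arrow-rs implement it: the function names are hypothetical, values are assumed to be already-decoded `i32`s, and the optional column is assumed to be non-nested, so a definition level of 1 means "present" and 0 means "null". Parquet stores only the non-null values, so the optional path has to interleave them with nulls and build a validity mask.

```rust
/// Required (non-nullable) column: every slot has a value, so decoding
/// is a straight copy and no validity information is needed.
pub fn decode_required(values: &[i32]) -> Vec<i32> {
    values.to_vec()
}

/// Flat optional column: `def_levels` has one entry per row, while
/// `values` holds only the non-null entries. Returns the values padded
/// with a placeholder for nulls plus a validity mask, or `None` if the
/// input runs out of values before the definition levels are exhausted.
pub fn decode_optional(def_levels: &[u8], values: &[i32]) -> Option<(Vec<i32>, Vec<bool>)> {
    let mut out = Vec::with_capacity(def_levels.len());
    let mut validity = Vec::with_capacity(def_levels.len());
    let mut remaining = values.iter();
    for &level in def_levels {
        if level == 1 {
            // Present: consume the next stored value.
            out.push(*remaining.next()?);
            validity.push(true);
        } else {
            // Null: nothing was stored, so emit a placeholder slot.
            out.push(0);
            validity.push(false);
        }
    }
    Some((out, validity))
}
```

A benchmark that only reads required columns exercises `decode_required`-style code exclusively, so any speedup in the per-row branching and mask construction of the optional path cannot show up in its numbers.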


