[GitHub] [arrow-rs] tustvold edited a comment on pull request #1180: Preserve dictionary encoding when decoding parquet into Arrow arrays, 60x perf improvement (#171)

GitBox Fri, 21 Jan 2022 13:05:15 -0800


tustvold edited a comment on pull request #1180:
URL: https://github.com/apache/arrow-rs/pull/1180#issuecomment-1018838480



   > next logical question is what's the impact to end-to-end performance of 
queries in Data Fusion
   
   This is highly context dependent. Queries that were previously bottlenecked 
by parquet will of course see improvements. A simple table scan with no 
predicates, for example, should see most of the raw 60x performance uplift. 
This sort of "query" shows up in IOx when compacting or loading data from 
object storage into an in-memory cache.
   
   The story for more complex queries is more of a work in progress. Currently 
DataFusion's handling of dictionary-encoded arrays isn't brilliant: it often 
fully materializes dictionaries when it shouldn't need to. 
https://github.com/apache/arrow-datafusion/pull/1475 and 
https://github.com/apache/arrow-datafusion/issues/1610 track the process of 
switching DataFusion to delegate comparator selection to arrow-rs, which should 
help to alleviate this.
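   
   To illustrate what "fully materializing" costs, here is a rough sketch 
using a toy column type. `DictColumn`, `eq_materialized` and `eq_dictionary` 
are hypothetical names made up for this example; they are not the arrow-rs or 
DataFusion APIs, and the real kernels are more involved. The idea is only that 
a dictionary-aware comparison touches each distinct value once, rather than 
expanding every row first.

```rust
/// Toy stand-in for a dictionary-encoded string column.
/// This is NOT the arrow-rs `DictionaryArray`; it is an assumption for illustration.
struct DictColumn {
    /// The distinct values (the "dictionary").
    values: Vec<String>,
    /// One index into `values` per row.
    keys: Vec<usize>,
}

/// Materializing approach: expand every row to its string, then compare row by row.
fn eq_materialized(col: &DictColumn, needle: &str) -> Vec<bool> {
    let rows: Vec<&str> = col.keys.iter().map(|&k| col.values[k].as_str()).collect();
    rows.iter().map(|&s| s == needle).collect()
}

/// Dictionary-aware approach: compare each distinct value once,
/// then map every row's key to the precomputed result.
fn eq_dictionary(col: &DictColumn, needle: &str) -> Vec<bool> {
    let hits: Vec<bool> = col.values.iter().map(|v| v.as_str() == needle).collect();
    col.keys.iter().map(|&k| hits[k]).collect()
}
```

   Both functions return the same mask; the second does one string comparison 
per distinct value instead of one per row, which is where low-cardinality 
columns would be expected to benefit.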
   
   TL;DR: at this point in time I'm focused on getting arrow-rs to a good 
place, with the necessary base primitives, and then I'll turn my attention to 
what DataFusion is doing with them. Who knows, someone else may even get there 
first :grin: 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
