yordan-pavlov commented on issue #171:
URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-988336528


   @tustvold thank you for looking into this, and for the excellent summary of 
the parquet reader stack.
   
   The main reason for the stalling of work on the `ArrowArrayReader` is that a 
big change happened in my personal life - I had a baby born, and as much as I 
would like to spend more time on this project I have much less free time now. I 
hope that in a few months, I will have some more free time and will be able to 
contribute again. The other reason is that although I was able to make the new 
`ArrowArrayReader` several times faster for string arrays, and this appears 
to bring some nice performance improvements in total execution time (the old 
`ArrayReader` is slow for string arrays), I was struggling to make it faster in 
all cases for primitive arrays. I had some ideas (e.g. make the column chunk 
context a self referential struct so that a dictionary could be built more 
efficiently from the page buffer by avoiding unnecessary memory copies) but the 
baby came before I could finish that.
   
   Here are my thoughts on preserving dictionary arrays:
   * performance as a result of dictionary array preservation depends very much 
on upstream processing (e.g. can filter methods be implemented that can benefit 
from a dictionary array by e.g. making better use of SIMD, how much of the 
larger query can be processed before unpacking the dictionary) - I tried to do 
some synthetic performance tests to measure the impact of unpacking the 
dictionary at different stages of query processing (including filter operators 
that can make use of the dictionary), but couldn't see (as far as I can 
remember) the performance improvements I was expecting; maybe my setup was 
flawed, and results might be different with actual code
   * I wonder if either (or both) of the two proposed new config values 
`delimit_row_groups` and `preserve_dictionaries` can be enabled / disabled 
automatically (e.g. based on query, data source, etc.) so that most of the time 
they don't need to be changed; my thinking is the default configuration / 
implementation should work best in most cases and settings should only have to 
be changed very rarely, under very specific circumstances and by someone who 
knows very well what they are doing
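
   To make the first point concrete, here is a minimal sketch of a filter that 
benefits from a preserved dictionary. It uses only the Rust standard library 
and does not use the arrow-rs API: `DictColumn`, `filter_dict` and 
`filter_unpacked` are hypothetical names invented for this illustration. The 
idea is that the predicate is evaluated once per distinct dictionary value 
rather than once per row, and rows are then selected by key lookup.

```rust
/// Hypothetical dictionary-encoded string column (not an arrow-rs type):
/// each row stores a key that indexes into a table of distinct values.
struct DictColumn {
    values: Vec<String>, // the dictionary: distinct values
    keys: Vec<u32>,      // one key per row, indexing into `values`
}

/// Dictionary-aware filter: evaluate the predicate once per distinct value,
/// then select rows by looking up each key. The dictionary is never unpacked.
fn filter_dict(col: &DictColumn, pred: impl Fn(&str) -> bool) -> Vec<usize> {
    let mut matches = Vec::with_capacity(col.values.len());
    for value in &col.values {
        matches.push(pred(value.as_str()));
    }
    let mut selected = Vec::new();
    for (row, key) in col.keys.iter().enumerate() {
        if matches[*key as usize] {
            selected.push(row);
        }
    }
    selected
}

/// Baseline for comparison: unpack the dictionary into one String per row
/// first, then evaluate the predicate on every row.
fn filter_unpacked(col: &DictColumn, pred: impl Fn(&str) -> bool) -> Vec<usize> {
    let mut unpacked = Vec::with_capacity(col.keys.len());
    for key in &col.keys {
        unpacked.push(col.values[*key as usize].clone());
    }
    let mut selected = Vec::new();
    for (row, value) in unpacked.iter().enumerate() {
        if pred(value.as_str()) {
            selected.push(row);
        }
    }
    selected
}
```

   Both functions return the same row indices; the dictionary-aware version 
does fewer string comparisons when there are many rows and few distinct 
values. Whether that shows up as a measurable gain in a full query is exactly 
the open question described above.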


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

