tustvold commented on issue #171: URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-991960606
> my understanding is that the reason usually is that the dictionary grew too big

This is my understanding also; with the current defaults, "too big" would be a dictionary larger than 1MB.

> And to have to reconstruct a big dictionary from plain-encoded parquet data sounds expensive

I don't disagree with this, it would represent a somewhat degenerate edge case. As currently articulated, we will only attempt to preserve the dictionary if the arrow schema encoded within the parquet file is for a dictionary type. Specifically, this means the data that was written was already contained in arrow dictionary arrays. In order to yield `RecordBatch`es with the same schema as was written, we would therefore need to construct one or more dictionaries anyway. This ticket, or at least my interpretation of it, is an optimisation to avoid computing new dictionaries where the parquet dictionary can be reused, and, where it cannot, to avoid fully hydrating the values first as the logic currently does.

> is this something that has to be abstracted / hidden?

In my opinion, yes. If I write a dictionary array, I expect to get one back. If for some crazy reason I write a column with a dictionary that exceeds the capabilities of the parquet format to store natively, I would expect that to be abstracted away. I do not think a performance and compression penalty for over-sized dictionaries is unreasonable?

_As an aside, the maximum dictionary size is configurable (see the write-side sketch at the end of this comment), although I'm not really sure what the implications of increasing it are._

> we can see that it doesn't care for the number of rows in the record batches as long as the record batch iterator doesn't return None.

Indeed, `ArrowReader`, which `ParquetRecordBatchReader` implements, does not have the issues that `ArrayReader` does, and this is why I intend to put the delimiting logic here. By contrast, `ArrayReader` has an ambiguous termination condition, as it isn't an `Iterator`, and it needs to yield predictable row counts so that readers can be composed together for nested types.

> how would the user know that they have to make changes to batch_size depending on the use of dictionary encoding in their parquet files (so that record batches do not span row groups)

My proposal is for this to be an internal optimisation within `ParquetRecordBatchReader`, not something exposed to users. They would just use `ArrowReader` as normal and receive an iterator of `RecordBatch` (roughly as in the read-side sketch at the end of this comment); internally this would drive the various `ArrayReader`s in such a way as to avoid spanning row groups.
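For reference, a minimal sketch of what raising that limit could look like on the write side, assuming the `WriterProperties` builder in the `parquet` crate (the 2MB value is an arbitrary example, not a recommendation, and method names may differ between versions):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Raise the dictionary page size limit above the 1MB default.
    // 2MB is an arbitrary example value chosen for illustration.
    let props = WriterProperties::builder()
        .set_dictionary_enabled(true)
        .set_dictionary_pagesize_limit(2 * 1024 * 1024)
        .build();

    // `props` would then be passed to the writer, e.g. ArrowWriter::try_new(..., Some(props))
    let _ = props;
}
```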
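And to make the last point concrete, a sketch of the consumer-facing usage I have in mind, based on the existing `ArrowReader` API (the file path and batch size below are placeholders):

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder for a file whose embedded arrow schema
    // contains dictionary columns
    let reader = SerializedFileReader::new(File::open("data.parquet")?)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));

    // The caller just asks for an iterator of RecordBatch with a batch size;
    // how batches are delimited against row groups, and whether the parquet
    // dictionary is reused, stays internal to ParquetRecordBatchReader
    for maybe_batch in arrow_reader.get_record_reader(1024)? {
        let batch = maybe_batch?;
        println!("{} rows, schema {:?}", batch.num_rows(), batch.schema());
    }
    Ok(())
}
```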
