tustvold commented on issue #171: URL: https://github.com/apache/arrow-rs/issues/171#issuecomment-991960606
> my understanding is that the reason usually is that the dictionary grew too big

This is my understanding also; with the current defaults, "too big" would be a dictionary larger than 1MB.

> And to have to reconstruct a big dictionary from plain-encoded parquet data sounds expensive

I don't disagree with this, it would represent a somewhat degenerate edge case. As currently articulated, we will only attempt to preserve the dictionary if the arrow schema encoded within the parquet file is for a dictionary type. Specifically, this means the data that was written was already contained in arrow dictionary arrays. In order to yield `RecordBatch`es with the same schema as was written, we would therefore need to construct one or more dictionaries anyway. This ticket, or at least my interpretation of it, is an optimisation to avoid computing new dictionaries where the parquet dictionary can be reused, and, where it cannot, to avoid fully hydrating the values first as the logic currently does.

> is this something that has to be abstracted / hidden?

In my opinion, yes. If I write a dictionary array, I expect to get one back. If for some crazy reason I write a column with a dictionary that exceeds the capabilities of the parquet format to store natively, I would expect that to be abstracted away. I do not think a performance and compression penalty for over-sized dictionaries is unreasonable?

_As an aside, the maximum dictionary size is configurable (see the write-side sketch at the end of this comment), although I'm not really sure what the implications of increasing it are._

> we can see that it doesn't care for the number of rows in the record batches as long as the record batch iterator doesn't return None.

Indeed, `ArrowReader`, which `ParquetRecordBatchReader` implements, does not have the issues that `ArrayReader` does, and this is why I intend to put the delimiting logic here. By contrast, `ArrayReader` has an ambiguous termination condition, as it isn't an `Iterator`, and it needs to yield predictable row counts so that readers can be composed together for nested types.

> how would the user know that they have to make changes to batch_size depending on the use of dictionary encoding in their parquet files (so that record batches do not span row groups)

My proposal is for this to be an internal optimisation within `ParquetRecordBatchReader`, not something exposed to users. They would just use `ArrowReader` as normal and receive an iterator of `RecordBatch` (roughly as in the read-side sketch at the end of this comment); internally this would drive the various `ArrayReader`s in such a way as to avoid spanning row groups.
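For reference, a minimal sketch of what raising that limit could look like on the write side, assuming the `WriterProperties` builder in the `parquet` crate (the 2MB value is an arbitrary example, not a recommendation, and method names may differ between versions):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Raise the dictionary page size limit above the 1MB default.
    // 2MB is an arbitrary example value chosen for illustration.
    let props = WriterProperties::builder()
        .set_dictionary_enabled(true)
        .set_dictionary_pagesize_limit(2 * 1024 * 1024)
        .build();

    // `props` would then be passed to the writer, e.g. ArrowWriter::try_new(..., Some(props))
    let _ = props;
}
```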
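And to make the last point concrete, a sketch of the consumer-facing usage I have in mind, based on the existing `ArrowReader` API (the file path and batch size below are placeholders):

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder for a file whose embedded arrow schema
    // contains dictionary columns
    let reader = SerializedFileReader::new(File::open("data.parquet")?)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));

    // The caller just asks for an iterator of RecordBatch with a batch size;
    // how batches are delimited against row groups, and whether the parquet
    // dictionary is reused, stays internal to ParquetRecordBatchReader
    for maybe_batch in arrow_reader.get_record_reader(1024)? {
        let batch = maybe_batch?;
        println!("{} rows, schema {:?}", batch.num_rows(), batch.schema());
    }
    Ok(())
}
```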
