tustvold commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2429736936
Some points in no particular order:
* The parquet schema is authoritative, and any other schema information is
merely a hint. This makes the notion of using the arrow schema, or something
else, to drive decode a little dubious.
* The record shredding logic for lists is the single most complex, confusing
and subtle aspect of any parquet reader, which:
  * Limits the pool of people who can implement / review such changes
  * Sets a very high bar for including such changes
* Even an optimal record shredding setup will never perform better than an
implementation that can simply skip it entirely
* Both arrow-rs and polars exploit the fact that the hybrid RLE encoding of
definition levels is effectively a bitmask if the max definition level is
only 1, which allows for very efficient decode. This isn't possible when there
are repetition levels
* Performant record skipping, e.g. for predicate/index pushdown or late
materialization, is not really possible against data with repetition levels
* Many readers have quirky support for repetition levels and lists in
general, especially w.r.t. areas where the specification has been ambiguous in
the past. Finding ways for people to avoid these pain points is potentially
valuable
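
As a rough illustration of the bitmask point above, here is a minimal sketch in Python. It covers only the payload of a bit-packed run at bit width 1 and ignores the run headers and RLE runs of the real hybrid encoding. The function names are hypothetical and are not taken from arrow-rs or polars.

```python
def pack_levels_bit_width_1(levels):
    """Pack 0/1 definition levels one bit per level, least significant bit first.

    Simplified stand-in for the payload of a bit-packed run at bit width 1;
    run headers and RLE runs are deliberately left out of this sketch.
    """
    out = bytearray((len(levels) + 7) // 8)
    for i, level in enumerate(levels):
        if level:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def validity_bitmap(values):
    """Build an LSB-first validity bitmap: bit i is set when values[i] is not null."""
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# A flat nullable column: max definition level is 1, no repetition levels.
values = [10, None, 30, 40, None, None, 70, 80]
def_levels = [0 if v is None else 1 for v in values]

packed = pack_levels_bit_width_1(def_levels)
bitmap = validity_bitmap(values)
```

Under these assumptions the packed levels and the validity bitmap are the same bytes, which is why a reader can use the former as the latter without decoding level by level.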
That's all to say that providing a way to encode fixed size lists seems like a
very useful capability. That being said, it does seem to be a bit of a hack to
make this a logical type, and it will potentially limit the options for
encodings, statistics, sort orders, etc.
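
To make the record skipping point concrete, here is a hedged sketch in Python of what skipping N records costs in each case. It assumes a single list column with no null lists or null elements, so every level entry corresponds to one value. Both helper names are hypothetical.

```python
def offset_with_rep_levels(rep_levels, n_records):
    """Return the index of the first level entry after skipping n_records records.

    A new record starts wherever the repetition level is 0, so the levels
    have to be scanned to find the record boundary.
    """
    seen = 0
    for i, rep in enumerate(rep_levels):
        if rep == 0:
            if seen == n_records:
                return i
            seen += 1
    # Fewer than n_records + 1 records: skip everything.
    return len(rep_levels)


def offset_fixed_size(list_size, n_records):
    """With a fixed list size the offset is plain arithmetic, no levels needed."""
    return list_size * n_records


# Three lists of three elements each: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rep_levels = [0, 1, 1, 0, 1, 1, 0, 1, 1]

scanned = offset_with_rep_levels(rep_levels, 2)
computed = offset_fixed_size(3, 2)
```

Both calls land on the same offset, but the first one has to decode and inspect the repetition levels to get there, while the second does not touch the data at all.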