daniel-x commented on issue #38441:
URL: https://github.com/apache/arrow/issues/38441#issuecomment-4252835716
I'm interested in working on this issue, as I see high potential for
reduced file sizes, which is a strong reason for choosing Parquet as a file
format for long-term storage, and we would improve I/O speed at the same time.
I personally work with data where slightly better encoding defaults would
reduce my zstd-compressed file sizes by about 20%. Currently, specifying the
encoding per column through the Python API requires a fair bit of manual
configuration, which can be a barrier for new users.
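For context, this is roughly what the per-column configuration looks like
today through pyarrow.parquet.write_table (a minimal sketch; the column names,
encodings, and output path are just placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "sensor_reading": pa.array([1.0, 1.5, 2.0], type=pa.float64()),
    "timestamp_ms": pa.array([1000, 1010, 1020], type=pa.int64()),
})

# Non-default encodings currently have to be spelled out column by column,
# and dictionary encoding has to be disabled for those columns explicitly.
pq.write_table(
    table,
    "data.parquet",
    compression="zstd",
    use_dictionary=False,
    column_encoding={
        "sensor_reading": "BYTE_STREAM_SPLIT",
        "timestamp_ms": "DELTA_BINARY_PACKED",
    },
)
```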
Here is how I think about scoping a first step:
- I suggest targeting simple, low-risk changes that capture large gains.
- We can set per-column encoding defaults based on:
  - the datatype of the column
  - the subsequent compression codec
  - the version of the Parquet file (for compatibility); this can be omitted
    and simplified by always choosing compatible defaults
- No sampling of column content in this first step, because that is where it
  gets tricky.
I believe this will address many of the common cases.
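To make the idea concrete, here is a rough sketch of the kind of heuristic I
have in mind; the function name and the specific encoding choices are
illustrative only, not a proposal for the actual defaults:

```python
import pyarrow as pa

def suggest_default_encoding(field: pa.Field, compression: str) -> str:
    """Illustrative per-column encoding default, chosen from the column's
    datatype and the compression codec applied afterwards."""
    t = field.type
    if pa.types.is_floating(t):
        # BYTE_STREAM_SPLIT tends to help general-purpose codecs like zstd
        # on floating-point data.
        if compression in ("zstd", "lz4", "snappy"):
            return "BYTE_STREAM_SPLIT"
        return "PLAIN"
    if pa.types.is_integer(t) or pa.types.is_temporal(t):
        return "DELTA_BINARY_PACKED"
    if pa.types.is_string(t) or pa.types.is_binary(t):
        return "DELTA_LENGTH_BYTE_ARRAY"
    return "PLAIN"

# Example: derive defaults for a whole schema before writing.
schema = pa.schema([("price", pa.float64()), ("item_id", pa.int64())])
encodings = {f.name: suggest_default_encoding(f, "zstd") for f in schema}
```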
Technically, encoding can be chosen independently of the subsequent
compression, but doing so can lead to larger file sizes and worse I/O speed,
as demonstrated by the benchmarks in issue #49715. Hence, I think the encoding
defaults should take compression into account to get the best results, even
though this increases code complexity.
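As a quick way to check this interaction on one's own data, one can write the
same column with different encodings and compare the resulting file sizes
(the output paths and the random sample data below are placeholders):

```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"values": np.random.normal(size=1_000_000)})

variants = {
    "dictionary": {"use_dictionary": True},
    "byte_stream_split": {
        "use_dictionary": False,
        "column_encoding": {"values": "BYTE_STREAM_SPLIT"},
    },
}

for name, kwargs in variants.items():
    path = f"/tmp/bench_{name}.parquet"
    # Same data and codec, only the encoding differs.
    pq.write_table(table, path, compression="zstd", **kwargs)
    print(name, os.path.getsize(path), "bytes")
```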
I'd be happy to put together an initial PR if this direction sounds
reasonable. Please let me know if there are considerations I'm missing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]