daniel-x commented on issue #38441:
URL: https://github.com/apache/arrow/issues/38441#issuecomment-4252835716

   I'm interested in working on this issue, as I see high potential for 
reduced file sizes, which is a strong reason for choosing Parquet as a 
long-term storage format, and we would improve I/O speed at the same time. 
I personally work with data where slightly better encoding defaults would 
reduce my zstd-compressed file sizes by about 20%. Currently, specifying 
per-column encodings through the Python API requires a fair bit of manual 
configuration, which can be a barrier for new users.
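   For context, this is roughly what the manual per-column configuration looks 
like today (a sketch, not a recommendation; the column names, data, and chosen 
encodings are made up for illustration):

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({
       "id": pa.array(range(1000), type=pa.int64()),
       "reading": pa.array([x * 0.5 for x in range(1000)], type=pa.float64()),
   })

   path = os.path.join(tempfile.mkdtemp(), "example.parquet")
   pq.write_table(
       table,
       path,
       compression="zstd",
       # column_encoding requires dictionary encoding to be disabled
       use_dictionary=False,
       column_encoding={
           "id": "DELTA_BINARY_PACKED",    # often compact for integer columns
           "reading": "BYTE_STREAM_SPLIT", # helps floats compress better
       },
   )

   roundtrip = pq.read_table(path)
   assert roundtrip.equals(table)
   ```

   Every column a user wants tuned has to be listed by hand like this, which is 
exactly the friction better defaults would remove.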
   
   Here is how I think about scoping a first step:
   - I suggest targeting simple, low-risk changes that capture large gains.
   - We can set defaults for encoding per column based on
     - datatype of the column
     - subsequent compression
     - version of the Parquet file, for compatibility (this can be omitted, 
simplifying the logic, by always choosing compatible defaults)
     - (no sampling of column content in this first step, because that is 
where it gets tricky)
   
   I believe this will address many of the common cases.
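   To make the scope concrete, the selection rule above could look something 
like the sketch below. This is my proposal, not existing Arrow code; the 
function name, the rule table, and the dtype/compression strings are all 
hypothetical, though the return values mirror real Parquet encoding names:

   ```python
   def default_encoding(dtype: str, compression: str, version: str = "2.6") -> str:
       """Hypothetical rule table mapping column properties to an encoding."""
       # Conservative fallback for the older format version: PLAIN is
       # always readable, so compatibility is never at risk.
       if version.startswith("1."):
           return "PLAIN"
       if dtype in ("int32", "int64"):
           return "DELTA_BINARY_PACKED"
       if dtype in ("float", "double") and compression in ("zstd", "snappy", "gzip"):
           # BYTE_STREAM_SPLIT only pays off combined with a general compressor,
           # which is why the rule looks at the subsequent compression at all.
           return "BYTE_STREAM_SPLIT"
       if dtype in ("string", "binary"):
           return "DELTA_LENGTH_BYTE_ARRAY"
       return "PLAIN"

   assert default_encoding("int64", "zstd") == "DELTA_BINARY_PACKED"
   ```

   The point is that the whole first step can stay this small: a pure lookup 
on (datatype, compression, version), with no inspection of column values.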
   
   Technically, encoding can be chosen independently of the subsequent 
compression, but decoupling the two can produce larger files and worse I/O 
speed, as the benchmarks in issue #49715 demonstrate. Hence, I think the 
encoding defaults should take compression into account to get the best 
results, even though it increases code complexity.
   
   I'd be happy to put together an initial PR if this direction sounds 
reasonable, and to hear about any considerations I'm missing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
