[I] [Parquet] Encoding configuration should be easier and more automated [arrow]

via GitHub Tue, 24 Oct 2023 09:39:02 -0700


pitrou opened a new issue, #38441:
URL: https://github.com/apache/arrow/issues/38441


   ### Describe the enhancement requested
   
   Right now, configuring good encoding values is difficult for users. There is 
nothing to help them make those decisions, and the defaults are a bit 
simplistic (try RLE_DICTIONARY, then fall back on PLAIN, IIUC). If they want to 
override encodings, they have to do so on a column-by-column basis (which 
probably becomes very cumbersome if there are hundreds of columns).
   
   Ideally, there should be a way for users to get an automatic selection of 
encodings, based on their data or at least their data types (and also the 
selected Parquet version), that provides a good compromise between disk 
footprint and decoding speed.
   
   (in Python, think `pq.write_table(..., column_encoding="auto")`)
   
   It would perhaps also be nice for users to be able to pass per-datatype 
preferences, rather than per-column ones.
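
To illustrate the per-datatype idea, here is a minimal sketch of how such preferences could be expanded into the per-column mapping that writers accept today. The helper name `expand_type_preferences`, the example schema, and the chosen encodings are all hypothetical and only serve as an illustration; this is not an existing Arrow API.

```python
from typing import Dict, Mapping


def expand_type_preferences(
    column_types: Mapping[str, str],
    type_preferences: Mapping[str, str],
) -> Dict[str, str]:
    """Expand per-datatype encoding preferences into a per-column mapping.

    Columns whose type has no stated preference are left out of the result,
    so the writer's default encoding would still apply to them.
    """
    return {
        name: type_preferences[type_name]
        for name, type_name in column_types.items()
        if type_name in type_preferences
    }


# Hypothetical schema: column name -> type name.
column_types = {
    "id": "int64",
    "ts": "int64",
    "price": "float",
    "name": "string",
}

# Hypothetical user preferences: type name -> Parquet encoding name.
type_preferences = {
    "int64": "DELTA_BINARY_PACKED",
    "float": "BYTE_STREAM_SPLIT",
}

column_encoding = expand_type_preferences(column_types, type_preferences)
print(column_encoding)
```

In Python, the type names could presumably be taken from `str(field.type)` for each field of `table.schema`, and the resulting dict passed as `pq.write_table(table, path, use_dictionary=False, column_encoding=column_encoding)`. As far as I know, PyArrow currently only accepts explicit column encodings for columns where dictionary encoding is disabled, hence `use_dictionary=False` in that call.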
   
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

