pitrou opened a new issue, #40636:
URL: https://github.com/apache/arrow/issues/40636

   ### Describe the enhancement requested
   
   Currently, the choice of default encoding for a non-dictionary data page is trivial.
   It happens in two places:
   1. in the `FallbackToPlainEncoding` function, for columns for which dictionary encoding is attempted:
   
https://github.com/apache/arrow/blob/5718a2862b4254d8bf938912d8958837ac7313a5/cpp/src/parquet/column_writer.cc#L1567-L1580
   2. in the `ColumnWriter::Make` factory function, for columns for which dictionary encoding is not attempted:
   
https://github.com/apache/arrow/blob/5718a2862b4254d8bf938912d8958837ac7313a5/cpp/src/parquet/column_writer.cc#L2375-L2382
   
   I'll note that parquet-mr does not limit dictionary encoding fallback to PLAIN, even for "v1" Parquet files:
   
https://github.com/apache/parquet-mr/blob/95b004c3df473e3ab0963dc5136934ce5235d5df/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L124-L139
   
   We should probably consolidate the logic from the two functions above and make it more sophisticated, so that it picks the best encoding available for the selected Parquet version.
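
As a rough illustration of what "consolidating" could look like, here is a minimal, self-contained sketch in which both code paths would call one helper. Everything in it is an assumption made for illustration: the enums are local stand-ins rather than the real Parquet C++ types, `ChooseNonDictEncoding` is a hypothetical name, and the type-to-encoding mapping is a placeholder policy, not a proposal from this issue or the behavior of any existing writer.

```cpp
// Local stand-ins so the sketch compiles on its own (hypothetical, not the
// actual parquet:: enums, which have more members).
enum class PhysicalType {
  BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
};
enum class Encoding { PLAIN, RLE, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY };
enum class DataPageVersion { V1, V2 };

// Hypothetical helper: the single place that decides which encoding to use
// when a column is not (or is no longer) dictionary-encoded. Both the
// dictionary-fallback path and the factory path would call it.
constexpr Encoding ChooseNonDictEncoding(PhysicalType type,
                                         DataPageVersion version) {
  // Placeholder policy: stay conservative for V1 data pages. The issue notes
  // that parquet-mr is not limited to PLAIN here, so this branch is exactly
  // the kind of decision that would need to be revisited.
  if (version == DataPageVersion::V1) {
    return Encoding::PLAIN;
  }
  // Placeholder policy for V2 data pages (illustrative choices only).
  switch (type) {
    case PhysicalType::BOOLEAN:
      return Encoding::RLE;
    case PhysicalType::INT32:
    case PhysicalType::INT64:
      return Encoding::DELTA_BINARY_PACKED;
    case PhysicalType::BYTE_ARRAY:
    case PhysicalType::FIXED_LEN_BYTE_ARRAY:
      return Encoding::DELTA_BYTE_ARRAY;
    default:
      return Encoding::PLAIN;  // FLOAT, DOUBLE, and anything not listed
  }
}
```

The point of the sketch is only the shape: one function keyed on the column's physical type and the writer's version settings, so that the fallback path and the no-dictionary path cannot drift apart. An explicit per-column encoding set by the user would presumably still take precedence over whatever such a helper returns.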
   
   Also related: https://github.com/apache/arrow/issues/38441
   
   ### Component(s)
   
   C++, Parquet

