mzabaluev opened a new issue, #9739:
URL: https://github.com/apache/arrow-rs/issues/9739

   The `dict_fallback` method of `GenericColumnWriter`[^1] writes the 
dictionary page to the output even after the fallback conditions have been 
met, that is, once dictionary encoding has been judged unsuitable for 
encoding the entire column chunk. This presents two minor problems:
   
   1. In the current `should_dict_fallback` logic, fallback triggers once the 
dictionary has met or exceeded the dictionary page size limit configured in 
the column properties. Oversized dictionary pages, though they violate no 
format constraints, may surprise the user. The data pages of the column chunk 
are then encoded piecemeal, first with the dictionary and then with a 
fallback encoding, which is again legal but unusual. More importantly, a 
larger than expected dictionary may result from high cardinality of the 
values, in which case encoding all data pages with the fallback encoding may 
be more compact.
   2. More fallback strategies may be added in the future, as proposed in #9699 
and implemented in #9700. In such cases, dictionary encoding is judged 
inefficient based on the size of a partial encoding, so it makes little 
sense to write out the first, inefficiently encoded pages and then continue 
with the better encoding.
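
   To make the first problem concrete, here is a toy model of the behavior 
described above. It is only a sketch and not the arrow-rs code: 
`PageEncoding`, `mixed_chunk_layout`, `page_len` and `dict_limit` are 
hypothetical names, every value is assumed to cost 8 bytes in the dictionary, 
and the limit is assumed to be checked once per completed page.

```rust
use std::collections::HashMap;

/// Encoding used for one data page in this toy model (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageEncoding {
    Dictionary,
    Plain,
}

/// Returns the encoding of every data page plus the size in bytes of the
/// dictionary page that ends up in the chunk. `page_len` must be non-zero.
fn mixed_chunk_layout(
    values: &[i64],
    page_len: usize,
    dict_limit: usize,
) -> (Vec<PageEncoding>, usize) {
    let mut dict: HashMap<i64, u32> = HashMap::new();
    let mut fell_back = false;
    let mut pages = Vec::new();

    for page in values.chunks(page_len) {
        if fell_back {
            // Pages after the fallback use the fallback encoding.
            pages.push(PageEncoding::Plain);
            continue;
        }
        for v in page {
            let next = dict.len() as u32;
            dict.entry(*v).or_insert(next);
        }
        pages.push(PageEncoding::Dictionary);
        // Assumption: the limit is tested after the page is complete, so the
        // dictionary can already be larger than the limit at this point.
        if dict.len() * 8 >= dict_limit {
            fell_back = true;
        }
    }
    // The dictionary is kept and written even though the fallback happened.
    (pages, dict.len() * 8)
}
```

   With 100 distinct values, `page_len = 10` and `dict_limit = 150`, this 
model yields two dictionary-encoded pages, eight plain pages, and a 160-byte 
dictionary page that exceeds the configured limit.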
   
   For comparison, the `FallbackValuesWriter` implementation in parquet-java 
extracts all values from the dictionary encoder to be re-encoded by the 
fallback encoder.
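
   A minimal sketch of that re-encoding approach, in Rust, might look like the 
following. It is neither the parquet-java code nor an arrow-rs API: 
`ToyDictEncoder`, `Encoded`, `plain_encode` and `encode_chunk` are 
hypothetical names, values are `i64`, and the "plain" encoding is assumed to 
be little-endian bytes. The point is that the dictionary encoder buffers its 
indices, so on fallback every value seen so far can be recovered and handed 
to the fallback encoder.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary encoder that buffers indices instead of writing them.
struct ToyDictEncoder {
    index_of: HashMap<i64, u32>,
    dict: Vec<i64>,    // unique values in first-seen order
    indices: Vec<u32>, // one buffered index per value put so far
}

impl ToyDictEncoder {
    fn new() -> Self {
        Self { index_of: HashMap::new(), dict: Vec::new(), indices: Vec::new() }
    }

    fn put(&mut self, v: i64) {
        let idx = match self.index_of.get(&v) {
            Some(&i) => i,
            None => {
                let i = self.dict.len() as u32;
                self.dict.push(v);
                self.index_of.insert(v, i);
                i
            }
        };
        self.indices.push(idx);
    }

    /// Assumed dictionary size: 8 bytes per unique value.
    fn dict_bytes(&self) -> usize {
        self.dict.len() * 8
    }

    /// Recovers all buffered values in their original order and resets the encoder.
    fn take_values(&mut self) -> Vec<i64> {
        let values: Vec<i64> = self.indices.iter().map(|&i| self.dict[i as usize]).collect();
        self.indices.clear();
        self.dict.clear();
        self.index_of.clear();
        values
    }
}

/// Result of encoding a whole chunk in this toy model.
enum Encoded {
    Dictionary { dict: Vec<i64>, indices: Vec<u32> },
    Plain(Vec<u8>),
}

/// Stand-in fallback encoder: little-endian bytes of every value.
fn plain_encode(values: &[i64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn encode_chunk(values: &[i64], dict_limit: usize) -> Encoded {
    let mut enc = ToyDictEncoder::new();
    for (pos, &v) in values.iter().enumerate() {
        enc.put(v);
        if enc.dict_bytes() > dict_limit {
            // Fallback: discard the dictionary and re-encode everything,
            // including the values that were already dictionary-encoded.
            let mut all = enc.take_values();
            all.extend_from_slice(&values[pos + 1..]);
            return Encoded::Plain(plain_encode(&all));
        }
    }
    Encoded::Dictionary { dict: enc.dict, indices: enc.indices }
}
```

   In this model a chunk is either entirely dictionary-encoded or entirely 
fallback-encoded, at the cost of keeping the indices in memory until the 
decision is final.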
   
   [^1]: Creative choice of a name! 😵‍💫

