mzabaluev opened a new issue, #9739: URL: https://github.com/apache/arrow-rs/issues/9739
The `dict_fallback` method of `GenericColumnWriter`[^1] writes the dictionary page to the output even though the conditions for the fallback have been reached, meaning that dictionary encoding has been judged unsuitable for the entire column chunk. This presents two minor problems:

1. In the current `should_dict_fallback` logic, the dictionary has met or exceeded its page size limit as configured in the column properties. Oversized dictionary pages, though not violating any format constraints, may be surprising to the user. The data pages of the column chunk are then encoded piecemeal, first with the dictionary encoding and then with a fallback encoding, which is again legal but weird. More importantly, a larger than expected dictionary may result from high cardinality of the values, in which case encoding all data pages with the fallback encoding may produce a more compact result.
2. More fallback strategies may be added in the future, as proposed in #9699 and implemented in #9700. In such cases, the dictionary encoding is judged inefficient based on the size of a partial encoding, so it does not make sense to write out the first, inefficiently encoded pages and then continue with the better encoding.

For comparison, the `FallbackValuesWriter` implementation in parquet-java extracts all values from the dictionary encoder so that they can be re-encoded by the fallback encoder.

[^1]: Creative choice of a name! 😵‍💫
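
To illustrate the re-encoding approach, here is a minimal sketch. All of the types in it (`ToyColumnWriter`, `EncodedChunk`) are hypothetical and exist only for this example; it is not the arrow-rs or parquet-java API. It assumes `i64` values, a size limit counted as raw dictionary bytes, and a "plain" fallback that simply stores the values, and it ignores page boundaries and actual byte-level encoding.

```rust
use std::collections::HashMap;

/// Hypothetical result of encoding one column chunk: either fully
/// dictionary-encoded or fully plain-encoded, never a mix of the two.
#[derive(Debug, PartialEq)]
enum EncodedChunk {
    Dictionary { dict: Vec<i64>, indices: Vec<u32> },
    Plain(Vec<i64>),
}

/// Toy writer that buffers dictionary-encoded data until the chunk is closed.
struct ToyColumnWriter {
    dict: Vec<i64>,
    lookup: HashMap<i64, u32>,
    indices: Vec<u32>,
    /// `Some` once the writer has fallen back to plain encoding.
    plain: Option<Vec<i64>>,
    /// Assumed limit on the dictionary size, in bytes.
    dict_size_limit: usize,
}

impl ToyColumnWriter {
    fn new(dict_size_limit: usize) -> Self {
        Self {
            dict: Vec::new(),
            lookup: HashMap::new(),
            indices: Vec::new(),
            plain: None,
            dict_size_limit,
        }
    }

    fn write(&mut self, value: i64) {
        // After fallback, values go straight to the plain buffer.
        if let Some(plain) = self.plain.as_mut() {
            plain.push(value);
            return;
        }
        // Look up the value, assigning the next free index if it is new.
        let next = self.dict.len() as u32;
        let idx = *self.lookup.entry(value).or_insert(next);
        if idx == next {
            self.dict.push(value);
        }
        self.indices.push(idx);
        // Fall back once the dictionary meets or exceeds the limit.
        if self.dict.len() * std::mem::size_of::<i64>() >= self.dict_size_limit {
            self.fall_back();
        }
    }

    /// Discards the dictionary and re-encodes every buffered value as plain,
    /// instead of emitting the dictionary page and its data pages.
    fn fall_back(&mut self) {
        let dict = std::mem::take(&mut self.dict);
        let indices = std::mem::take(&mut self.indices);
        self.lookup.clear();
        let values = indices.iter().map(|&i| dict[i as usize]).collect();
        self.plain = Some(values);
    }

    fn close(self) -> EncodedChunk {
        match self.plain {
            Some(values) => EncodedChunk::Plain(values),
            None => EncodedChunk::Dictionary {
                dict: self.dict,
                indices: self.indices,
            },
        }
    }
}
```

With a 24-byte limit, writing `1, 1, 2` keeps the chunk dictionary-encoded, while writing `1, 1, 2, 3, 4` triggers the fallback at the third distinct value and yields a single plain-encoded chunk containing all five values. The obvious cost of this design is that already encoded values have to be held in memory until the chunk is closed or the fallback decision is made.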
