wgtmac commented on issue #34536: URL: https://github.com/apache/arrow/issues/34536#issuecomment-1467228060
This is interesting. We need to make these parameters configurable via `ColumnProperties` or something before benchmarking. By the way, I really doubt we can reach to a clear answer to this question in the end. The best encoding ratio is the entropy of the data which requires a precise knowledge of probability and distribution of all input words. As the result is highly dependent on the data distribution, we need to define and prepare some datasets with distinct distribution and pattern. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
