JigaoLuo commented on issue #7723: URL: https://github.com/apache/arrow-rs/issues/7723#issuecomment-3020070831
@XiangpengHao I have one question and one discussion point:

Question:
- Apologies for being pedantic, but is the "before-image" case supposed to be 2MB rather than 1MB? I noticed the dictionary size is bounded at ~2049 KB, which aligns more closely with 2MB.

Discussion Point:
To optimize dictionary encoding, I think we first need to consider the cardinality of each column in the row group relative to the row count. Specifically:
- Track the unique values in a hashset for each column.
- Sum up the total size of all unique entries to set a meaningful dictionary page limit.
- This would ensure that all strings can be reliably dictionary-encoded rather than stored in raw format (if we want to avoid PLAIN).
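
To make the discussion point concrete, here is a minimal sketch of the bookkeeping it describes, using only the standard library. `DictEstimate` and `estimate_dictionary` are hypothetical names for illustration and are not part of arrow-rs. The 4-byte length prefix per entry assumes the dictionary page stores strings as PLAIN-encoded BYTE_ARRAY values; page headers and compression are ignored.

```rust
use std::collections::HashSet;

/// Assumed per-value overhead: the 4-byte length prefix of a
/// PLAIN-encoded BYTE_ARRAY value in the dictionary page.
const LENGTH_PREFIX_BYTES: usize = 4;

/// Hypothetical summary of one string column within a row group.
#[derive(Debug, PartialEq)]
struct DictEstimate {
    /// Number of rows seen in the column.
    row_count: usize,
    /// Number of distinct values (the column's cardinality).
    distinct_count: usize,
    /// Estimated bytes needed to hold every distinct value once.
    dictionary_bytes: usize,
}

/// Hypothetical helper: walk the column once, track unique values in a
/// hash set, and sum the size of each value the first time it appears.
fn estimate_dictionary(values: &[&str]) -> DictEstimate {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut dictionary_bytes = 0;
    for &value in values {
        // `insert` returns true only for values not seen before.
        if seen.insert(value) {
            dictionary_bytes += LENGTH_PREFIX_BYTES + value.len();
        }
    }
    DictEstimate {
        row_count: values.len(),
        distinct_count: seen.len(),
        dictionary_bytes,
    }
}

fn main() {
    let column = ["apple", "banana", "apple", "cherry", "banana"];
    let estimate = estimate_dictionary(&column);
    println!("rows: {}", estimate.row_count);
    println!("distinct values: {}", estimate.distinct_count);
    println!("estimated dictionary bytes: {}", estimate.dictionary_bytes);
}
```

The resulting `dictionary_bytes` could then inform the writer's dictionary page size limit (for example via `WriterPropertiesBuilder::set_dictionary_page_size_limit` in the `parquet` crate), and comparing `distinct_count` with `row_count` gives a rough signal of whether dictionary encoding is worthwhile for that column at all. The hash set itself costs memory proportional to the number of distinct values, so this is a trade-off rather than a free improvement.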
