etseidl commented on PR #9700: URL: https://github.com/apache/arrow-rs/pull/9700#issuecomment-4240256987
Thanks @mzabaluev, I was unaware of this parquet-java behavior. I wonder, however, if even the parquet-java version needs to be updated. I looked at the java code, and it has been around for quite some time (it was added in late 2014). At that time, I believe the default page size was on the order of a megabyte, so using this heuristic after a single page was probably not a bad idea. However, when the page indexes were added, parquet-java was modified to by default limit pages to 20000 rows (this crate adopted the default 20k page size quite some time later). IMO, 20000 values is too small a sample to decide if a dictionary is having a beneficial effect. Let's say one has a relatively low cardinality (32k) i64 column with a somewhat random distribution. After encoding one 20k row page I think the heuristic here will almost certainly choose plain vs dictionary, but if one were to encode 10 pages, dictionary would then be seen to be superior by far. I like that this is opt-in, but then wonder if a user knows this heuristic will be helpful (i.e. they know it's a high cardinality column), could they not instead simply disable dictionary encoding for the column in question. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
