etseidl commented on PR #9700:
URL: https://github.com/apache/arrow-rs/pull/9700#issuecomment-4240256987

   Thanks @mzabaluev, I was unaware of this parquet-java behavior. I wonder, 
however, if even the parquet-java version needs to be updated.
   
   I looked at the java code, and it has been around for quite some time (it 
was added in late 2014). At that time, I believe the default page size was on 
the order of a megabyte, so using this heuristic after a single page was 
probably not a bad idea. However, when the page indexes were added, 
parquet-java was modified to limit pages to 20000 rows by default (this crate 
adopted the same 20k-row default quite some time later). IMO, 20000 values is 
too small a sample to decide if a dictionary is having a beneficial effect. 
Let's say one has a relatively low cardinality (32k) i64 column with a somewhat 
random distribution. After encoding one 20k row page I think the heuristic here 
will almost certainly choose plain over dictionary, but if one were to encode 10 
pages, dictionary would then be seen to be superior by far.
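
   A back-of-the-envelope sketch of that arithmetic, in case it helps. It 
assumes uniformly random values, 8 bytes per dictionary entry, bit-packed 
indices, and ignores RLE run headers, page headers and compression. The 
function names are made up for illustration and the exact threshold the 
heuristic applies is not modeled, only the size ratio it would be looking at.

```rust
/// Expected number of distinct values after `rows` uniform draws
/// from a domain of `cardinality` values.
fn expected_distinct(cardinality: f64, rows: f64) -> f64 {
    cardinality * (1.0 - (1.0 - 1.0 / cardinality).powf(rows))
}

/// Rough (dictionary page + index bytes) / (plain bytes) for an i64 column.
/// Hypothetical helper: ignores RLE headers, page headers and compression.
fn dict_to_plain_ratio(cardinality: u64, rows: u64) -> f64 {
    let distinct = expected_distinct(cardinality as f64, rows as f64).ceil();
    // Bits needed to index `distinct` dictionary entries.
    let bit_width = distinct.log2().ceil().max(1.0);
    let dict_bytes = distinct * 8.0; // one i64 per dictionary entry
    let index_bytes = rows as f64 * bit_width / 8.0;
    let plain_bytes = rows as f64 * 8.0;
    (dict_bytes + index_bytes) / plain_bytes
}

fn main() {
    // Compare the estimate after one 20k-row page and after ten.
    for pages in [1u64, 10] {
        let rows = pages * 20_000;
        let ratio = dict_to_plain_ratio(32_768, rows);
        println!("{pages:>2} page(s): dict/plain ~ {ratio:.2}");
    }
}
```

   Under these assumptions the dictionary shows essentially no benefit after 
one page, since most values seen so far are still new dictionary entries, 
while after ten pages the dictionary is nearly complete and the encoded size 
is well under half of plain.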
   
   I like that this is opt-in, but I wonder: if a user knows this heuristic 
will be helpful (i.e. they know it's a high cardinality column), could they not 
instead simply disable dictionary encoding for the column in question?
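
   For reference, a minimal sketch of that alternative using the existing 
per-column writer property. The column name `user_id` is hypothetical, and 
this assumes the `parquet` crate is available as a dependency.

```rust
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() {
    // "user_id" is a hypothetical high-cardinality column.
    let col = ColumnPath::from("user_id");

    // Turn dictionary encoding off for that one column only;
    // all other columns keep the writer-wide default.
    let props = WriterProperties::builder()
        .set_column_dictionary_enabled(col.clone(), false)
        .build();

    assert!(!props.dictionary_enabled(&col));
}
```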
   

