etseidl commented on PR #9700:
URL: https://github.com/apache/arrow-rs/pull/9700#issuecomment-4347683666
Thanks @mzabaluev, I think this is getting close. I've begun thinking ahead
a bit, and want to bounce a few more ideas off you. I think this PR is touching
on a larger issue (#8358) of balancing compression with performance. How do you
feel about the following:
First, modify the `DictionaryFallback` you've added to include more
parameters. So maybe something like
```rust
pub enum DictionaryFallback {
OnPageSizeLimit { max_dictionary_size: i32 },
OnUnfavorableAfter { num_samples: i32, max_dictionary_size: i32 },
}
```
and then deprecate `set_dictionary_page_size_limit` and family in favor of
the new enum. To keep the same behavior you have now, we'd need the max dict
size in that variant as well. (I also think `i32` is the appropriate type since
a Parquet page uses that to indicate the uncompressed page size, so the limit
is 2GB anyway).
In the future, I think it would be nice to have a policy that will fall back
early on bad compression, but not fall back because of dictionary size on good
compression. That policy could have settings for the compression ratio to
decide early fallback, and perhaps a second ratio to decide on whether it's
worth growing the dictionary (e.g. fall back if the compression ratio is > 0.9,
but allow a larger dictionary if the ratio is < 0.5).
What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]