[
https://issues.apache.org/jira/browse/ARROW-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rok Mihevc updated ARROW-5129:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/21613
> [Rust][Parquet] Column writer bug: check dictionary encoder when adding a new
> data page
> ---------------------------------------------------------------------------------------
>
> Key: ARROW-5129
> URL: https://issues.apache.org/jira/browse/ARROW-5129
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Environment: N/A
> Reporter: Ivan Sadikov
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> As part of my weekly routine, I glanced over the code in the Parquet column
> writer and found that the way we check when to add a new data page is buggy.
> The idea is to check the current encoder and decide whether we have
> accumulated enough bytes to construct a page. The problem is that we only
> check the value encoder, regardless of whether the dictionary encoder is
> enabled.
> Here is how we do it now: the actual check
> (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L378)
> and the buggy function
> (https://github.com/apache/arrow/blob/master/rust/parquet/src/column/writer.rs#L423).
>
> In the case of a sparse column with dictionary encoding enabled, we would
> write a single data page even though the encoder has accumulated enough
> bytes for more than one page (the value encoder stays empty, so its size
> will always be less than the constant limit).
> I forgot that parquet-cpp keeps `current_encoder` as either the value
> encoder or the dictionary encoder
> (https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_writer.cc#L544),
> but in parquet-rs we keep them separate.
> So the fix could be something like this:
> {code}
> /// Returns true if there is enough data for a data page, false otherwise.
> #[inline]
> fn should_add_data_page(&self) -> bool {
>     match self.dict_encoder {
>         Some(ref encoder) => {
>             encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
>         },
>         None => {
>             self.encoder.estimated_data_encoded_size() >= self.props.data_pagesize_limit()
>         }
>     }
> }
> {code}
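> To make the failure mode concrete, here is a minimal, self-contained sketch
> with hypothetical stub types (not the real parquet-rs `ColumnWriter` API;
> the field and limit names are placeholders) contrasting the buggy check with
> the fixed dispatch:
> {code}
> // Hypothetical stub standing in for a parquet-rs encoder; only the
> // estimated-size accounting matters for this illustration.
> struct Encoder { buffered: usize }
>
> impl Encoder {
>     fn estimated_data_encoded_size(&self) -> usize { self.buffered }
> }
>
> struct ColumnWriter {
>     dict_encoder: Option<Encoder>, // Some(_) when dictionary encoding is on
>     encoder: Encoder,              // plain value encoder
>     pagesize_limit: usize,
> }
>
> impl ColumnWriter {
>     // Buggy: only the value encoder is consulted, so with dictionary
>     // encoding active this never reports a full page.
>     fn should_add_data_page_buggy(&self) -> bool {
>         self.encoder.estimated_data_encoded_size() >= self.pagesize_limit
>     }
>
>     // Fixed: consult whichever encoder is actually buffering the data.
>     fn should_add_data_page(&self) -> bool {
>         match self.dict_encoder {
>             Some(ref e) => e.estimated_data_encoded_size() >= self.pagesize_limit,
>             None => self.encoder.estimated_data_encoded_size() >= self.pagesize_limit,
>         }
>     }
> }
>
> fn main() {
>     // 2 MiB buffered in the dict encoder, value encoder empty, 1 MiB limit.
>     let w = ColumnWriter {
>         dict_encoder: Some(Encoder { buffered: 2 * 1024 * 1024 }),
>         encoder: Encoder { buffered: 0 },
>         pagesize_limit: 1024 * 1024,
>     };
>     assert!(!w.should_add_data_page_buggy()); // bug: page never flushed
>     assert!(w.should_add_data_page());        // fix: page boundary detected
> }
> {code}
> With dictionary encoding on, the value encoder stays empty forever, so the
> old check never fires no matter how much the dictionary encoder buffers.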
--
This message was sent by Atlassian Jira
(v8.20.10#820010)