[ https://issues.apache.org/jira/browse/PARQUET-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030160#comment-17030160 ]
Deepak Majeti commented on PARQUET-1781:
----------------------------------------
Even though the 1.3 writer wrote "min_value" and "max_value" along with the
old "min" and "max", the new statistics are not valid, because the column
order is not set as the Parquet spec requires. In a way, it is a bug in the
1.3 reader that it returns the new stats without verifying the column order.
The reader in 1.4 does the right thing.
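
To make the difference between the two reader behaviors concrete, here is a
minimal sketch. The struct, enum, and function names are hypothetical
stand-ins for illustration; they are not the parquet-cpp API, and the real
check lives in metadata.cc.

```cpp
// Hypothetical stand-ins for illustration only; not the parquet-cpp API.
enum class ColumnOrderKind { kUndefined, kTypeDefinedOrder };

struct ChunkStatsInfo {
  bool has_min_value = false;  // new-style "min_value" field is present
  bool has_max_value = false;  // new-style "max_value" field is present
  ColumnOrderKind column_order = ColumnOrderKind::kUndefined;
};

// 1.3-style check: trust the new fields whenever either one is present.
inline bool UseNewStatsPresenceOnly(const ChunkStatsInfo& info) {
  return info.has_min_value || info.has_max_value;
}

// 1.4-style check: trust the new fields only when the file declares a
// type-defined column order.
inline bool UseNewStatsColumnOrder(const ChunkStatsInfo& info) {
  return info.column_order == ColumnOrderKind::kTypeDefinedOrder;
}
```

For a file that carries min_value/max_value but leaves the column order
unset, the first check accepts the new stats and the second rejects them,
which is the behavior change reported in this issue.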
> [C++] 1.4.0+ reader ignore stats created by 1.3.* writer
> --------------------------------------------------------
>
> Key: PARQUET-1781
> URL: https://issues.apache.org/jira/browse/PARQUET-1781
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Affects Versions: cpp-1.4.0, cpp-1.5.0
> Reporter: Milos Sukovic
> Priority: Major
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> [https://github.com/apache/arrow/commit/d257a88ed612301c0411894dfa783fcbff1bc867]
> In the referenced commit, a change to metadata.cc altered the check that
> decides whether the new stats (min_value/max_value) are used.
> From
>   if (metadata.statistics.__isset.max_value ||
>       metadata.statistics.__isset.min_value)
> to
>   if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER)
>
> This change breaks backward compatibility: files that contain the new stats
> (min_value/max_value) and were created before this change are valid, but
> they do not set the column order flag.
> After this change, those stats are ignored, because the reader checks the
> column order flag.
> A possible fix would be something like:
>   if (descr->column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER ||
>       (version == parquetcpp 1.3.* && (metadata.statistics.__isset.max_value ||
>                                        metadata.statistics.__isset.min_value)))
> I checked parquet-mr, and it seems that columnOrder was introduced there as
> part of the same change as min_value and max_value, so the issue should not
> occur for files created by the Java code. Their reader, however, probably
> also ignores the stats in files created by parquet-cpp 1.3.*.