judahrand commented on PR #37469: URL: https://github.com/apache/arrow/pull/37469#issuecomment-1714355320
> When disable collect statistics, currently it would be hard to collect the `ColumnIndex`, because `ColumnIndex` relies on `Statistics`.

Yeah, that makes sense! It would be good to clarify this in the docs - I'll see if I get to it in another PR.

> Also, by the way, by default, if column index is enabled, the page header statistics will not be written. (Since spec says if column index exists, page header is not tent to be written)

Yeah, what gets written and what doesn't is quite confusing. In fact, the spec doesn't technically say that page-level statistics must not be written when writing the ColumnIndex, only that doing so isn't recommended (one might want both in order to support old readers).

https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach

It's probably sensible default behaviour, but it'd be nice to be able to force writing both.
