mapleFU commented on PR #34355: URL: https://github.com/apache/arrow/pull/34355#issuecomment-1445303328
After going through this piece of code, I found that the current implementation is OK, because we mostly use statistics only on the writer side. But once `ExtractStatisticsFromPageHeader` or other reader-side code is involved, things will get a bit more complex.

For now, we can assume the following for the writer:

1. The writer can guarantee that it has the right null count (if it doesn't have any bugs).
2. Currently I found that ndv is never collected. If a user collects ndv in page 1 but does not collect ndv in page 2, it should be abandoned.

For the reader:

1. When deserializing, the reader should assume that ndv and null_count can be unset (but currently it doesn't work like this).
2. Deserialized statistics can **not** call merge or other mutation methods.

Currently, a writer will not have bugs when merging. But if a reader checks `has_null_count` or `ndv`, it will get the wrong result.
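To make the "unset" cases concrete, here is a minimal sketch of the merge rule described above. It is not the Arrow implementation: `PageStats` and `MergeStats` are hypothetical names, and `std::optional` stands in for the `has_null_count` / ndv flags. It also assumes that a merged ndv is always dropped, since distinct counts from two pages cannot be added exactly.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified stand-in for per-page statistics.
// An empty optional means the field was not collected (or not set on read).
struct PageStats {
  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;  // "ndv"
};

// Merge two pages' statistics.
// - null_count is additive, but only when both sides actually have it;
//   if either side is unset, the merged value is unknown.
// - distinct_count is dropped on merge (assumption of this sketch):
//   the two pages may share values, so the counts cannot simply be summed.
//   This also covers the case where only one page collected ndv.
PageStats MergeStats(const PageStats& a, const PageStats& b) {
  PageStats merged;
  if (a.null_count.has_value() && b.null_count.has_value()) {
    merged.null_count = *a.null_count + *b.null_count;
  }
  merged.distinct_count = std::nullopt;
  return merged;
}
```

With this shape, a reader that deserializes a page header without a null count gets an empty optional rather than a misleading zero, and any merge involving that page reports the null count as unknown.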
