[GitHub] [arrow] mapleFU commented on pull request #34355: GH-34351: [C++][Parquet] Statistic: tiny optimization

via GitHub Sun, 26 Feb 2023 01:03:26 -0800


mapleFU commented on PR #34355:
URL: https://github.com/apache/arrow/pull/34355#issuecomment-1445303328


   After going through this piece of code, I found that the current impl is ok, 
because we mostly only use statistics on the writer side. But once 
`ExtractStatisticsFromPageHeader` or other reader paths are involved, things 
get a bit more complex.
   
   For now, we can assume that:
   
   1. The writer always has a correct null count (provided it has no bugs).
   2. Currently, I found that ndv is never collected. If a user collects ndv in 
page 1 but not in page 2, the merged ndv should be abandoned.
   
   For the reader:
   
   1. When deserializing, the reader should assume that ndv and null_count can 
be unset (but currently it doesn't work like this).
   2. Deserialized statistics must **not** call merge or other mutation methods.
   
   Currently, a writer will not hit a bug when merging. But if a reader checks 
`has_null_count` or `ndv`, it will get the wrong result.
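
   To make the rules above concrete, here is a minimal sketch of how "unset" 
could propagate through a merge. It is not the actual `parquet::Statistics` 
code: `PageStats` and `Merge` are hypothetical names, and always dropping ndv 
on merge is my own conservative assumption (distinct values can overlap across 
pages, so the counts alone cannot be combined exactly).

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for per-page statistics; not the parquet::Statistics API.
struct PageStats {
  std::optional<int64_t> null_count;      // unset when the page did not record it
  std::optional<int64_t> distinct_count;  // "ndv"; unset when it was not collected
};

// Merge `other` into `acc`.
inline void Merge(PageStats& acc, const PageStats& other) {
  // null_count survives only if both sides carry it; otherwise the merged
  // value would silently undercount, so it becomes unset.
  if (acc.null_count && other.null_count) {
    *acc.null_count += *other.null_count;
  } else {
    acc.null_count.reset();
  }
  // Assumption: ndv cannot be merged exactly from two counts, so the sketch
  // abandons it on every merge. This also covers the case where only one
  // page collected it.
  acc.distinct_count.reset();
}
```

   With this shape, a reader that checks `null_count.has_value()` after 
deserializing or merging sees "unset" instead of a misleading 0.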


