tcrasset commented on issue #38877: URL: https://github.com/apache/arrow/issues/38877#issuecomment-1825926099
Thank you for your notes @mapleFU. I need a clarification though:

> For "across multiple ColumnChunkMetadata", in fact, the Statistics only work for one column-chunk. We cannot regard it as a whole-file distinct-count.

As I understand it from [the spec](https://parquet.apache.org/docs/file-format/), a file consists of one or more RowGroups, each of which contains one or more ColumnChunks.

I understand we cannot regard it as a whole-file distinct count (as in the distinct count of all the columns combined), but is it a per-column distinct count, or a per-column-**chunk** distinct count? You seem to say it is per column chunk, but I want to be sure I understand correctly.

```text
+-------+-------+
| col_1 | col_2 |
+-------+-------+
| a     | b     |
| x     | b     |
================= Row group boundary
| b     | d     |
| x     | d     |
+-------+-------+
```

Here we have 4 column chunks. So the statistics would be

```text
<Column col_1 Chunk 1 + Column Metadata> --> distinct_count = 2 ("a", "x")
<Column col_2 Chunk 1 + Column Metadata> --> distinct_count = 1 ("b")
<Column col_1 Chunk 2 + Column Metadata> --> distinct_count = 2 ("b", "x")
<Column col_2 Chunk 2 + Column Metadata> --> distinct_count = 1 ("d")
```

right? But the actual distinct count of col_1 is 3 ("a", "b", "x"), not 2 + 2 = 4, so we cannot add the per-chunk counts up.
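To check my own reasoning, here is a minimal sketch in plain Python that mirrors the example above. It does not touch Parquet or Arrow at all: `row_groups`, `chunk_distinct_counts` and `column_distinct_count` are hypothetical names I made up for illustration, and the data is just the four-row table from this comment.

```python
# Hypothetical in-memory stand-in for the two row groups in the example;
# this is not a Parquet or Arrow API.
row_groups = [
    {"col_1": ["a", "x"], "col_2": ["b", "b"]},  # row group 1
    {"col_1": ["b", "x"], "col_2": ["d", "d"]},  # row group 2
]


def chunk_distinct_counts(row_groups, column):
    """Distinct count of each column chunk, computed independently per row group."""
    return [len(set(group[column])) for group in row_groups]


def column_distinct_count(row_groups, column):
    """Distinct count over the whole column, i.e. across all row groups."""
    values = set()
    for group in row_groups:
        values.update(group[column])
    return len(values)


for column in ("col_1", "col_2"):
    per_chunk = chunk_distinct_counts(row_groups, column)
    whole = column_distinct_count(row_groups, column)
    print(f"{column}: per-chunk={per_chunk} sum={sum(per_chunk)} whole-column={whole}")
```

For col_1 the per-chunk counts sum to 4 while the whole-column count is 3, because "x" appears in both chunks. For col_2 the sum happens to match (2) only because the two chunks share no values. So, if I read this right, the sum of the per-chunk counts is only an upper bound on the per-column count, and the largest per-chunk count is a lower bound.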
