scovich commented on code in PR #7307: URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2364598624
##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -52,12 +70,13 @@ fn build_reader(
     field: &ParquetField,
     mask: &ProjectionMask,
     row_groups: &dyn RowGroups,
+    row_number_column: Option<&str>,

Review Comment:
> What do other parquet readers do to represent row numbers in their output schema?

https://github.com/apache/arrow-rs/pull/7307#issuecomment-2808130256, posted Apr 15, might be a starting point?

> AFAIK, most parquet readers now support row numbers. We can add [DuckDB](https://github.com/duckdb/duckdb/blob/main/extension/parquet/include/reader/row_number_column_reader.hpp) and [Iceberg](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L292) to the ones already mentioned above.

DuckDB uses a [column schema type](https://github.com/duckdb/duckdb/blob/main/extension/parquet/parquet_reader.cpp#L411-L412) approach. Interestingly, that's new -- last time I looked (nearly a year ago) it required the reader to pass options along with the schema, and one of the options was to request row numbers (which then became an extra unnamed column at the end of the regular schema). I think that approach didn't scale as they started needing more and more special column types. I see geometry, variant, and non-materialized expressions, for example.

Iceberg's parquet reader works almost exclusively from field ids, and the row index has a baked-in field id from the range reserved for metadata row ids.

Spark uses a metadata column approach, identified by a special name (`_metadata._rowid`); I don't remember precisely how that maps to the underlying parquet reader.
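To make the comparison concrete, here is a rough sketch of the three ways a reader could be told where row numbers go: an option that appends an extra column, a reserved column name, or a reserved field id. Everything in it (`RowNumberSpec`, `Column`, `row_number_index`, and the example names and ids) is hypothetical and only illustrates the design space; none of it is the actual API of arrow-rs, DuckDB, Iceberg, or Spark.

```rust
/// Hypothetical description of how the caller asks for row numbers.
#[derive(Debug, Clone, PartialEq)]
enum RowNumberSpec {
    /// Reader option: row numbers become an extra column after the regular schema.
    AppendedByOption,
    /// Schema-driven: the requested schema contains a column with a reserved name.
    ReservedName(String),
    /// Schema-driven: the requested schema contains a column with a reserved field id.
    ReservedFieldId(i32),
}

/// Hypothetical, minimal stand-in for a column in the requested output schema.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    field_id: Option<i32>,
}

/// Returns the output position of the row-number column, if one is requested.
///
/// With `AppendedByOption` the position is always one past the regular columns.
/// With the schema-driven variants the position is wherever the caller placed
/// the special column, or `None` if the schema does not contain it.
fn row_number_index(columns: &[Column], spec: &RowNumberSpec) -> Option<usize> {
    match spec {
        RowNumberSpec::AppendedByOption => Some(columns.len()),
        RowNumberSpec::ReservedName(name) => {
            columns.iter().position(|c| &c.name == name)
        }
        RowNumberSpec::ReservedFieldId(id) => {
            columns.iter().position(|c| c.field_id == Some(*id))
        }
    }
}
```

The schema-driven variants let the caller choose the column's position and generalize to other special columns, whereas the option-based variant needs a new flag for every special column type, which is the scaling problem mentioned above.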