scovich commented on code in PR #7307:
URL: https://github.com/apache/arrow-rs/pull/7307#discussion_r2364598624


##########
parquet/src/arrow/array_reader/builder.rs:
##########
@@ -52,12 +70,13 @@ fn build_reader(
     field: &ParquetField,
     mask: &ProjectionMask,
     row_groups: &dyn RowGroups,
+    row_number_column: Option<&str>,

Review Comment:
   > What do other parquet readers do to represent row numbers in their output schema?
   
   https://github.com/apache/arrow-rs/pull/7307#issuecomment-2808130256, posted Apr 15, might be a starting point?
   
   > AFAIK, most parquet readers now support row numbers. We can add [DuckDB](https://github.com/duckdb/duckdb/blob/main/extension/parquet/include/reader/row_number_column_reader.hpp) and [Iceberg](https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java#L292) to the ones already mentioned above.
   
   DuckDB uses a [column schema type](https://github.com/duckdb/duckdb/blob/main/extension/parquet/parquet_reader.cpp#L411-L412) approach. Interestingly, that's new -- last time I looked (nearly a year ago), the reader had to pass options along with the schema, and one of those options requested row numbers (which then became an extra unnamed column at the end of the regular schema). I think that approach didn't scale as they started needing more and more special column types; I see geometry, variant, and non-materialized expressions, for example.
   
   Iceberg's parquet reader works almost exclusively from field ids, and the row index has a baked-in field id from the range of metadata row ids.
   
   Spark uses a metadata column approach, identified by a special name (`_metadata._rowid`); I don't remember precisely how that maps to the underlying parquet reader.
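   
   To make the comparison concrete, here is a rough sketch of the three identification schemes side by side (by column type, by reserved field id, by special name). Every name and constant in it (`ColumnKind`, `Column`, `ROW_NUMBER_FIELD_ID`, `ROW_NUMBER_NAME`, `row_numbers`) is hypothetical and made up for illustration; none of it is the actual arrow-rs, DuckDB, Iceberg, or Spark API.

```rust
/// Hypothetical column kinds a reader schema could carry
/// (the "column schema type" style of approach).
#[derive(Debug, Clone, PartialEq)]
enum ColumnKind {
    /// Regular column materialized from the file.
    Data,
    /// Virtual column: the row's position within the file.
    RowNumber,
}

/// Hypothetical schema entry; not a type from any real reader.
#[derive(Debug, Clone)]
struct Column {
    name: String,
    field_id: Option<i32>,
    kind: ColumnKind,
}

/// Placeholder reserved field id (an assumption, not Iceberg's real constant).
const ROW_NUMBER_FIELD_ID: i32 = i32::MAX - 1;

/// Placeholder reserved column name (an assumption, not Spark's real name).
const ROW_NUMBER_NAME: &str = "_row_number";

/// Scheme 1: the schema marks the column with a dedicated kind.
fn is_row_number_by_kind(c: &Column) -> bool {
    c.kind == ColumnKind::RowNumber
}

/// Scheme 2: the column is recognized by a reserved field id.
fn is_row_number_by_field_id(c: &Column) -> bool {
    c.field_id == Some(ROW_NUMBER_FIELD_ID)
}

/// Scheme 3: the column is recognized by a special name.
fn is_row_number_by_name(c: &Column) -> bool {
    c.name == ROW_NUMBER_NAME
}

/// Row numbers for a batch of `len` rows whose first row sits at
/// file-level position `first_row`.
fn row_numbers(first_row: i64, len: usize) -> Vec<i64> {
    (first_row..first_row + len as i64).collect()
}
```

   Whichever scheme is used, the reader ends up doing the same thing for a matching column: it emits positions such as `row_numbers(100, 3)` instead of decoding pages. The schemes differ only in how the request is expressed in the schema.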



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

