Re: [I] Support file row number in Parquet reader [arrow-rs]

via GitHub Wed, 22 Oct 2025 13:41:03 -0700


vustef commented on issue #7299:
URL: https://github.com/apache/arrow-rs/issues/7299#issuecomment-3434138532


   > Copying a comment I made in discord:
   > 
   > I recommend sketching out an "end to end" example that shows how the new API would work
   > 
   > For example, make an example similar to this one that shows how you would specify reading row numbers and how you would access those row numbers in the returned batch https://docs.rs/parquet/latest/parquet/arrow/index.html#example-reading-parquet-file-into-arrow-recordbatch
   
   Here's an example:
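   ```rust
   use std::collections::HashMap;
   use std::fs::File;

   use arrow_schema::{DataType as ArrowDataType, Field};
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   use parquet::arrow::PARQUET_FIELD_ID_META_KEY;
   // `RowNumber` and `with_row_number_column` are the proposed additions;
   // they do not exist yet, so there is no import path for them here.

   let file = File::open(path).unwrap();

   let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();

   let row_number_field = Field::new(
       "my_row_num_col",
       ArrowDataType::Int64,
       false,
   )
   // This is required: `with_row_number_column` won't accept a field without it.
   .with_extension_type(RowNumber::default())
   // Optional, just an example here.
   .with_metadata(HashMap::from([(
       PARQUET_FIELD_ID_META_KEY.to_string(),
       "2147483645".to_string(),
   )]));

   let builder = builder.with_row_number_column(row_number_field);

   // row_number_field will be included in the schema, added to the end of the list.
   println!("Converted arrow schema is: {}", builder.schema());

   // The reader is an iterator, so it must be mutable to call `next()`.
   let mut reader = builder.build().unwrap();

   let record_batch = reader.next().unwrap().unwrap();

   println!("Read {} records.", record_batch.num_rows());
   ```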
   
   Rough ideas behind this:
   * It builds upon the discussion at the PR 
([here](https://github.com/apache/arrow-rs/pull/7307#discussion_r2072470885))
   * New column is part of the schema. That makes the usage much easier, as the 
clients don't need to track this extra column.
   * Because this is a special column, we need to mark it as such. We use a new extension type for this.
   * Users also get the flexibility of fully specifying the field: name, metadata properties, etc. Type and nullability will be asserted, though. We can provide a helper function to construct this field, so that users don't have to pass `false` for nullability and `ArrowDataType::Int64` themselves.
   * To make this field part of the schema, the proposal is to use `builder.with_row_number_column(field)`. The alternative is to make users create the full schema and insert this field somewhere in it, but that isn't always user-friendly. Instead, `with_row_number_column` would append this field to the end of the `fields` list in the schema.
   * `with_row_number_column` should also modify `ArrowReaderBuilder::fields` to add a new field. I'm not sure what `field_type` it should have there. It probably needs a new one, so that the array reader builders build a special array reader that enumerates row positions; otherwise the information about the extension type would be lost at this point.
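
To make the last few points more concrete, here is a rough, std-only toy model of the intended behavior. Nothing in it is arrow-rs code: `Field`, `DataType`, the `is_row_number` flag (standing in for the proposed `RowNumber` extension type), the `row_number_field` helper and the `row_numbers` function are all hypothetical stand-ins. It also assumes that row numbers are file-level positions, so rows in skipped row groups still count.

```rust
// Toy model only: these types stand in for arrow-rs types and are NOT the real API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    // Hypothetical marker standing in for the proposed `RowNumber` extension type.
    is_row_number: bool,
}

/// Hypothetical helper: builds a correctly typed row-number field,
/// so callers don't have to spell out `Int64` and `false` themselves.
fn row_number_field(name: &str) -> Field {
    Field {
        name: name.to_string(),
        data_type: DataType::Int64,
        nullable: false,
        is_row_number: true,
    }
}

/// Models the proposed `with_row_number_column`: validate, then append to the end.
fn with_row_number_column(mut schema: Vec<Field>, field: Field) -> Result<Vec<Field>, String> {
    if !field.is_row_number {
        return Err(format!("field `{}` lacks the row-number extension type", field.name));
    }
    if field.data_type != DataType::Int64 || field.nullable {
        return Err(format!("field `{}` must be a non-nullable Int64", field.name));
    }
    schema.push(field);
    Ok(schema)
}

/// Models what a row-number array reader could emit: file-level row positions
/// for the selected row groups, given the row count of every row group in the file.
fn row_numbers(row_group_sizes: &[i64], selected: &[usize]) -> Vec<i64> {
    // The first row of each row group is the running sum of the preceding sizes.
    let mut starts = Vec::with_capacity(row_group_sizes.len());
    let mut next = 0i64;
    for size in row_group_sizes {
        starts.push(next);
        next += size;
    }
    selected
        .iter()
        .flat_map(|&rg| starts[rg]..starts[rg] + row_group_sizes[rg])
        .collect()
}

fn main() {
    let schema = vec![Field {
        name: "id".to_string(),
        data_type: DataType::Int32,
        nullable: false,
        is_row_number: false,
    }];
    let schema = with_row_number_column(schema, row_number_field("my_row_num_col")).unwrap();
    // The row-number field ends up last in the schema.
    println!("last field: {}", schema.last().unwrap().name);

    // Row groups of 3, 2 and 4 rows; reading only groups 0 and 2 skips rows 3 and 4.
    println!("{:?}", row_numbers(&[3, 2, 4], &[0, 2])); // [0, 1, 2, 5, 6, 7, 8]
}
```

The point of the toy `row_numbers` function is that the reader cannot simply count the rows it emits: it needs the sizes of all row groups, including skipped ones, to produce positions relative to the file.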
   
   Please let me know what you think.


