Re: [I] Support file row number in Parquet reader [arrow-rs]

via GitHub Thu, 23 Oct 2025 06:32:43 -0700


jkylling commented on issue #7299:
URL: https://github.com/apache/arrow-rs/issues/7299#issuecomment-3437023778


   > > Copying a comment I made in discord:
   > > I recommend sketching out an "end to end" example that shows how the new 
API would work
   > > For example, make an example similar to this one that shows how you 
would specify reading row numbers and how you would access those row numbers in 
the returned batch 
https://docs.rs/parquet/latest/parquet/arrow/index.html#example-reading-parquet-file-into-arrow-recordbatch
   > 
   > Here's an example:
   > 
   > let file = File::open(path).unwrap();
   > 
   > let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
   > 
   > let row_number_field = Field::new(
   >     "my_row_num_col",
   >     ArrowDataType::Int64,
   >     false,
   > )
   > .with_extension_type(RowNumber::default()) // this is required, 
`with_row_number_column` won't accept field without this.
   > .with_metadata(std::collections::HashMap::from([( // optional, just an 
example here
   >     PARQUET_FIELD_ID_META_KEY.to_string(),
   >     "2147483645",
   > )]));
   > 
   > let builder = builder.with_row_number_column(row_number_field);
   > 
   > // row_number_field will be included in the schema, added to the end of 
the list
   > println!("Converted arrow schema is: {}", builder.schema());
   > 
   > let reader = builder.build().unwrap();
   > 
   > let record_batch = reader.next().unwrap().unwrap();
   > 
   > println!("Read {} records.", record_batch.num_rows());
   > 
   > let row_number_col = 
record_batch.column_by_name("my_row_num_col").unwrap();
   > Rough ideas behind this:
   > 
   > * It builds upon the discussion at the PR 
([here](https://github.com/apache/arrow-rs/pull/7307#discussion_r2072470885))
   > * New column is part of the schema. That makes the usage much easier, as 
the clients don't need to track this extra column.
   > * Because this is a special column, we need to mark it as such. We use a 
new extension types for this.
   > * Users also get the flexibility of fully specifying the field - name, 
metadata properties, etc.. Type and nullability are going to be asserted 
though. We can provide a helper function to construct this field, to avoid 
having to pass `false` for nullability and `ArrowDataType::Int64`.
   > * To make this field part of the schema, proposal is to use 
`builder.with_row_number_column(field)`. The alternative is to make users 
create full schema and insert this field somewhere in it, but that doesn't seem 
user-friendly always. Rather, `with_row_number_column` would add this field to 
the end of the `fields` list in the schema.
   > * `with_row_number_column` should also modify 
`ArrowReaderBuilder::fields`, to add a new field. I'm not sure what 
`field_type` it should have there. Probably needs a new one, so that the array 
reader builders would build a special array reader, that enumerates row 
positions, and information about extension type would otherwise be lost at this 
point.
   > 
   > Please let me know what you think.
   
   This looks really good!
   
   How about:
   
   ```rust
    // The constructor of the RowNumber extension type is private, so this is 
the only way to create this field. It ensures that the type and nullability is 
always correct.
   let row_number_field = RowNumberField::new("my_row_num_col");
   ```
   
   This can be used with (1) (I believe this pattern is common in engines, even 
if not user friendly):
   
   ```rust
   let supplied_schema = Arc::new(Schema::new(vec![
       row_number_field,
   ]));
   let options = ArrowReaderOptions::new().with_schema(supplied_schema.clone());
   let mut builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
       file,
       options
   ).expect("Error if the schema is not compatible with the parquet file 
schema.");
   ```
   
   and (2)
   
   ```rust
   let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
   builder.with_metadata_columns([row_number_field]);
   ```
   
   Alternatively, we modify (2) to be:
   
   (3)
   ```rust
   let options = 
ArrowReaderOptions::new().with_metadata_columns([row_number_field]);
   let mut builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
       file,
       options
   )
   ```
   
   as this might simplify the changes to `ParquetRecordBatchReaderBuilder` (it 
might be unchanged?).
   
   This would allow us to do https://github.com/apache/arrow-rs/issues/8641 in 
the future without having to change the interface of `ArrowReaderOptions` or 
`ParquetRecordBatchReaderBuilder` further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Support file row number in Parquet reader [arrow-rs]

Reply via email to