jkylling commented on issue #7299: URL: https://github.com/apache/arrow-rs/issues/7299#issuecomment-3437023778
> > Copying a comment I made in discord: > > I recommend sketching out an "end to end" example that shows how the new API would work > > For example, make an example similar to this one that shows how you would specify reading row numbers and how you would access those row numbers in the returned batch https://docs.rs/parquet/latest/parquet/arrow/index.html#example-reading-parquet-file-into-arrow-recordbatch > > Here's an example: > > let file = File::open(path).unwrap(); > > let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap(); > > let row_number_field = Field::new( > "my_row_num_col", > ArrowDataType::Int64, > false, > ) > .with_extension_type(RowNumber::default()) // this is required, `with_row_number_column` won't accept field without this. > .with_metadata(std::collections::HashMap::from([( // optional, just an example here > PARQUET_FIELD_ID_META_KEY.to_string(), > "2147483645", > )])); > > let builder = builder.with_row_number_column(row_number_field); > > // row_number_field will be included in the schema, added to the end of the list > println!("Converted arrow schema is: {}", builder.schema()); > > let reader = builder.build().unwrap(); > > let record_batch = reader.next().unwrap().unwrap(); > > println!("Read {} records.", record_batch.num_rows()); > > let row_number_col = record_batch.column_by_name("my_row_num_col").unwrap(); > Rough ideas behind this: > > * It builds upon the discussion at the PR ([here](https://github.com/apache/arrow-rs/pull/7307#discussion_r2072470885)) > * New column is part of the schema. That makes the usage much easier, as the clients don't need to track this extra column. > * Because this is a special column, we need to mark it as such. We use a new extension types for this. > * Users also get the flexibility of fully specifying the field - name, metadata properties, etc.. Type and nullability are going to be asserted though. We can provide a helper function to construct this field, to avoid having to pass `false` for nullability and `ArrowDataType::Int64`. > * To make this field part of the schema, proposal is to use `builder.with_row_number_column(field)`. The alternative is to make users create full schema and insert this field somewhere in it, but that doesn't seem user-friendly always. Rather, `with_row_number_column` would add this field to the end of the `fields` list in the schema. > * `with_row_number_column` should also modify `ArrowReaderBuilder::fields`, to add a new field. I'm not sure what `field_type` it should have there. Probably needs a new one, so that the array reader builders would build a special array reader, that enumerates row positions, and information about extension type would otherwise be lost at this point. > > Please let me know what you think. This looks really good! How about: ```rust // The constructor of the RowNumber extension type is private, so this is the only way to create this field. It ensures that the type and nullability is always correct. let row_number_field = RowNumberField::new("my_row_num_col"); ``` This can be used with (1) (I believe this pattern is common in engines, even if not user friendly): ```rust let supplied_schema = Arc::new(Schema::new(vec![ row_number_field, ])); let options = ArrowReaderOptions::new().with_schema(supplied_schema.clone()); let mut builder = ParquetRecordBatchReaderBuilder::try_new_with_options( file, options ).expect("Error if the schema is not compatible with the parquet file schema."); ``` and (2) ```rust let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap(); builder.with_metadata_columns([row_number_field]); ``` Alternatively, we modify (2) to be: (3) ```rust let options = ArrowReaderOptions::new().with_metadata_columns([row_number_field]); let mut builder = ParquetRecordBatchReaderBuilder::try_new_with_options( file, options ) ``` as this might simplify the changes to `ParquetRecordBatchReaderBuilder` (it might be unchanged?). This would allow us to do https://github.com/apache/arrow-rs/issues/8641 in the future without having to change the interface of `ArrowReaderOptions` or `ParquetRecordBatchReaderBuilder` further. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
