vustef commented on issue #7299: URL: https://github.com/apache/arrow-rs/issues/7299#issuecomment-3434138532
> Copying a comment I made in discord:
>
> > I recommend sketching out an "end to end" example that shows how the new API would work
>
> For example, make an example similar to this one that shows how you would specify reading row numbers and how you would access those row numbers in the returned batch https://docs.rs/parquet/latest/parquet/arrow/index.html#example-reading-parquet-file-into-arrow-recordbatch

Here's an example:

```rust
use std::collections::HashMap;
use std::fs::File;

use arrow_schema::{DataType as ArrowDataType, Field};
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::PARQUET_FIELD_ID_META_KEY;
// `RowNumber` and `with_row_number_column` are the proposed additions; they do not exist yet.

let file = File::open(path).unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();

let row_number_field = Field::new("my_row_num_col", ArrowDataType::Int64, false)
    // Optional, just an example here. Set before the extension type, because
    // `with_metadata` replaces the whole metadata map.
    .with_metadata(HashMap::from([(
        PARQUET_FIELD_ID_META_KEY.to_string(),
        "2147483645".to_string(),
    )]))
    // Required: `with_row_number_column` won't accept a field without this.
    .with_extension_type(RowNumber::default());

// `row_number_field` is included in the schema, appended to the end of the field list.
let builder = builder.with_row_number_column(row_number_field);
println!("Converted arrow schema is: {}", builder.schema());

let mut reader = builder.build().unwrap();
let record_batch = reader.next().unwrap().unwrap();
println!("Read {} records.", record_batch.num_rows());
```

Rough ideas behind this:

* It builds upon the discussion at the PR ([here](https://github.com/apache/arrow-rs/pull/7307#discussion_r2072470885))
* The new column is part of the schema. That makes usage much easier, as clients don't need to track this extra column separately.
* Because this is a special column, we need to mark it as such. We use a new extension type for this.
* Users also get the flexibility of fully specifying the field: name, metadata properties, etc. Type and nullability are going to be asserted, though. We can provide a helper function to construct this field, to avoid having to pass `false` for nullability and `ArrowDataType::Int64`.
* To make this field part of the schema, the proposal is to use `builder.with_row_number_column(field)`. The alternative is to make users create the full schema and insert this field somewhere in it, but that isn't always user-friendly. Instead, `with_row_number_column` would add this field to the end of the `fields` list in the schema.
* `with_row_number_column` should also modify `ArrowReaderBuilder::fields` to add a new field. I'm not sure what `field_type` it should have there. It probably needs a new one, so that the array reader builders would build a special array reader that enumerates row positions; the information about the extension type would otherwise be lost at this point.

Please let me know what you think.
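To make the last few bullets a bit more concrete, here is a rough, std-only sketch of the intended behaviour. It is not arrow-rs code: `Field`, `DataType`, `row_number_field`, `with_row_number_column` and `row_numbers` below are hypothetical stand-ins, and the `is_row_number` flag merely plays the role of the extension type marker. It assumes row numbers are absolute positions in the file, so skipping a row group leaves a gap.

```rust
// Hypothetical stand-ins for the arrow types; none of this is the real arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    is_row_number: bool, // stands in for the proposed extension type marker
}

/// Hypothetical helper so callers don't have to pass `Int64` and `false` themselves.
fn row_number_field(name: &str) -> Field {
    Field {
        name: name.to_string(),
        data_type: DataType::Int64,
        nullable: false,
        is_row_number: true,
    }
}

/// Asserts marker, type and nullability, then appends the field to the end of the list.
fn with_row_number_column(mut fields: Vec<Field>, field: Field) -> Result<Vec<Field>, String> {
    if !field.is_row_number {
        return Err(format!("field `{}` is not marked as a row number column", field.name));
    }
    if field.data_type != DataType::Int64 || field.nullable {
        return Err(format!("field `{}` must be a non-nullable Int64", field.name));
    }
    fields.push(field);
    Ok(fields)
}

/// What a special array reader might produce: absolute row positions for the
/// selected row groups, given the row count of every row group in the file.
fn row_numbers(row_group_sizes: &[i64], selected: &[usize]) -> Vec<i64> {
    // The first row of each row group is the sum of the sizes of the groups before it.
    let mut starts = Vec::with_capacity(row_group_sizes.len());
    let mut next = 0i64;
    for &n in row_group_sizes {
        starts.push(next);
        next += n;
    }
    selected
        .iter()
        .flat_map(|&g| starts[g]..starts[g] + row_group_sizes[g])
        .collect()
}

fn main() {
    let schema = vec![Field {
        name: "value".to_string(),
        data_type: DataType::Utf8,
        nullable: true,
        is_row_number: false,
    }];
    let schema = with_row_number_column(schema, row_number_field("my_row_num_col")).unwrap();
    println!("last field: {}", schema.last().unwrap().name);

    // Row groups of 3, 2 and 4 rows; reading only groups 0 and 2.
    println!("{:?}", row_numbers(&[3, 2, 4], &[0, 2]));
}
```

The point of the sketch is only that validation and appending happen in one place, and that the reader needs the per-row-group row counts to compute positions. The real `field_type` question is left open.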
