jkylling commented on issue #7299: URL: https://github.com/apache/arrow-rs/issues/7299#issuecomment-3437532690
> [@jkylling](https://github.com/jkylling) I like the idea of generalizing with `with_metadata_columns` instead of `with_row_number_column`. We need some sort of validation that RowNumber is a metadata column, I hope extension types implementation can have that. > > (1) is good, but it's not very user-friendly for "SELECT *" kind of scans. Could we get easily get schema and extend it with a new column? > > On second though, perhaps it's good to separate metadata columns from `with_schema`, i.e. to make it impossible to pass them there, but only through `with_metadata_columns`. Behind the scenes, they should be unified in a single `schema`, but they don't overwrite each other (this means that the builder has to keep track of these options separately, so that if any of them is updated, the combined schema is correctly updated). The downside here is that it has to have some default semantics where these metadata columns are added, e.g. at the end of regulard columns. I think we should allow sending it in the regular schema. From the point of view of the user of the `ParquetRecordBatchReader` it should look like the Parquet file has a physical row number column. This is based on experimenting with the PR to add support for reading deletion vectors to delta-rs. In delta-rs there is a bit of logic to construct different types of schemas to use during a scan of a table (see https://github.com/delta-io/delta-rs/blob/acfaee5c07e47e9dadb3bbb5201ab75b245bf9b1/crates/core/src/delta_datafusion/table_provider.rs#L422-L700. I believe iceberg-rust is similar). If the row number column can be treated as an ordinary column, I believe deletion vector support would be simpler. > > let row_number_field = RowNumberField::new("my_row_num_col"); > > Regarding this, I like the approach, but as utility. I think users could always do `field.with_nullable(true)` and violate the intention. So the question is where does the check for expected properties happen? Perhaps once we call `with_metadata_columns`? Hmm, that's a good point. Maybe we could have it be `FieldRef` instead of `Field`? Though, that might limit other use cases, so might not be worth it. Validation sounds better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
