jkylling commented on issue #7299:
URL: https://github.com/apache/arrow-rs/issues/7299#issuecomment-3437532690

   > [@jkylling](https://github.com/jkylling) I like the idea of generalizing 
with `with_metadata_columns` instead of `with_row_number_column`. We need some 
sort of validation that RowNumber is a metadata column, I hope extension types 
implementation can have that.
   > 
   > (1) is good, but it's not very user-friendly for "SELECT *" kind of scans. 
Could we get easily get schema and extend it with a new column?
   >
   > On second though, perhaps it's good to separate metadata columns from 
`with_schema`, i.e. to make it impossible to pass them there, but only through 
`with_metadata_columns`. Behind the scenes, they should be unified in a single 
`schema`, but they don't overwrite each other (this means that the builder has 
to keep track of these options separately, so that if any of them is updated, 
the combined schema is correctly updated). The downside here is that it has to 
have some default semantics where these metadata columns are added, e.g. at the 
end of regulard columns.
   
   I think we should allow sending it in the regular schema. From the point of 
view of the user of the `ParquetRecordBatchReader` it should look like the 
Parquet file has a physical row number column. This is based on experimenting 
with the PR to add support for reading deletion vectors to delta-rs. In 
delta-rs there is a bit of logic to construct different types of schemas to use 
during a scan of a table (see  
https://github.com/delta-io/delta-rs/blob/acfaee5c07e47e9dadb3bbb5201ab75b245bf9b1/crates/core/src/delta_datafusion/table_provider.rs#L422-L700.
 I believe iceberg-rust is similar). If the row number column can be treated as 
an ordinary column, I believe deletion vector support would be simpler.
    
   > > let row_number_field = RowNumberField::new("my_row_num_col");
   > 
   > Regarding this, I like the approach, but as utility. I think users could 
always do `field.with_nullable(true)` and violate the intention. So the 
question is where does the check for expected properties happen? Perhaps once 
we call `with_metadata_columns`?
   
   Hmm, that's a good point. Maybe we could have it be `FieldRef` instead of 
`Field`? Though, that might limit other use cases, so might not be worth it. 
Validation sounds better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to