jkylling opened a new pull request, #20133:
URL: https://github.com/apache/datafusion/pull/20133

   It would be useful to expose the virtual columns of the arrow Parquet 
reader, which we added in https://github.com/apache/arrow-rs/pull/8715, through 
the datasource-parquet `ParquetSource`. Engines could then use both 
DataFusion's partition value machinery and the virtual columns. I made an 
attempt at it in this PR, but hit some rough edges. This is closer to an issue 
than a PR, but it is easier to explain with code.
   
   The virtual columns we added are a bit difficult to integrate cleanly today. 
They are part of the physical schema of the Parquet reader, but cannot 
currently be projected. We need some additional handling to avoid predicate 
pushdown for virtual columns, to build the correct projection mask, and to 
build the correct stream schema. See the changes to `opener.rs` in this PR.
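   
   To make the three pieces of handling concrete, here is a minimal, 
std-only sketch. It is not the code in `opener.rs`: `Column`, `ScanPlan`, 
`plan_scan` and `can_push_down` are hypothetical names, and the sketch assumes 
that the stream schema lists the projected columns in the requested order.

```rust
// Hypothetical stand-in types; this is not the arrow-rs or DataFusion API.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
    is_virtual: bool,
}

/// What a scan needs to know once virtual columns are separated out.
#[derive(Debug, PartialEq)]
struct ScanPlan {
    /// Indices among the physical columns only, used for the projection mask.
    physical_indices: Vec<usize>,
    /// Names of the virtual columns the reader has to generate.
    virtual_columns: Vec<String>,
    /// Column names of the output stream (assumed: requested order).
    stream_schema: Vec<String>,
}

/// Splits a projection over a mixed schema into physical and virtual parts.
/// Panics if a projection index is out of range for `schema`.
fn plan_scan(schema: &[Column], projection: &[usize]) -> ScanPlan {
    // Map each schema position to its position among physical columns.
    let mut physical_pos: Vec<Option<usize>> = Vec::with_capacity(schema.len());
    let mut next = 0usize;
    for col in schema {
        if col.is_virtual {
            physical_pos.push(None);
        } else {
            physical_pos.push(Some(next));
            next += 1;
        }
    }

    let mut plan = ScanPlan {
        physical_indices: Vec::new(),
        virtual_columns: Vec::new(),
        stream_schema: Vec::new(),
    };
    for &i in projection {
        let col = &schema[i];
        match physical_pos[i] {
            Some(p) => plan.physical_indices.push(p),
            None => plan.virtual_columns.push(col.name.clone()),
        }
        plan.stream_schema.push(col.name.clone());
    }
    plan
}

/// A predicate is only pushed down if it references no virtual column.
fn can_push_down(schema: &[Column], predicate_columns: &[usize]) -> bool {
    predicate_columns.iter().all(|&i| !schema[i].is_virtual)
}
```

   With a schema of `id`, `row_index` (virtual) and `value`, projecting 
`[2, 1]` in this sketch yields a projection mask over physical index `1` only, 
one virtual column to generate, and a predicate on `row_index` is kept out of 
pushdown.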
   
   One alternative would be to modify the arrow-rs implementation so that 
these workarounds are no longer needed. Then the only change to `opener.rs` 
would be `.with_virtual_columns(virtual_columns.to_vec())?` (and maybe even 
that could be avoided? See the discussion below).
   
   What would be the best way forward here?
   
   ## Aside on `.with_virtual_columns`
   
   It is redundant that the user must both declare the field as 
`Field::new("row_index", DataType::Int64, false).with_extension_type(RowNumber)` 
and register the column separately in the reader options with 
`.with_virtual_columns(virtual_columns.to_vec())?`. Once the extension type 
`RowNumber` is attached, we already know that the column is virtual.
   
   All users of `TableSchema`/`ParquetSource` must know that a schema is 
built out of three parts: the physical Parquet columns, the virtual columns and 
the partition columns. From the user's perspective, it would be simpler to 
supply a single schema.
   
   One alternative is to indicate the column kind using extension types only, 
so that the user supplies just a schema. That is, an extension type would mark 
a column as a partition column or a virtual column, instead of the user 
supplying this information piecemeal. This may have a performance impact, as we 
would likely need to scan the schema for the extension-typed columns during 
planning, which could be problematic for large schemas.
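   
   A rough, std-only sketch of that idea follows. Arrow stores a field's 
extension type name in the field metadata under `ARROW:extension:name`; 
everything else here is an assumption for illustration, including the `Field` 
stand-in, the two extension names, and the `column_kind` and `split_schema` 
helpers.

```rust
use std::collections::HashMap;

// Metadata key Arrow uses to record a field's extension type name.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";
// Hypothetical extension names, made up for this sketch.
const VIRTUAL_EXT: &str = "example.virtual_column";
const PARTITION_EXT: &str = "example.partition_column";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnKind {
    Physical,
    Virtual,
    Partition,
}

// Hypothetical stand-in for an Arrow field: a name plus string metadata.
#[derive(Debug, Clone)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

/// Builds a field, optionally tagged with an extension type name.
fn field(name: &str, extension: Option<&str>) -> Field {
    let mut metadata = HashMap::new();
    if let Some(ext) = extension {
        metadata.insert(EXTENSION_NAME_KEY.to_string(), ext.to_string());
    }
    Field { name: name.to_string(), metadata }
}

/// Derives the column kind from the extension type alone.
fn column_kind(field: &Field) -> ColumnKind {
    match field.metadata.get(EXTENSION_NAME_KEY).map(String::as_str) {
        Some(VIRTUAL_EXT) => ColumnKind::Virtual,
        Some(PARTITION_EXT) => ColumnKind::Partition,
        _ => ColumnKind::Physical,
    }
}

#[derive(Debug, Default, PartialEq)]
struct SplitSchema {
    physical: Vec<String>,
    virtual_cols: Vec<String>,
    partition: Vec<String>,
}

/// One pass over the user-supplied schema; cost grows linearly with its width.
fn split_schema(fields: &[Field]) -> SplitSchema {
    let mut out = SplitSchema::default();
    for f in fields {
        match column_kind(f) {
            ColumnKind::Physical => out.physical.push(f.name.clone()),
            ColumnKind::Virtual => out.virtual_cols.push(f.name.clone()),
            ColumnKind::Partition => out.partition.push(f.name.clone()),
        }
    }
    out
}
```

   The split is a single linear pass with one metadata lookup per field, 
which is the per-plan cost that could matter for very wide schemas unless the 
result is cached alongside the schema.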


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

