friendlymatthew commented on PR #17242:
URL: https://github.com/apache/datafusion/pull/17242#issuecomment-3214841600
I was able to identify the root cause behind several test failures related
to roundtripping physical plans in `datafusion/proto`
<details>
<summary>impacted test cases</summary>
`cases::roundtrip_physical_plan::roundtrip_empty_projection`
`cases::roundtrip_physical_plan::roundtrip_parquet_select_projection`
`cases::roundtrip_physical_plan::roundtrip_parquet_select_projection_predicate`
`cases::roundtrip_physical_plan::roundtrip_parquet_select_star`
`cases::roundtrip_physical_plan::roundtrip_parquet_select_star_predicate`
`cases::roundtrip_physical_plan::test_round_trip_date_part_display`
`cases::roundtrip_physical_plan::test_round_trip_groups_display`
`cases::roundtrip_physical_plan::test_round_trip_human_display`
`cases::roundtrip_physical_plan::test_round_trip_tpch_queries`
</details>
_tl;dr, we expect `FileScanConfig` to be lossless when converting to/from
protobuf. However, `FileSource` is **not** lossless. Now that `FileSource`
implements `DataSource`, roundtripping physical plans loses some information,
which is expected-- but the current `assert_eq` fails because of it._
On `main`, `FileScanConfig` implements `DataSource` directly, and when we
roundtrip physical plans, we compare the `Display` output of `FileScanConfig`.
Although `FileScanConfig` holds a file source struct, we display minimal info
(e.g, source type: `"parquet"`). However in this PR, the ownership model
changes: now all `FileSource` structs own a `FileScanConfig`, and they
implement `DataSource` directly. Because we `#derive(Debug)` for `FileSource`
types, the debug output includes fields that are not preserved through
serialization.
For example, in `ParquetSource`, the `parquet_file_reader_factory` field is
**not** retained after deserialization. [Only the `TableParquetOptions are
serialized and
restored](https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/proto/src/physical_plan/mod.rs#L728-L742);
everything else is set to `None`.
## Thoughts on fixes
I see 2 possible directions for resolving the test failures:
1. Loosen the
[assertion](https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/proto/tests/cases/roundtrip_physical_plan.rs#L127-L151)
when verifying `DataSourceExec`
Instead of comparing the entire debug output, we could fall back to
comparing the `Display` output of the underlying `FileScanConfig`, as is done
on `main`. This would ignore non-serialized fields
2. Encode extra metadata into `TableParquetOptions`
We could explicitly encode some of the additional fields (e.g. presence of a
custom reader factory) into the serialized metadata. __But__ this feels
futile-- since these fields aren't actually restored on deserialization,
tracking their prior existence is extra overhead
I'm leaning towards 1 since it reflects the current lossless boundary and
avoids overcomplicating serialization. But I'm curious if people have any
opinions
cc @adriangb @berkaysynnada @mbrubeck @xudong963 @comphead @blaginin
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]