friendlymatthew commented on PR #17242: URL: https://github.com/apache/datafusion/pull/17242#issuecomment-3214841600
I was able to identify the root cause behind several test failures related to roundtripping physical plans in `datafusion/proto` <details> <summary>impacted test cases</summary> `cases::roundtrip_physical_plan::roundtrip_empty_projection` `cases::roundtrip_physical_plan::roundtrip_parquet_select_projection` `cases::roundtrip_physical_plan::roundtrip_parquet_select_projection_predicate` `cases::roundtrip_physical_plan::roundtrip_parquet_select_star` `cases::roundtrip_physical_plan::roundtrip_parquet_select_star_predicate` `cases::roundtrip_physical_plan::test_round_trip_date_part_display` `cases::roundtrip_physical_plan::test_round_trip_groups_display` `cases::roundtrip_physical_plan::test_round_trip_human_display` `cases::roundtrip_physical_plan::test_round_trip_tpch_queries` </details> _tl;dr, we expect `FileScanConfig` to be lossless when converting to/from protobuf. However, `FileSource` is **not** lossless. Now that `FileSource` implements `DataSource`, roundtripping physical plans loses some information, which is expected-- but the current `assert_eq` fails because of it._ On `main`, `FileScanConfig` implements `DataSource` directly, and when we roundtrip physical plans, we compare the `Display` output of `FileScanConfig`. Although `FileScanConfig` holds a file source struct, we display minimal info (e.g, source type: `"parquet"`). However in this PR, the ownership model changes: now all `FileSource` structs own a `FileScanConfig`, and they implement `DataSource` directly. Because we `#derive(Debug)` for `FileSource` types, the debug output includes fields that are not preserved through serialization. For example, in `ParquetSource`, the `parquet_file_reader_factory` field is **not** retained after deserialization. [Only the `TableParquetOptions are serialized and restored](https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/proto/src/physical_plan/mod.rs#L728-L742); everything else is set to `None`. ## Thoughts on fixes I see 2 possible directions for resolving the test failures: 1. Loosen the [assertion](https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/proto/tests/cases/roundtrip_physical_plan.rs#L127-L151) when verifying `DataSourceExec` Instead of comparing the entire debug output, we could fall back to comparing the `Display` output of the underlying `FileScanConfig`, as is done on `main`. This would ignore non-serialized fields 2. Encode extra metadata into `TableParquetOptions` We could explicitly encode some of the additional fields (e.g. presence of a custom reader factory) into the serialized metadata. __But__ this feels futile-- since these fields aren't actually restored on deserialization, tracking their prior existence is extra overhead I'm leaning towards 1 since it reflects the current lossless boundary and avoids overcomplicating serialization. But I'm curious if people have any opinions cc @adriangb @berkaysynnada @mbrubeck @xudong963 @comphead @blaginin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org