friendlymatthew commented on PR #17242:
URL: https://github.com/apache/datafusion/pull/17242#issuecomment-3214841600

   I was able to identify the root cause behind several test failures related 
to roundtripping physical plans in `datafusion/proto`
   
   <details>
   <summary>impacted test cases</summary>
   
     `cases::roundtrip_physical_plan::roundtrip_empty_projection`
     `cases::roundtrip_physical_plan::roundtrip_parquet_select_projection`
     
`cases::roundtrip_physical_plan::roundtrip_parquet_select_projection_predicate`
     `cases::roundtrip_physical_plan::roundtrip_parquet_select_star`
     `cases::roundtrip_physical_plan::roundtrip_parquet_select_star_predicate`
     `cases::roundtrip_physical_plan::test_round_trip_date_part_display`
     `cases::roundtrip_physical_plan::test_round_trip_groups_display`
     `cases::roundtrip_physical_plan::test_round_trip_human_display`
     `cases::roundtrip_physical_plan::test_round_trip_tpch_queries`
   
   </details>
   
   _tl;dr, we expect `FileScanConfig` to be lossless when converting to/from 
protobuf. However, `FileSource` is **not** lossless. Now that `FileSource` 
implements `DataSource`, roundtripping physical plans loses some information, 
which is expected-- but the current `assert_eq` fails because of it._
   
   On `main`, `FileScanConfig` implements `DataSource` directly, and when we 
roundtrip physical plans, we compare the `Display` output of `FileScanConfig`. 
Although `FileScanConfig` holds a file source struct, we display minimal info 
(e.g, source type: `"parquet"`). However in this PR, the ownership model 
changes: now all `FileSource` structs own a `FileScanConfig`, and they 
implement `DataSource` directly. Because we `#derive(Debug)` for `FileSource` 
types, the debug output includes fields that are not preserved through 
serialization. 
   
   For example, in `ParquetSource`, the `parquet_file_reader_factory` field is 
**not** retained after deserialization. [Only the `TableParquetOptions are 
serialized and 
restored](https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/proto/src/physical_plan/mod.rs#L728-L742);
 everything else is set to `None`. 
   
   ## Thoughts on fixes
   
   I see 2 possible directions for resolving the test failures: 
   
   1. Loosen the 
[assertion](https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/proto/tests/cases/roundtrip_physical_plan.rs#L127-L151)
 when verifying `DataSourceExec`
   Instead of comparing the entire debug output, we could fall back to 
comparing the `Display` output of the underlying `FileScanConfig`, as is done 
on `main`. This would ignore non-serialized fields
   
   2. Encode extra metadata into `TableParquetOptions`
   We could explicitly encode some of the additional fields (e.g. presence of a 
custom reader factory) into the serialized metadata. __But__ this feels 
futile-- since these fields aren't actually restored on deserialization, 
tracking their prior existence is extra overhead
   
   
   I'm leaning towards 1 since it reflects the current lossless boundary and 
avoids overcomplicating serialization. But I'm curious if people have any 
opinions
   
   cc @adriangb @berkaysynnada @mbrubeck @xudong963 @comphead @blaginin


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to