JoshElkind opened a new pull request, #20284:
URL: https://github.com/apache/datafusion/pull/20284

   ## Which issue does this PR close?
   
   - Closes #20280.
   
   ## Rationale for this change
   
   Physical plans that read Arrow files (.arrow / IPC) could not be serialized 
or deserialized via the proto layer. PhysicalPlanNode already had scan nodes 
for Parquet, CSV, JSON, Avro, and in-memory sources, but not for Arrow, so a 
DataSourceExec using ArrowSource was not round-trippable. That blocked use 
cases like distributing plans that scan Arrow files (e.g. Ballista). This 
change adds Arrow scan to the proto layer so those plans can be serialized and 
deserialized like the other file formats.
   
   ## What changes are included in this PR?
   
   Proto: Added ArrowScanExecNode (with FileScanExecConf base_conf) and 
arrow_scan = 38 to the PhysicalPlanNode oneof in datafusion.proto.
   
   Generated code: Updated prost.rs and pbjson.rs to include ArrowScanExecNode 
and the ArrowScan variant (manual edits; protoc was not run).
   
   To-proto: In try_from_data_source_exec, when the data source is a 
FileScanConfig whose file source is ArrowSource, it is now serialized as 
ArrowScanExecNode.
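
The to-proto dispatch described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the code in this PR: the trait, structs, and the function name `try_arrow_scan_to_proto` are hypothetical stand-ins that only borrow the names of the DataFusion types, and the real `FileScanExecConf` carries far more than a list of files.

```rust
use std::any::Any;

// Hypothetical, stripped-down stand-ins for the DataFusion types named above.
trait FileSource {
    fn as_any(&self) -> &dyn Any;
}

struct ArrowSource;
struct CsvSource;

impl FileSource for ArrowSource {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl FileSource for CsvSource {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

struct FileScanConfig {
    file_source: Box<dyn FileSource>,
    files: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct FileScanExecConf {
    files: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct ArrowScanExecNode {
    base_conf: Option<FileScanExecConf>,
}

#[derive(Debug, PartialEq)]
enum PhysicalPlanType {
    ArrowScan(ArrowScanExecNode),
}

// Returns Some(..) only when the file source is an ArrowSource, so a caller
// can fall through to the other formats (Parquet, CSV, ...) on None.
fn try_arrow_scan_to_proto(conf: &FileScanConfig) -> Option<PhysicalPlanType> {
    // Bail out with None if the source is not an ArrowSource.
    conf.file_source.as_any().downcast_ref::<ArrowSource>()?;
    Some(PhysicalPlanType::ArrowScan(ArrowScanExecNode {
        base_conf: Some(FileScanExecConf {
            files: conf.files.clone(),
        }),
    }))
}
```

In this sketch a config backed by `ArrowSource` yields an `ArrowScan` variant, while one backed by `CsvSource` yields `None` and is left for the existing CSV branch.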
   
   From-proto: Implemented try_into_arrow_scan_physical_plan to deserialize 
ArrowScanExecNode into DataSourceExec with ArrowSource; missing base_conf 
returns an explicit error (no .unwrap()).
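
As an illustration of the "explicit error instead of .unwrap()" point, here is a minimal sketch of the pattern. It uses hypothetical stand-in types and a plain `String` error rather than DataFusion's own error type, and the function name `try_into_arrow_scan` is shortened for the example; it is not the implementation in this PR.

```rust
// Hypothetical stand-ins; the real messages are generated from datafusion.proto.
#[derive(Debug, Clone, PartialEq)]
struct FileScanExecConf {
    files: Vec<String>,
}

#[derive(Debug, Clone, PartialEq, Default)]
struct ArrowScanExecNode {
    // Message-typed fields are optional on the wire, so this can be None.
    base_conf: Option<FileScanExecConf>,
}

// Stand-in for a DataSourceExec configured with an ArrowSource.
#[derive(Debug, PartialEq)]
struct ArrowDataSourceExec {
    files: Vec<String>,
}

fn try_into_arrow_scan(node: &ArrowScanExecNode) -> Result<ArrowDataSourceExec, String> {
    // A missing base_conf becomes an Err the caller can report,
    // rather than a panic from .unwrap().
    let base_conf = node
        .base_conf
        .as_ref()
        .ok_or_else(|| "ArrowScanExecNode is missing required field base_conf".to_string())?;

    Ok(ArrowDataSourceExec {
        files: base_conf.files.clone(),
    })
}
```

With this shape, a malformed or truncated message surfaces as an ordinary error on the deserializing side instead of aborting the process.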
   
   Test: Added roundtrip_arrow_scan in roundtrip_physical_plan.rs to assert 
Arrow scan plans round-trip correctly.
   
   ## Are these changes tested?
   
   Yes. A new test roundtrip_arrow_scan builds a physical plan that scans Arrow 
files, serializes it to bytes and deserializes it back, and asserts the 
round-tripped plan matches the original. The full cargo test -p 
datafusion-proto suite (150 tests: unit, integration, and doc tests) passes, 
including all existing roundtrip and serialization tests.
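
The shape of such a round-trip check can be sketched as below. This is a toy illustration only: `ArrowScanPlan`, `plan_to_bytes`, `plan_from_bytes`, and `roundtrips` are hypothetical names, and the newline-joined encoding stands in for the protobuf encoding purely to keep the example self-contained. It does not reproduce roundtrip_arrow_scan.

```rust
// Hypothetical stand-in for a physical plan that scans Arrow files.
#[derive(Debug, Clone, PartialEq)]
struct ArrowScanPlan {
    files: Vec<String>,
}

// Toy encoding (NOT protobuf): one path per line.
// Assumes paths are non-empty and contain no newline characters.
fn plan_to_bytes(plan: &ArrowScanPlan) -> Vec<u8> {
    plan.files.join("\n").into_bytes()
}

fn plan_from_bytes(bytes: &[u8]) -> Result<ArrowScanPlan, String> {
    let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    let files = if text.is_empty() {
        Vec::new()
    } else {
        text.split('\n').map(str::to_string).collect()
    };
    Ok(ArrowScanPlan { files })
}

// Serialize, deserialize, and compare against the original plan.
fn roundtrips(plan: &ArrowScanPlan) -> bool {
    match plan_from_bytes(&plan_to_bytes(plan)) {
        Ok(decoded) => &decoded == plan,
        Err(_) => false,
    }
}
```

The comparison is made on the decoded value rather than on the bytes, which mirrors the description above: the test asserts that the round-tripped plan matches the original.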
   
   ## Are there any user-facing changes?
   
   No. This only extends the existing physical-plan proto support to Arrow 
scan. Callers that already serialize/deserialize physical plans (e.g. for 
distributed execution) can now round-trip plans that read Arrow files in 
addition to Parquet, CSV, JSON, and Avro, with no API or behavioral changes for 
existing usage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

