thinkharderdev commented on issue #7303:
URL: 
https://github.com/apache/arrow-datafusion/issues/7303#issuecomment-1785299825

   > I'm not sure it would be desirable to extend `PhysicalExtensionCodec` to 
include functions for handling `DataSink`s as not all extension types would 
deal with `DataSink`s. And having the encoding methods instead in `DataSink` 
itself I'm also unsure about, as it might be better to handle proto serde 
within the `datafusion-proto` crate itself.
   > 
   > I have in mind three solutions.
   > 
   > **1**
   > 
   > Leave as is, and require anyone desiring this behaviour to implement a 
`PhysicalExtensionNode` on their side with an accompanying 
`PhysicalExtensionCodec` implementation. This is probably the least desirable 
option considering `FileSinkExec` (aka `InsertExec`) is provided by DataFusion, 
so one might expect it to support proto SerDe.
   > 
   > **2**
   > 
   > There are currently 3 implementations of `DataSink` provided by DataFusion:
   > 
   > * `ParquetSink`
   > * `CsvSink`
   > * `JsonSink`
   > 
   > (`MemSink` I'm not counting as I don't think we can support proto SerDe 
for it...)
   > 
   > So we could introduce three new `PhysicalPlanNode` types into proto:
   > 
   > ```proto
   > message ParquetFileSinkExecNode {
   >   ...
   > }
   > 
   > message CsvFileSinkExecNode {
   >   ...
   > }
   > 
   > message JsonFileSinkExecNode {
   >   ...
   > }
   > ```
   > 
   > These represent the default supported `FileSinkExec`s by DataFusion, and 
if a consumer has their own version of `DataSink` implemented for a 
`FileSinkExec` (I think that's possible?), then they would need to rely on 
implementing it via `PhysicalExtensionNode` with an accompanying 
`PhysicalExtensionCodec`.
   > 
   > **3**
   > 
   > Instead of smashing the `FileSinkExec` and `DataSink`s together in option 
2, we would separate them:
   > 
   > ```proto
   > message FileSinkExec {
   >   PhysicalPlanNode input = 1;
   >   DataSink sink = 2;
   >   ...
   > }
   > 
   > message DataSink {
   >   oneof DataSinkImpl {
   >     ParquetSink parquet_sink = 1;
   >     CsvSink csv_sink = 2;
   >     JsonSink json_sink = 3;
   >     DataSinkExtension extension = 4;
   >   }
   > }
   > 
   > message ParquetSink {
   >   ...
   > }
   > 
   > message CsvSink {
   >   ...
   > }
   > 
   > message JsonSink {
   >   ...
   > }
   > 
   > message DataSinkExtension {
   >   bytes sink = 1;
   > }
   > ```
   > 
   > This would more accurately model how it's represented in DataFusion, but 
has the disadvantage of requiring a new extension type, which presumably would 
need a new codec as well (similar to how `PhysicalExtensionNode` requires 
`PhysicalExtensionCodec`)
   > 
   > I'm taking a stab at the third option (though maybe the second option 
would be better? 🤔), but would greatly appreciate any insight on the matter.
   
   FWIW option 2 is what we are doing now and it works easily enough. But I 
think option 3 sounds pretty good to me. But that seems like it would basically 
just boil down to adding methods to `PhysicalExtensionCodec`? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to