bkietz commented on issue #40583:
URL: https://github.com/apache/arrow/issues/40583#issuecomment-2008193391

   > I believe that the Skyhook file format (as implemented) is a contract that 
when files are written in a given format (arrow IPC) using a standard posix 
filesystem, they can be read using a different interface (arrow dataset).
   
   The problem is that the contract of skyhook as written is not the contract 
of a file format. File formats are intended to be an orthogonal detail to I/O. 
By analogy, it would be possible to write a mutable subclass of `arrow::Array`, 
but breaking the contract of immutability would severely restrict its usage 
in the arrow library.
   
   > To rephrase your proposal, I believe you're suggesting we implement a 
custom operator (SkyhookSourceNode) to facilitate data flow between the client 
and server over the network. The input to the custom operator is executed by 
Skyhook and the results are handled by the client node.
   
   Yes, SkyhookSourceNode is a client-side node which proxies the server-side 
plan in the client-side plan.
   
   > (every FileFragment is an independent access to a potentially distinct 
ceph server)
   
   I would recommend instead having a 1:1 relationship between 
SkyhookSourceNodes and Ceph servers. If multiple Ceph servers are in play, a 
UnionNode can be used in the client-side plan to concatenate their streams.
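   
   As a rough, library-free sketch of that shape: one proxy source per server, 
each holding an opaque serialized server-side plan, with a union step 
concatenating their outputs. Every name here (`Batch`, `Transport`, 
`SkyhookSourceSketch`, `UnionSketch`) is hypothetical and only illustrates the 
structure; none of it is Acero's actual ExecNode API.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch (real code would use arrow::RecordBatch).
using Batch = std::vector<int>;

// Hypothetical transport: sends a serialized server-side plan to one server
// and returns the batches that server produced.
using Transport = std::function<std::vector<Batch>(
    const std::string& server, const std::string& serialized_plan)>;

// Client-side proxy for a plan which executes on exactly one server.
// Only the server-side plan is serialized; this node itself never is.
class SkyhookSourceSketch {
 public:
  SkyhookSourceSketch(std::string server, std::string serialized_plan,
                      Transport transport)
      : server_(std::move(server)),
        serialized_plan_(std::move(serialized_plan)),
        transport_(std::move(transport)) {
    assert(transport_ && "a transport is required");
  }

  // Run the server-side plan remotely and surface its output to the client plan.
  std::vector<Batch> Run() const { return transport_(server_, serialized_plan_); }

 private:
  std::string server_;
  std::string serialized_plan_;
  Transport transport_;
};

// Stand-in for a union node: concatenates the streams of all sources.
// This sketch visits sources sequentially; a real union node may interleave
// batches from its inputs.
std::vector<Batch> UnionSketch(const std::vector<SkyhookSourceSketch>& sources) {
  std::vector<Batch> out;
  for (const auto& source : sources) {
    std::vector<Batch> batches = source.Run();
    out.insert(out.end(), batches.begin(), batches.end());
  }
  return out;
}
```

   With two servers, the client-side plan would hold two such sources feeding 
one union, and anything downstream sees a single stream.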
   
   > would you know how Acero serializes custom operators into substrait?
   
   The design above does not require serializing custom nodes, since it only 
requires serialization of the server-side plan. The SkyhookSourceNode will only 
appear in the client-side plan.
   
   > I also want to note that pushing any more operators below the 
SkyhookSourceNode would be done blindly without some type of optimizer
   
   Optimization and other restructuring of plans is not currently in scope for 
Acero. Instead, the exact plan specified is what will be executed. The 
~~roadmap~~ hope is that Acero's Substrait support will continue to improve and 
that cross-engine plan optimizers will be written against Substrait.

