bkietz commented on issue #40583: URL: https://github.com/apache/arrow/issues/40583#issuecomment-2008193391
> I believe that the Skyhook file format (as implemented) is a contract that when files are written in a given format (arrow IPC) using a standard posix filesystem, they can be read using a different interface (arrow dataset).

The problem is that the contract of skyhook as written is not the contract of a file format. File formats are intended to be an orthogonal detail to I/O. It would be possible to write a subclass of `arrow::Array` which is mutable, but although that is possible the broken contract of immutability will severely restrict its usage in the arrow library.

> To rephrase your proposal, I believe you're suggesting we implement a custom operator (SkyhookSourceNode) to facilitate data flow between the client and server over the network. The input to the custom operator is executed by Skyhook and the results are handled by the client node.

Yes, SkyhookSourceNode is a client side node which proxies the server side plan in the client side plan.

> (every FileFragment is an independent access to a potentially distinct ceph server)

I would recommend instead having a 1:1 relationship between SkyhookSourceNodes and ceph servers. If multiple ceph servers are in play, a UnionNode can be used in the client side plan to concatenate their streams.

> would you know how Acero serializes custom operators into substrait?

The design above does not require serializing custom nodes, since it only requires serialization of the server side plan. The SkyhookSourceNode will only appear in the client side plan.

> I also want to note that pushing any more operators below the SkyhookSourceNode would be done blindly without some type of optimizer

Optimization and other restructuring of plans is not currently in scope for acero. Instead the exact plan specified is what will be executed.
The ~~roadmap~~ hope is that acero's substrait support will continue to improve and that cross-engine plan optimizers will be written against substrait.
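
As a rough illustration of the topology described above, here is a toy Python model. It is not Acero code. The `SkyhookSourceNode` class, the `union` helper, the `keep_gt:N` plan encoding, and the in-memory `FAKE_SERVERS` are all hypothetical stand-ins. The sketch only shows the shape of the design: one client side source node per ceph server, each forwarding an opaque serialized server side plan, with a union concatenating the resulting streams.

```python
from dataclasses import dataclass
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, List

Batch = List[int]  # stand-in for a record batch

# Hypothetical per-server data; a real deployment would hold this on ceph.
FAKE_SERVERS: Dict[str, List[Batch]] = {
    "ceph-a": [[1, 2, 3], [4]],
    "ceph-b": [[5, 0]],
}


def fake_execute(server: str, plan: bytes) -> Iterator[Batch]:
    """Stand-in for the network round trip: the 'server' runs the plan locally."""
    # Assumed toy plan encoding: b"keep_gt:N" keeps values greater than N.
    threshold = int(plan.decode().split(":")[1])
    for batch in FAKE_SERVERS[server]:
        yield [v for v in batch if v > threshold]


@dataclass
class SkyhookSourceNode:
    """Hypothetical client side node that proxies one server side plan."""
    server: str
    serialized_plan: bytes  # only the server side plan is ever serialized
    execute_remote: Callable[[str, bytes], Iterable[Batch]]

    def batches(self) -> Iterator[Batch]:
        # The client never interprets the plan; it forwards it and streams results.
        yield from self.execute_remote(self.server, self.serialized_plan)


def union(sources: Iterable[SkyhookSourceNode]) -> Iterator[Batch]:
    """Toy union: concatenate the streams of all source nodes."""
    return chain.from_iterable(s.batches() for s in sources)


plan = b"keep_gt:2"
# 1:1 relationship: exactly one source node per server.
sources = [SkyhookSourceNode(name, plan, fake_execute) for name in FAKE_SERVERS]
result = list(union(sources))
print(result)
```

Note that the source node itself is never serialized here, only the plan bytes it carries. The output order in this toy simply follows the order of the source list and should not be read as a statement about how a real union node orders batches.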
