bkietz commented on issue #40583: URL: https://github.com/apache/arrow/issues/40583#issuecomment-2009541947
> If SkyhookSourceNode is a client-only node, then it seems like you're proposing execution of 2 independent ExecPlans

Precisely; a server side plan which reads files and performs pushed-down compute, finally pushing batches to the client, and a client side plan which receives these batches and performs other computation on them. This is analogous to the current structure of Skyhook, which uses arrow datasets on the [server side](https://github.com/bkietz/arrow/blob/46758bc3c6321d8b7013acf52bff7761a8b33eda/cpp/src/skyhook/cls/cls_skyhook.cc#L153) to scan/filter/project/transmit a Table to the client, and on the [client side](https://github.com/bkietz/arrow/blob/46758bc3c6321d8b7013acf52bff7761a8b33eda/cpp/src/skyhook/client/file_skyhook.h#L58) to act as a data source.

For example: the server side plan might read a set of parquet files with a pushed down filter and perform aggregation, while the client side plan includes a UnionNode and collects batches from the skyhook server and from a local dataset, collecting into a single Table:

```
# server side:
ScanNode(nyc-taxi/*.parquet) -> FilterNode(year==2016) -> AggregateNode(cost ON tag) -> SinkNode(push ceph::bufferlist to client)

# client side:
ScanNode(local-addenda.parquet) ---------------------------v
SkyhookSourceNode(receive ceph::bufferlist from server) -> UnionNode -> TableSinkNode
```

In this example, the client side plan includes a SkyhookSourceNode to facilitate network communication by forwarding batches from the server into the UnionNode.
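To make the two-plan split concrete, here is a rough, stdlib-only Python sketch of the example above. It is a toy model, not Arrow or Skyhook API: a "batch" is a list of row dicts, JSON bytes stand in for the `ceph::bufferlist` payload, and every function name is hypothetical. It also assumes that `AggregateNode(cost ON tag)` means "sum `cost` grouped by `tag`", which the diagram does not spell out.

```python
import json
from collections import defaultdict
from itertools import chain

# Toy stand-ins only: batches are lists of row dicts, and JSON bytes play
# the role of the ceph::bufferlist sent from server to client.


def scan(batches):
    """ScanNode: yield batches from a source."""
    yield from batches


def filter_node(batches, predicate):
    """FilterNode: keep only the rows that satisfy the predicate."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def aggregate_node(batches, key, value):
    """AggregateNode (assumed to be a grouped sum): emit one result batch."""
    totals = defaultdict(float)
    for batch in batches:
        for row in batch:
            totals[row[key]] += row[value]
    yield [{key: k, value: v} for k, v in sorted(totals.items())]


def server_sink(batches):
    """SinkNode: serialize each batch for transmission to the client."""
    for batch in batches:
        yield json.dumps(batch).encode("utf-8")


def skyhook_source(payloads):
    """Hypothetical SkyhookSourceNode: decode received payloads into batches."""
    for payload in payloads:
        yield json.loads(payload.decode("utf-8"))


def union_node(*inputs):
    """UnionNode: forward the batches of every input."""
    return chain(*inputs)


def table_sink(batches):
    """TableSinkNode: collect all rows into one table (a list of rows)."""
    return [row for batch in batches for row in batch]


def run_server_plan(files):
    """Server plan: scan -> filter(year == 2016) -> aggregate -> sink."""
    plan = scan(files)
    plan = filter_node(plan, lambda row: row["year"] == 2016)
    plan = aggregate_node(plan, key="tag", value="cost")
    return list(server_sink(plan))


def run_client_plan(wire, local_batches):
    """Client plan: union of a local scan and the skyhook source -> table."""
    return table_sink(union_node(scan(local_batches), skyhook_source(wire)))
```

The two `run_*_plan` functions share nothing except the serialized payloads, which is the point of the design: the client plan only sees a source of batches and does not need to know what compute was pushed down.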
For another example, the client side plan could union streams from three different skyhook servers (since we have UnionNode, there's no need to complicate SkyhookSourceNode by forcing it to deal with multiple servers):

```
# client side:
SkyhookSourceNode(receive ceph::bufferlist from server A) ---v
SkyhookSourceNode(receive ceph::bufferlist from server B) -> UnionNode -> TableSinkNode
SkyhookSourceNode(receive ceph::bufferlist from server C) ---^
```
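The three-server client plan can be modelled the same way. The following is a minimal sketch, assuming each source is just an iterator of batches bound to a single server; `union_streams` and `fake_skyhook_source` are hypothetical names, and threads with a queue stand in for whatever transport and scheduling a real implementation would use. Each source knows about exactly one server, and the union is the only piece that deals with multiple inputs.

```python
import queue
import threading

_DONE = object()  # sentinel: one input stream has finished


def union_streams(sources):
    """Forward batches from every source as they arrive; stop once all end.

    Order across sources is not deterministic in this sketch; order within
    a single source is preserved.
    """
    out = queue.Queue()

    def pump(source):
        # Drain one source into the shared queue. Error propagation is
        # omitted here; a real node would have to report failures downstream.
        try:
            for batch in source:
                out.put(batch)
        finally:
            out.put(_DONE)

    threads = [
        threading.Thread(target=pump, args=(source,), daemon=True)
        for source in sources
    ]
    for thread in threads:
        thread.start()

    remaining = len(threads)
    while remaining:
        item = out.get()
        if item is _DONE:
            remaining -= 1
        else:
            yield item

    for thread in threads:
        thread.join()


def fake_skyhook_source(server, n_batches):
    """Hypothetical single-server source: batches tagged with their origin."""
    for i in range(n_batches):
        yield {"server": server, "batch": i}


def collect(batches):
    """Stand-in for TableSinkNode: gather everything into a list."""
    return list(batches)


if __name__ == "__main__":
    sources = [fake_skyhook_source(name, 3) for name in "ABC"]
    for row in collect(union_streams(sources)):
        print(row)
```

Adding a fourth server is then one more entry in `sources`, with no change to the source itself.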
