bkietz commented on issue #40583:
URL: https://github.com/apache/arrow/issues/40583#issuecomment-2009541947

   > If SkyhookSourceNode is a client-only node, then it seems like you're proposing execution of 2 independent ExecPlans
   
   Precisely: a server-side plan reads files, performs the pushed-down 
compute, and finally pushes batches to the client, while a client-side plan 
receives these batches and performs further computation on them. This is 
analogous to the current structure of Skyhook, which uses Arrow datasets on the 
[server side](https://github.com/bkietz/arrow/blob/46758bc3c6321d8b7013acf52bff7761a8b33eda/cpp/src/skyhook/cls/cls_skyhook.cc#L153)
 to scan/filter/project/transmit a Table to the client, and on the 
[client side](https://github.com/bkietz/arrow/blob/46758bc3c6321d8b7013acf52bff7761a8b33eda/cpp/src/skyhook/client/file_skyhook.h#L58)
 to act as a data source.
   
   For example, the server-side plan might read a set of Parquet files with a 
pushed-down filter and perform aggregation, while the client-side plan includes 
a UnionNode that merges batches from the Skyhook server with batches from a 
local dataset and collects them into a single Table:
   
   ```
   # server side:
   ScanNode(nyc-taxi/*.parquet)
     -> FilterNode(year==2016)
       -> AggregateNode(cost ON tag)
         -> SinkNode(push ceph::bufferlist to client)
   
   # client side:
   ScanNode(local-addenda.parquet) ---------------------------v
   SkyhookSourceNode(receive ceph::bufferlist from server) -> UnionNode -> TableSinkNode
   ```
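
   To make the division of labor concrete, here is a minimal toy model of the 
two plans in plain Python. It is a sketch only, not the Acero or Skyhook API: 
a "batch" is a list of row dicts, JSON bytes stand in for the 
`ceph::bufferlist` payload, and every function name (`server_plan`, 
`skyhook_source`, `local_source`, `client_plan`) is hypothetical.

   ```python
   import json
   from collections import defaultdict
   from itertools import chain

   # Toy stand-ins only: a "batch" is a list of row dicts, and JSON bytes
   # play the role of the ceph::bufferlist payload. Not the Acero API.

   def server_plan(rows, year):
       """Scan -> filter -> aggregate -> sink; yields serialized batches."""
       totals = defaultdict(float)
       for row in rows:                           # ScanNode
           if row["year"] == year:                # FilterNode(year == 2016)
               totals[row["tag"]] += row["cost"]  # AggregateNode(cost ON tag)
       batch = [{"tag": t, "cost": c} for t, c in sorted(totals.items())]
       yield json.dumps(batch).encode()           # SinkNode: bytes to client

   def skyhook_source(payloads):
       """Client-side source: decode each received payload into a batch."""
       for payload in payloads:
           yield json.loads(payload.decode())

   def local_source(rows):
       """Client-side source over local data, emitted as a single batch."""
       yield list(rows)

   def client_plan(*sources):
       """UnionNode -> TableSinkNode: gather batches from all inputs.

       chain() drains the inputs one after another; a real union node
       would interleave batches as they arrive.
       """
       table = []
       for batch in chain(*sources):
           table.extend(batch)
       return table

   server_rows = [
       {"year": 2016, "tag": "a", "cost": 1.5},
       {"year": 2016, "tag": "a", "cost": 2.5},
       {"year": 2015, "tag": "a", "cost": 9.0},
       {"year": 2016, "tag": "b", "cost": 4.0},
   ]
   local_rows = [{"tag": "c", "cost": 7.0}]

   table = client_plan(
       local_source(local_rows),
       skyhook_source(server_plan(server_rows, 2016)),
   )
   ```

   The point of the sketch is that the two plans share nothing but the byte 
payloads: the server-side generator never sees the union, and the client-side 
source only decodes what it receives.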
   
   In this example, the client-side plan includes a SkyhookSourceNode to 
facilitate network communication by forwarding batches from the server into the 
UnionNode. As another example, the client-side plan could union streams from 
three different Skyhook servers (since we have UnionNode, there's no need to 
complicate SkyhookSourceNode by forcing it to deal with multiple servers):
   
   ```
   # client side:
   SkyhookSourceNode(receive ceph::bufferlist from server A) ---v
   SkyhookSourceNode(receive ceph::bufferlist from server B) -> UnionNode -> TableSinkNode
   SkyhookSourceNode(receive ceph::bufferlist from server C) ---^
   ```
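
   The same idea can be sketched for the multi-server case, again as a toy 
model in plain Python rather than real Acero code. Each source handles exactly 
one server, and a separate union step merges them; the names `skyhook_source`, 
`union`, and `encode` are hypothetical, and JSON bytes again stand in for 
`ceph::bufferlist`.

   ```python
   import json

   def encode(batch):
       """Stand-in for serializing a batch into a ceph::bufferlist."""
       return json.dumps(batch).encode()

   def skyhook_source(payloads):
       """One source per server: decode that server's payloads only."""
       for payload in payloads:
           yield json.loads(payload.decode())

   def union(*sources):
       """Round-robin over the inputs until every one is exhausted."""
       pending = [iter(s) for s in sources]
       while pending:
           still_open = []
           for source in pending:
               try:
                   yield next(source)
               except StopIteration:
                   continue          # this input is finished; drop it
               still_open.append(source)
           pending = still_open

   # Hypothetical payloads from three servers; server C sends nothing.
   server_a = [encode([{"tag": "a", "cost": 1.0}]),
               encode([{"tag": "a", "cost": 2.0}])]
   server_b = [encode([{"tag": "b", "cost": 3.0}])]
   server_c = []

   sources = [skyhook_source(p) for p in (server_a, server_b, server_c)]
   table = [row for batch in union(*sources) for row in batch]
   ```

   Because the merging logic lives entirely in `union`, adding a fourth server 
means adding one more source and leaves the per-server code untouched.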


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
