drin commented on issue #40583:
URL: https://github.com/apache/arrow/issues/40583#issuecomment-2007908817

   Wow, thanks so much for the write-up: it is well written and gives me a clear idea of the direction you're proposing and of where the drawbacks of the Skyhook file format lie.
   
   I want to clarify (or maybe push back?) on a few things, but overall I see a path forward with minimal changes to your suggestions.
   
   > The basic features of skyhook are a configurable server side scan/compute, 
and efficient transport of resulting buffers to the client. This does not 
correspond to a file format; file formats compartmentalize reading data files 
independent of I/O.
   
   I believe that the Skyhook file format (as implemented) is a contract: when files are written in a given format (Arrow IPC) through a standard POSIX filesystem, they can be read through a different interface (the Arrow dataset API). Because two separate I/O interfaces are used, the file format accommodates that in the contract. It is akin to saying that if you write files in a particular way, you can process the blocks of those files without changing how you write them.
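
   To make that contract concrete, here's a rough, untested sketch of the two sides. The `skyhook::RadosConnCtx` fields, the `SkyhookFileFormat::Make` signature, the `"ipc"` format tag, and the header path are written from memory, so treat them as approximate rather than authoritative:

```cpp
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/io/file.h>
#include <arrow/ipc/writer.h>
#include <skyhook/client/file_skyhook.h>

// Write side: a plain Arrow IPC file, written through the POSIX interface
// (e.g. a CephFS mount). Nothing Skyhook-specific happens here.
arrow::Status WriteIpcFile(const std::shared_ptr<arrow::Table>& table,
                           const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(sink, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  return writer->Close();
}

// Read side: the same bytes, scanned through the dataset API with the Skyhook
// file format so that the scan executes inside the Ceph object class.
arrow::Result<std::shared_ptr<arrow::Table>> ScanViaSkyhook(
    const std::string& dir) {
  // Example connection values only; these depend on the Ceph deployment.
  auto rados_ctx = std::make_shared<skyhook::RadosConnCtx>(
      /*ceph_config_path=*/"/etc/ceph/ceph.conf",
      /*ceph_data_pool=*/"cephfs_data",
      /*ceph_user_name=*/"client.admin",
      /*ceph_cluster_name=*/"ceph",
      /*ceph_cls_name=*/"arrow");
  ARROW_ASSIGN_OR_RAISE(auto format,
                        skyhook::SkyhookFileFormat::Make(rados_ctx, "ipc"));

  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  arrow::fs::FileSelector selector;
  selector.base_dir = dir;

  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      arrow::dataset::FileSystemDatasetFactory::Make(
          fs, selector, format, arrow::dataset::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}
```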
   
   > Writing skyhook as a file format therefore breaks conventions and 
contracts relied on by the dataset API.
   
   Yes, this part I understand. I would just add that this is because Ceph (which Skyhook extends) has a broader definition of what a "file" is in order to map it onto object storage. Any system that shims a filesystem interface over the actual storage interface is going to have a similar impedance mismatch (e.g. s3fs or anything like it).
   
   All that being said, I agree that nearly all of the changes you've proposed are good; the only one I think we won't do is the grouping of files (unless we find a way to shim a "special directory" as a grouping, but that sounds hacky).
   
   To rephrase your proposal, I believe you're suggesting we implement a custom operator (`SkyhookSourceNode`) to facilitate data flow between the client and server over the network. The input to the custom operator is executed server-side by Skyhook, and the results are handled by the client-side node.
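
   Concretely (and this is just a sketch of the shape, not a design), I'd expect the client-side plan to be declared roughly like the following. Everything named `Skyhook*` and the `"skyhook_source"` factory name are hypothetical placeholders for the custom operator; only the surrounding Acero `Declaration` usage is the existing API:

```cpp
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/compute/expression.h>

namespace ac = arrow::acero;
namespace cp = arrow::compute;

// Hypothetical options describing what the server-side (Skyhook) part of the
// plan should execute: which object to scan, plus the pushed-down filter and
// projection that the Ceph object class evaluates before shipping buffers.
struct SkyhookSourceNodeOptions : public ac::ExecNodeOptions {
  std::string object_id;             // object backing one FileFragment
  cp::Expression filter;             // evaluated server-side
  std::vector<std::string> columns;  // projection applied server-side
};

ac::Declaration MakeClientPlan(SkyhookSourceNodeOptions opts) {
  // "skyhook_source" would be the factory name registered for the custom
  // operator; it streams the server-side results into the client plan.
  ac::Declaration source{"skyhook_source", std::move(opts)};

  // Everything downstream of the source runs on the client as normal Acero
  // nodes; the projection here is just a stand-in for client-side work.
  return ac::Declaration::Sequence(
      {std::move(source),
       {"project", ac::ProjectNodeOptions({cp::field_ref("value")})}});
}
```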
   
   I think the catalog can still be resolved on the client side, and the custom operator can be used across many connections (every FileFragment is an independent access to a potentially distinct Ceph server); this would only require very simple ExecPlan transformations.
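
   The fan-out I'm picturing is then little more than a mechanical Declaration rewrite, roughly as below (untested, reusing the hypothetical `SkyhookSourceNodeOptions` and `"skyhook_source"` from the sketch above; whether the stock `"union"` node is the right merge point is an open question):

```cpp
// One hypothetical "skyhook_source" per FileFragment, merged on the client
// with Acero's union node. Each source carries the connection/object details
// for the (possibly distinct) Ceph server that owns its fragment.
ac::Declaration MakeFanOutPlan(std::vector<SkyhookSourceNodeOptions> fragments) {
  std::vector<ac::Declaration::Input> inputs;
  inputs.reserve(fragments.size());
  for (auto& frag_opts : fragments) {
    inputs.emplace_back(ac::Declaration{"skyhook_source", std::move(frag_opts)});
  }
  // The union node and everything downstream of it run on the client.
  return ac::Declaration{"union", std::move(inputs),
                         std::make_shared<ac::ExecNodeOptions>(),
                         /*label=*/"client_side_union"};
}
```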

