drin commented on issue #40583: URL: https://github.com/apache/arrow/issues/40583#issuecomment-2007908817
Wow, thanks so much for the write-up: it is well written and gives me a clear idea of the direction you're proposing and where the drawbacks of the Skyhook file format are. I want to clarify (or maybe push back on?) a few things, but overall I see a path forward with minimal changes to your suggestions.

> The basic features of skyhook are a configurable server side scan/compute, and efficient transport of resulting buffers to the client. This does not correspond to a file format; file formats compartmentalize reading data files independent of I/O.

I believe that the Skyhook file format (as implemented) is a contract: when files are written in a given format (Arrow IPC) using a standard POSIX filesystem, they can be read using a different interface (Arrow Dataset). Because two separate I/O interfaces are used, the file format accommodates that in the contract. It is akin to saying that if you write files in a particular way, you can process the blocks of the file without changing how you write them.

> Writing skyhook as a file format therefore breaks conventions and contracts relied on by the dataset API.

Yes, this part I understand. I would just add that this is because Ceph (which Skyhook is an extension for) has a broader definition of what a "file" is in order to map it to object storage. Any system that shims a filesystem interface over the actual storage interface is going to have a similar impedance mismatch (e.g. s3fs or anything like it).

All that being said, I agree that nearly all of the changes you've proposed are good; the only one I think we won't do is grouping of files (unless we find a way to shim a "special directory" as a grouping, but that sounds hacky).

To rephrase your proposal: I believe you're suggesting we implement a custom operator (`SkyhookSourceNode`) to facilitate data flow between the client and server over the network. The input to the custom operator is executed by Skyhook, and the results are handled by the client node.
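To make the proposed contract concrete, here is a minimal Python sketch of that data flow: the server side executes the configured scan (filter + projection), and only the resulting buffers travel back to the client operator. Plain Python dicts stand in for Arrow record batches, and none of these names (`SkyhookServer`, `SkyhookSourceNode.batches`) are real Arrow or Skyhook APIs; they only illustrate the shape of the proposal.

```python
class SkyhookServer:
    """Stand-in for the Ceph object class that runs the scan server-side."""

    def __init__(self, rows):
        self._rows = rows

    def scan(self, predicate, columns):
        # Filter and project before anything crosses the network, so only
        # result buffers are transported to the client.
        return [{c: r[c] for c in columns}
                for r in self._rows if predicate(r)]


class SkyhookSourceNode:
    """Client-side source operator: its input is executed by Skyhook,
    and it yields only the result batches to the rest of the plan."""

    def __init__(self, server, predicate, columns):
        self._server = server
        self._predicate = predicate
        self._columns = columns

    def batches(self):
        yield from self._server.scan(self._predicate, self._columns)


server = SkyhookServer([
    {"id": 1, "value": 10, "tag": "a"},
    {"id": 2, "value": 20, "tag": "b"},
    {"id": 3, "value": 30, "tag": "a"},
])
node = SkyhookSourceNode(server, lambda r: r["tag"] == "a", ["id", "value"])
results = list(node.batches())
# Only the filtered, projected rows ever reach the client.
```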
I think the catalog can still be resolved on the client side, and the custom operator can be used across many connections (every FileFragment is an independent access to a potentially distinct Ceph server); this would only require very simple ExecPlan transformations.
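A sketch of that plan transformation, under the same caveat that every name here is illustrative rather than the Arrow API: the catalog is resolved locally, and the single logical scan is rewritten into one source node per FileFragment, each holding its own (potentially distinct) server address.

```python
from dataclasses import dataclass


@dataclass
class FileFragment:
    path: str
    server: str  # address of the Ceph node holding this object


@dataclass
class SkyhookSourceNode:
    fragment: FileFragment
    # in a real plan, the pushed-down scan expression would live here too


def resolve_catalog(dataset_root):
    # Stand-in for client-side catalog resolution: list the fragments
    # and where each one lives (hypothetical paths and addresses).
    return [
        FileFragment(f"{dataset_root}/part-0.arrow", "ceph-node-a"),
        FileFragment(f"{dataset_root}/part-1.arrow", "ceph-node-b"),
    ]


def rewrite_scan(dataset_root):
    # The only ExecPlan transformation needed: replace the ordinary scan
    # with one SkyhookSourceNode per fragment, each an independent
    # connection to its server.
    return [SkyhookSourceNode(f) for f in resolve_catalog(dataset_root)]


nodes = rewrite_scan("skyhook-dataset")
```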
