[GitHub] [arrow] JayjeetAtGithub commented on pull request #10431: ARROW-12921: [C++][Dataset] Add RadosParquetFileFormat to Dataset API

GitBox Mon, 02 Aug 2021 21:58:22 -0700


JayjeetAtGithub commented on pull request #10431:
URL: https://github.com/apache/arrow/pull/10431#issuecomment-891525546



   Thanks @westonpace for sharing your thoughts.
   
   > So here is my current understanding. Let me know if this seems off. There 
are two pieces to this.
   > 
   > There is a ceph object class (called Skyhook?) which processes scan tasks 
and lives in a "contrib" directory.
   > 
   > There is a fragment / file format for Arrow that understands how to send 
scan requests to a ceph storage server in the skyhook format.
   > 
   That is correct.
   
   > These two components aren't tightly coupled. The only source of agreement 
is the Arrow columnar format and this flatbuffers file. So for example (these 
are thought exercises, not things that will necessarily ever happen):
   > 
   > * Ceph could be running an older version of Skyhook built with Arrow 
version X and the dataset client could be running a newer version of Arrow 
version X+N.
   
   Yeah, this could happen. In this case, we need to ensure that the storage 
side understands the `ScanRequest` language in which the client sends requests 
and can also serialize the tables in a buffer format understandable/supported 
by the client.
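
   To make the version-skew case concrete, here is a minimal sketch of a server 
rejecting requests it cannot decode. It is only an illustration: the envelope 
layout, `encode_request`, `decode_request`, and `SUPPORTED_VERSIONS` are all 
hypothetical stand-ins and not the actual flatbuffers schema used in this PR.

```python
import struct

# Hypothetical envelope (an assumption, NOT the real ScanRequest schema):
# 2-byte format version, 4-byte payload length, then the payload bytes.
HEADER = struct.Struct("<HI")

# Versions this hypothetical storage-side decoder understands.
SUPPORTED_VERSIONS = {1, 2}


def encode_request(version: int, payload: bytes) -> bytes:
    """Client side: prefix the payload with its format version and length."""
    return HEADER.pack(version, len(payload)) + payload


def decode_request(message: bytes) -> bytes:
    """Storage side: return the payload, or fail loudly on a mismatch."""
    if len(message) < HEADER.size:
        raise ValueError("request is shorter than the header")
    version, length = HEADER.unpack_from(message)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported request version {version}")
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated request payload")
    return payload
```

   An older server would simply have a smaller `SUPPORTED_VERSIONS` set, so a 
newer client gets an explicit error instead of a silently misread request.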
   
   > * Skyhook could switch to some other library entirely in the future and as 
long as it continued to respect the flatbuffers format it would continue to 
work.
   
   Same as above, I guess.
   
   > * A different non-arrow library (or an Arrow implementation in a different 
language) could decide to start sending requests to Skyhook and as long as they 
agreed upon the flatbuffers and arrow columnar format everything would continue 
to work.
   
   Yes, as long as both the client and server agree upon the same send and 
receive protocol, they should work fine.
   
   > Given the above I think the proper place for this flatbuffers file to live 
is in the same directory as the ceph object class. This flatbuffers file is the 
API for skyhook.
   
   I agree. This is the send API for skyhook.
   
   > Then, for building everything, the make files for that directory could 
produce two artifacts: A ceph object class and a small C++ "client library" 
which is just the output of the flatbuffers compiler.
   > Or you could skip the "client library" step and add an extra build step 
for the datasets module which runs the flatbuffers compiler.
   
   Could you please explain this part a little more?


