westonpace commented on pull request #10431:
URL: https://github.com/apache/arrow/pull/10431#issuecomment-892275122


   > Could you please explain this part a little more?
   
   Sure, let me prioritize and group things a bit too.
   
   # Move flatbuffers out of `format`
   
   The file `ScanRequest.fbs` should come out of the `format` directory.  This file is the "Skyhook" portion of the code and does not need to be a part of Arrow.
   
   The `format` directory represents the core Arrow protocols that need to be able to stand alone in any language or on any hardware.  Since `ScanRequest.fbs` is specific to Rados/Skyhook, it can't go here.  I don't think this requirement is very flexible.  Unfortunately, it probably does mean you need to add a new flatbuffers compilation step, or modify the existing one, to build a new directory.
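
   To give an idea of what that step could look like, here is a minimal CMake sketch.  The paths, target names, and output directory are all hypothetical (this is not the existing Arrow build logic), but `flatc --cpp -o <dir> <schema>` is the standard way to generate the C++ headers:

```cmake
# Hypothetical build rule for a Skyhook-specific schema living outside format/.
# Paths and target names are placeholders, not the actual Arrow CMake setup.
set(SKYHOOK_FBS ${CMAKE_CURRENT_SOURCE_DIR}/skyhook/protocol/ScanRequest.fbs)
set(SKYHOOK_FBS_OUT ${CMAKE_CURRENT_BINARY_DIR}/skyhook_generated)

add_custom_command(
  OUTPUT ${SKYHOOK_FBS_OUT}/ScanRequest_generated.h
  COMMAND flatc --cpp -o ${SKYHOOK_FBS_OUT} ${SKYHOOK_FBS}
  DEPENDS ${SKYHOOK_FBS}
  COMMENT "Generating C++ headers from the Skyhook flatbuffers schema")

add_custom_target(skyhook_fbs
  DEPENDS ${SKYHOOK_FBS_OUT}/ScanRequest_generated.h)
```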
   
   # Move `cls` files out of `cpp/src/arrow`
   
   The `cpp/src/arrow` directory (and its subdirectories) creates object files meant to be included by developers building code on top of Arrow.  `cls_arrow.cc` isn't really the same thing.  It is a "standalone artifact" that is deployed on a Ceph server, and it should go in its own directory to make that clear.  I think going one directory up to `cpp/src/ceph` or `cpp/src/skyhook` would be fine.  Although, as these tools grow, we may want to create a `cpp/src/contrib` (and move the items currently in `cpp/src/arrow/adapters` there too).
   
   I suppose there is some precedent with the `orc`/`tensorflow` stuff here, so I could be convinced this isn't a big deal, but I feel it would be cleaner to keep it separate.
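
   For concreteness, one possible layout (directory and file names are only a suggestion, not something the PR has to adopt):

```
cpp/src/skyhook/
  protocol/ScanRequest.fbs   # flatbuffers schema, moved out of format/
  client/                    # librados wrapper and (de)serialization helpers
  cls/cls_arrow.cc           # the object class deployed on the Ceph server
```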
   
   # Splitting code into service/protocol/client/client-adapter/dataset-adapter
   
   For other integrations (S3, Parquet, ORC, CSV, HDFS) the code tends to follow a certain pattern.  If you follow this pattern it will be clearer what the purpose of each of your files is.  This means moving some of your code around (e.g. you currently combine client code and dataset-adapter code in the same file).  This ask is lower priority than the previous ones, but it will help clean things up.
   
   **Service**: This 3rd party code represents the service itself.  This code 
should stand on its own (if it uses Arrow that is an implementation detail and 
could change with no real consequence).  For most integrations we don't 
actually have this code in the Arrow code base.  For example, we don't have the 
code to run an S3 server.  We don't have the code to run an HDFS cluster.  For 
formats like CSV or parquet there is no "service".
   
   For this PR the "service" is the Ceph class.
   
   **Protocol/API**: This 3rd party concept defines how to interact with the 
service.  Again, this should stand on its own and the API should be consumable 
by tools other than Arrow.  It may or may not be compiled into code.  For 
example, with parquet we have `src/parquet/parquet.thrift`.  S3 has an HTTP API 
but we don't interact with it directly.
   
   For this PR this is represented by the Skyhook flatbuffer files.
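
   As a purely illustrative example (these are not the actual contents of `ScanRequest.fbs`), a protocol definition at this layer looks something like the following, and notably has no dependency on Arrow itself:

```
// Hypothetical flatbuffers schema for illustration only.
namespace skyhook;

table ScanRequest {
  object_id: string;      // which object to scan
  projection: [string];   // column names to return
  filter: [ubyte];        // serialized filter expression, format-defined
}

root_type ScanRequest;
```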
   
   **Client**: This 3rd party code makes it easier to interact with the 
protocol.  Again, this stands on its own and if it is using Arrow that is an 
implementation detail.  For S3 we have the AWS S3 API.  For HDFS we have 
libhdfs.  For parquet we have `cpp/src/parquet`.
   
   This is not as clear in the PR but I think it consists of librados, 
`rados.h`, `rados.cc`, and the methods `SerializeScanRequest`, 
`DeserializeScanRequest`, `SerializeTable`, and `DeserializeTable`.
   
   **Client Adapter**: This Arrow code wraps a client interface and completely 
hides any client headers.  The input and output of the client adapter should be 
Arrow types (although it will be converting to client types under the hood).  
For S3 we have `cpp/src/arrow/filesystem/s3fs.h`.  For hdfs we have 
`cpp/src/arrow/filesystem/hdfs.h`.  For Parquet we have `cpp/src/parquet/arrow`.
   
   For this PR I think this consists of `RadosConnection` and 
`SkyhookDirectObjectAccess` although some improvement is needed here because 
these types don't fully hide `ceph::bufferlist`.
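
   As a rough sketch of what "fully hiding" the client types could look like (class and method names here are illustrative, not the ones in the PR), the adapter's public header would expose only Arrow types and keep `ceph::bufferlist` behind a private implementation:

```cpp
// Sketch only: an adapter interface whose inputs and outputs are Arrow types.
// ceph::bufferlist and librados handles live in Impl, defined in the .cc file.
#include <memory>
#include <string>

#include "arrow/result.h"
#include "arrow/table.h"

namespace skyhook {

class RadosScanner {
 public:
  static arrow::Result<std::shared_ptr<RadosScanner>> Connect(
      const std::string& ceph_conf_path, const std::string& pool_name);

  ~RadosScanner();

  // Executes a scan on the remote object and returns the result as an
  // Arrow table; serialization to/from bufferlists happens inside Impl.
  arrow::Result<std::shared_ptr<arrow::Table>> ScanObject(
      const std::string& object_id, const std::string& serialized_request);

 private:
  class Impl;                  // holds librados handles and bufferlists
  std::unique_ptr<Impl> impl_;
};

}  // namespace skyhook
```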
   
   **Dataset Adapter**: This Arrow-Dataset code uses the Client Adapter to 
create an implementation of FileFormat or Fragment to bridge the gap between 
the datasets API and the client service.
   
   For this PR this consists of `SkyhookFileFormat`.
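
   The shape of that layer is roughly the following (the exact virtual methods and signatures depend on the Arrow version you build against, so treat this as a sketch rather than the PR's actual class):

```cpp
// Sketch only: the dataset adapter delegates to the client adapter instead
// of reading bytes locally.  Remaining pure-virtual overrides (IsSupported,
// the scan entry point, etc.) are omitted here.
#include "arrow/dataset/file_base.h"

namespace skyhook {

class SkyhookFileFormat : public arrow::dataset::FileFormat {
 public:
  std::string type_name() const override { return "skyhook"; }

  // Ask the Ceph object class for the schema rather than opening the file.
  arrow::Result<std::shared_ptr<arrow::Schema>> Inspect(
      const arrow::dataset::FileSource& source) const override;
};

}  // namespace skyhook
```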
   
   # Splitting the PR into multiple PRs
   
   This is completely optional but I personally think it will be faster and 
easier to break things into smaller pieces.
   
   I'm flexible on most things, so let me know if any of this seems like it will be a problem or is overly strict.

