westonpace commented on pull request #10431: URL: https://github.com/apache/arrow/pull/10431#issuecomment-892275122
> Could you please explain this part a little more? Sure, let me prioritize and group things a bit too. # Move flatbuffers out of `format` The file `ScanRequest.fbs` should come out of the `format` directory. This file is the "Skyhook" portion of the code and do not need to be a part of Arrow. The `format` directory represents the core Arrow protocols that need to be able to stand alone in any language or on any hardware. Since `ScanRequest.fbs` is specific to Rados/Skyhook it can't go here. I don't think this requirement is very flexible. Unfortunately, it does probably mean you need to add a new flatbuffers compilation step or modify the existing one to build a new directory. # Move `cls` files out of `cpp/src/arrow` The `cpp/src/arrow` directory (and its subdirectories) create object files meant to be included by developers developing code based on Arrow. `cls_arrow.cc` isn't really the same thing. It is a "standalone artifact" that is deployed on a Ceph server. It should go in its own directory to make it clear. I think going one directory up to `cpp/src/ceph` or `cpp/src/skyhook` would be fine. Although, as these tools grow we may want to create a `cpp/src/contrib` (and move the other items `cpp/src/arrow/adapters` here too). I suppose there is some precedence with the `orc`/`tensorflow` stuff here so I could be convinced this isn't a big deal but I feel it would be cleaner to keep it separate. # Splitting code into service/protocol/client/client-adapter/dataset-adapter For other integrations (s3, parquet, orc, csv, hdfs) the code tends to follow a certain pattern. If you follow this pattern it will be clearer what the purpose of each of your files is. This means moving some of your code around (e.g. you combine client code and dataset-adapter code in the same file). This ask is lower priority than the previous asks but it will help clean things up. **Service**: This 3rd party code represents the service itself. This code should stand on its own (if it uses Arrow that is an implementation detail and could change with no real consequence). For most integrations we don't actually have this code in the Arrow code base. For example, we don't have the code to run an S3 server. We don't have the code to run an HDFS cluster. For formats like CSV or parquet there is no "service". For this PR the "service" is the Ceph class. **Protocol/API**: This 3rd party concept defines how to interact with the service. Again, this should stand on its own and the API should be consumable by tools other than Arrow. It may or may not be compiled into code. For example, with parquet we have `src/parquet/parquet.thrift`. S3 has an HTTP API but we don't interact with it directly. For this PR this is represented by the Skyhook flatbuffer files. **Client**: This 3rd party code makes it easier to interact with the protocol. Again, this stands on its own and if it is using Arrow that is an implementation detail. For S3 we have the AWS S3 API. For HDFS we have libhdfs. For parquet we have `cpp/src/parquet`. This is not as clear in the PR but I think it consists of librados, `rados.h`, `rados.cc`, and the methods `SerializeScanRequest`, `DeserializeScanRequest`, `SerializeTable`, and `DeserializeTable`. **Client Adapter**: This Arrow code wraps a client interface and completely hides any client headers. The input and output of the client adapter should be Arrow types (although it will be converting to client types under the hood). For S3 we have `cpp/src/arrow/filesystem/s3fs.h`. For hdfs we have `cpp/src/arrow/filesystem/hdfs.h`. For Parquet we have `cpp/src/parquet/arrow`. For this PR I think this consists of `RadosConnection` and `SkyhookDirectObjectAccess` although some improvement is needed here because these types don't fully hide `ceph::bufferlist`. **Dataset Adapter**: This Arrow-Dataset code uses the Client Adapter to create an implementation of FileFormat or Fragment to bridge the gap between the datasets API and the client service. For this PR this consists of `SkyhookFileFormat`. # Splitting the PR into multiple PRs This is completely optional but I personally think it will be faster and easier to break things into smaller pieces. I'm flexible on most things so let me know if any of this seems like it will be a problem or is overly strict. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
