westonpace commented on a change in pull request #10431:
URL: https://github.com/apache/arrow/pull/10431#discussion_r677894587



##########
File path: ci/scripts/cpp_build.sh
##########
@@ -85,6 +85,8 @@ cmake -G "${CMAKE_GENERATOR:-Ninja}" \
       -DARROW_PYTHON=${ARROW_PYTHON:-OFF} \
       -DARROW_RUNTIME_SIMD_LEVEL=${ARROW_RUNTIME_SIMD_LEVEL:-MAX} \
       -DARROW_S3=${ARROW_S3:-OFF} \
+      -DARROW_RADOS=${ARROW_RADOS:-OFF} \
+      -DARROW_CLS=${ARROW_CLS:-OFF} \

Review comment:
       Can we name this `ARROW_CEPH_CLS` or `ARROW_CEPH`?

##########
File path: cpp/src/arrow/adapters/arrow-rados-cls/README.md
##########
@@ -0,0 +1,78 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+# <img src="https://iris-hep.org/assets/logos/skyhookdmLogoJeff.png"; 
width="64" valign="middle" alt="Skyhook"/> SkyhookDM-Arrow
+
+Apache Arrow provides a 
[`Dataset`](https://arrow.apache.org/docs/cpp/api/dataset.html) API, which acts 
as an abstraction over a collection of files in different storage backends like 
S3 and HDFS. It supports different file formats like CSV and Parquet through 
the 
[`FileFormat`](https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset10FileFormatE)
 API. In SkyhookDM, we create a new file format called `RadosParquetFileFormat` 
on top of `ParquetFileFormat`, which besides providing all the features of 
Parquet allows offloading file fragment scan operations into the storage 
backend. Offloading scan operations increases the query performance many folds, 
provides better scalability, and results in less network traffic. The 
architecture of SkyhookDM is described [here](./docs/architecture.md).

Review comment:
       I think we should probably avoid references to Skyhook from within 
Arrow.  I think that will raise confusion.  Could we describe the architecture 
with something more like "Arrow Ceph object class" terminology?

##########
File path: cpp/src/arrow/dataset/rados.cc
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       This file (and its header) should maybe be in `src/arrow/util` or 
`src/arrow/filesystem`.

##########
File path: format/ScanRequest.fbs
##########
@@ -0,0 +1,33 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       This directory includes files that are meant to reflect the column 
format (e.g. types that all implementations are standardized on).  I'm not sure 
if your intention is to make the scan request a part of the specification or 
not.
   
   However, I think this was probably placed here so it would get picked up by 
thrift.  I'm not sure what a good approach to take would be and I'll defer to 
future reviewers here.

##########
File path: cpp/src/arrow/dataset/file_rados_parquet.h
##########
@@ -0,0 +1,182 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This API is EXPERIMENTAL.
+#define _FILE_OFFSET_BITS 64
+
+#pragma once
+
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <functional>
+#include <memory>
+#include <sstream>
+#include <string>
+#include <utility>
+#include <vector>
+
+#include "arrow/api.h"
+#include "arrow/compute/exec/expression.h"
+#include "arrow/dataset/dataset.h"
+#include "arrow/dataset/discovery.h"
+#include "arrow/dataset/file_base.h"
+#include "arrow/dataset/file_parquet.h"
+#include "arrow/dataset/rados.h"
+#include "arrow/dataset/scanner.h"
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/dataset/visibility.h"
+#include "arrow/filesystem/api.h"
+#include "arrow/filesystem/path_util.h"
+#include "arrow/io/api.h"
+#include "arrow/ipc/api.h"
+#include "arrow/util/iterator.h"
+#include "arrow/util/macros.h"
+
+#include "parquet/arrow/writer.h"
+#include "parquet/exception.h"
+
+namespace arrow {
+namespace dataset {
+
+class ARROW_DS_EXPORT RadosCluster {
+ public:
+  struct RadosConnectionCtx {
+    std::string ceph_config_path;
+    std::string data_pool;
+    std::string user_name;
+    std::string cluster_name;
+    std::string cls_name;
+  };
+  explicit RadosCluster(RadosConnectionCtx& ctx)
+      : ctx(ctx), rados(new RadosWrapper()), ioCtx(new IoCtxWrapper()) {}
+
+  ~RadosCluster() { Shutdown(); }
+
+  Status Connect() {
+    if (rados->init2(ctx.user_name.c_str(), ctx.cluster_name.c_str(), 0))
+      return Status::Invalid("librados::init2 returned non-zero exit code.");

Review comment:
       It might be simpler if `RadosInterface` returned `Status` instead of 
`int`.  For example, the compression_xyz.cc files convert from libxyz to 
Status.  You could also follow the pattern used in hdfs where `hdfs_internal.h` 
returns int error codes but then it is hidden by `hdfs.h` which exposes 
everything with `Status`/`Result`.

##########
File path: cpp/src/arrow/dataset/rados.h
##########
@@ -0,0 +1,132 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <rados/librados.hpp>

Review comment:
       Ideally this would be included from the `.cc` file any any instance to 
`ceph::bufferlist` would be replaced by Arrow buffers or arrays.  This helps 
keep rados out of the Arrow public API (even if the Arrow dist was built with 
RADOS support).  For example, this is done with our compression utilities and 
filesystem utilities.  Internal headers can be used if you need templates 
(though I don't see any here).

##########
File path: python/pyarrow/_rados.pxd
##########
@@ -0,0 +1,38 @@
+# Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       I wonder if we need a brand new module here or not.  For example, it 
could be in the dataset module and then it could be an error at use time if 
someone tries to use it.  Currently the parquet file format and csv file format 
live in the dataset module (even though there is already a parquet module and 
csv module).  However, those two are generally guaranteed to be installed.  I'm 
not sure which pattern to follow for rados.

##########
File path: cpp/src/arrow/adapters/arrow-rados-cls/scripts/splitter.py
##########
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       I think this is more of a "skyhook" utility and less of a "Arrow 
utility".  It maybe shouldn't be included in this PR.

##########
File path: cpp/src/arrow/adapters/arrow-rados-cls/docs/architecture.md
##########
@@ -0,0 +1,41 @@
+<!---

Review comment:
       This document is very good for helping to understand the intent of the 
PR but I'm not sure if we want this, as written, to be merged in.  I don't 
think your goal is to embed the Skyhook project inside of Arrow correct?  I 
think the goal is only to include the Arrow-Rados integration (and the Arrow 
Ceph object class)?  I may be misunderstanding things too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to