This is an automated email from the ASF dual-hosted git repository.
lidavidm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/master by this push:
new 36ef56c ARROW-15339: [Website] Add Skyhook blog post
36ef56c is described below
commit 36ef56c004fefb2a367ea1adfbba129e44656018
Author: David Li <[email protected]>
AuthorDate: Mon Jan 31 14:39:55 2022 -0500
ARROW-15339: [Website] Add Skyhook blog post
Closes #179 from lidavidm/arrow-15339:
Authored-by: David Li <[email protected]>
Signed-off-by: David Li <[email protected]>
---
...ing-computation-to-storage-with-apache-arrow.md | 219 +++++++++++++++++++++
css/main.scss | 9 +
img/20220131-skyhook-architecture.png | Bin 0 -> 163992 bytes
img/20220131-skyhook-cpu.png | Bin 0 -> 124365 bytes
4 files changed, 228 insertions(+)
diff --git
a/_posts/2022-01-31-skyhook-bringing-computation-to-storage-with-apache-arrow.md
b/_posts/2022-01-31-skyhook-bringing-computation-to-storage-with-apache-arrow.md
new file mode 100644
index 0000000..b7c76c2
--- /dev/null
+++
b/_posts/2022-01-31-skyhook-bringing-computation-to-storage-with-apache-arrow.md
@@ -0,0 +1,219 @@
+---
+layout: post
+title: "Skyhook: Bringing Computation to Storage with Apache Arrow"
+date: "2022-01-31 00:00:00"
+author: Jayjeet Chakraborty, Carlos Maltzahn, David Li, Tom Drabas
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+CPUs, memory, storage, and network bandwidth get better every year, but
increasingly, they’re improving in different dimensions.
+Processors are faster, but their memory bandwidth hasn’t kept up; meanwhile,
cloud computing has led to storage being separated from applications across a
network link.
+This divergent evolution means we need to rethink where and when we perform
computation to best make use of the resources available to us.
+
+For example, when querying a dataset on a storage system like Ceph or Amazon
S3, all the work of filtering data gets done by the client.
+Data has to be transferred over the network, and then the client has to spend
precious CPU cycles decoding it, only to throw it away in the end due to a
filter.
+While formats like Apache Parquet enable some optimizations, fundamentally,
the responsibility is all on the client.
+Meanwhile, even though the storage system has its own compute capabilities,
it’s relegated to just serving “dumb bytes”.
+
+Thanks to the [Center for Research in Open Source Software][cross] (CROSS) at
the University of California, Santa Cruz, Apache Arrow 7.0.0 includes Skyhook,
an [Arrow Datasets][dataset] extension that solves this problem by using the
storage layer to reduce client resource utilization.
+We’ll examine the developments surrounding Skyhook as well as how Skyhook
works.
+
+## Introducing Programmable Storage
+
+Skyhook is an example of programmable storage: exposing higher-level
functionality from storage systems for clients to build upon.
+This allows us to make better use of existing resources (both hardware and
development effort) in such systems, reduces the implementation burden of
common operations for each client, and enables such operations to scale with
the storage layer.
+
+Historically, big data systems like Apache Hadoop have tried to colocate
computation and storage for efficiency.
+More recently, cloud and distributed computing have disaggregated computation
and storage for flexibility and scalability, but at a performance cost.
+Programmable storage strikes a balance between these goals, allowing some
operations to be run right next to the data while still keeping data and
compute separate at a higher level.
+
+In particular, Skyhook builds on [Ceph][ceph], a distributed storage system
that scales to exabytes of data while being reliable and flexible.
+With its Object Class SDK, Ceph enables programmable storage by allowing
extensions that define new object types with custom functionality.
+
+## Skyhook Architecture
+
+Let’s look at how Skyhook applies these ideas.
+Overall, the idea is simple: the client should be able to ask Ceph to perform
basic operations like decoding files, filtering the data, and selecting columns.
+That way, the work gets done using existing storage cluster resources, which
means it’s both adjacent to the data and can scale with the cluster size.
+Also, this reduces the data transferred over the network, and of course
reduces the client workload.
+
+On the storage system side, Skyhook uses the Ceph Object Class SDK to define
scan operations on data stored in Parquet or Feather format.
+To implement these operations, Skyhook first implements a file system shim in
Ceph’s object storage layer, then uses the existing filtering and projection
capabilities of the Arrow Datasets library on top of that shim.
+
+Then, Skyhook defines a custom “file format” in the Arrow Datasets layer.
+Queries against such files get translated to direct requests to Ceph using
those new operations, bypassing the traditional POSIX file system layer.
+After decoding, filtering, and projecting, Ceph sends the Arrow record batches
directly to the client, minimizing CPU overhead for encoding/decoding—another
optimization Arrow makes possible.
+The record batches use Arrow’s compression support to further save bandwidth.
+
+<figure>
+ <img src="{{ site.baseurl }}/img/20220131-skyhook-architecture.png"
+ alt="Skyhook Architecture"
+ width="100%" class="img-responsive">
+ <figcaption markdown="1">
+Skyhook extends Ceph and Arrow Datasets to push queries down to Ceph, reducing
the client workload and network traffic.
+(Figure sourced from [“SkyhookDM is now a part of Apache Arrow!”][medium].)
+ </figcaption>
+</figure>
+
+Skyhook also optimizes how Parquet files in particular are stored.
+Parquet files consist of a series of row groups, which each contain a chunk of
the rows in a file.
+When storing such files, Skyhook either pads or splits them so that each row
group is stored as its own Ceph object.
+By striping or splitting the file in this way, we can parallelize scanning at
sub-file granularity across the Ceph nodes for further performance improvements.
+
+## Applications
+
+In benchmarks, Skyhook has minimal storage-side CPU overhead and virtually
eliminates client-side CPU usage.
+Scaling the storage cluster decreases query latency commensurately.
+For systems like Dask that use the Arrow Datasets API, this means that just by
switching to the Skyhook file format, we can speed up dataset scans, reduce the
amount of data that needs to be transferred, and free up CPU resources for
computations.
+
+<figure>
+ <img src="{{ site.baseurl }}/img/20220131-skyhook-cpu.png"
+ alt="In benchmarks, Skyhook reduces client CPU usage while minimally
impacting storage cluster CPU usage."
+ width="100%" class="img-responsive">
+ <figcaption>
+ Skyhook frees the client CPU to do useful work, while minimally impacting
the work done by the storage machines.
+ The client still does some work in decompressing the LZ4-compressed record
batches sent by Skyhook.
+ (Note that the storage cluster plot is cumulative.)
+ </figcaption>
+</figure>
+
+Of course, the ideas behind Skyhook apply to other systems adjacent to and
beyond Apache Arrow.
+For example, “lakehouse” systems like Apache Iceberg and Delta Lake also build
on distributed storage systems, and can naturally benefit from Skyhook to
offload computation.
+Additionally, in-memory SQL-based query engines like [DuckDB][duckdb], which
integrate seamlessly with Apache Arrow, can benefit from Skyhook by offloading
portions of SQL queries.
+
+## Summary and Acknowledgements
+
+Skyhook, available in Arrow 7.0.0, builds on research into programmable
storage systems.
+By pushing filters and projections to the storage layer, we can speed up
dataset scans by freeing precious CPU resources on the client, reducing the
amount of data sent across the network, and better utilizing the scalability of
systems like Ceph.
+To get started, just [build Arrow][arrow-build] with Skyhook enabled, deploy
the Skyhook object class extensions to Ceph (see “Usage” in the [announcement
post][medium]), and then use the `SkyhookFileFormat` to construct an Arrow
dataset.
+A small code example is shown here.
+
+```cpp
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <arrow/compute/api.h>
+#include <arrow/dataset/api.h>
+#include <arrow/filesystem/api.h>
+#include <arrow/table.h>
+#include <skyhook/client/file_skyhook.h>
+
+#include <cstdlib>
+#include <iostream>
+#include <memory>
+#include <string>
+
+namespace cp = arrow::compute;
+namespace ds = arrow::dataset;
+namespace fs = arrow::fs;
+
+// Demonstrate reading a dataset via Skyhook.
+arrow::Status ScanDataset() {
+ // Configure SkyhookFileFormat to connect to our Ceph cluster.
+ std::string ceph_config_path = "/etc/ceph/ceph.conf";
+ std::string ceph_data_pool = "cephfs_data";
+ std::string ceph_user_name = "client.admin";
+ std::string ceph_cluster_name = "ceph";
+ std::string ceph_cls_name = "skyhook";
+ std::shared_ptr<skyhook::RadosConnCtx> rados_ctx =
+ std::make_shared<skyhook::RadosConnCtx>(ceph_config_path, ceph_data_pool,
+ ceph_user_name,
ceph_cluster_name,
+ ceph_cls_name);
+ ARROW_ASSIGN_OR_RAISE(auto format,
+ skyhook::SkyhookFileFormat::Make(rados_ctx,
"parquet"));
+
+ // Create the filesystem.
+ std::string root;
+ ARROW_ASSIGN_OR_RAISE(auto fs,
fs::FileSystemFromUri("file:///mnt/cephfs/nyc", &root));
+
+ // Create our dataset.
+ fs::FileSelector selector;
+ selector.base_dir = root;
+ selector.recursive = true;
+
+ ds::FileSystemFactoryOptions options;
+ options.partitioning = std::make_shared<ds::HivePartitioning>(
+ arrow::schema({arrow::field("payment_type", arrow::int32()),
+ arrow::field("VendorID", arrow::int32())}));
+ ARROW_ASSIGN_OR_RAISE(auto factory,
+ ds::FileSystemDatasetFactory::Make(fs,
std::move(selector),
+ std::move(format),
options));
+
+ ds::InspectOptions inspect_options;
+ ds::FinishOptions finish_options;
+ ARROW_ASSIGN_OR_RAISE(auto schema, factory->Inspect(inspect_options));
+ ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish(finish_options));
+
+ // Scan the dataset.
+ auto filter = cp::greater(cp::field_ref("payment_type"), cp::literal(2));
+ ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
+ ARROW_RETURN_NOT_OK(scanner_builder->Filter(filter));
+ ARROW_RETURN_NOT_OK(scanner_builder->UseThreads(true));
+ ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
+
+ ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());
+
+ std::cout << "Got " << table->num_rows() << " rows" << std::endl;
+ return arrow::Status::OK();
+}
+
+int main(int, char**) {
+ auto status = ScanDataset();
+ if (!status.ok()) {
+ std::cerr << status.message() << std::endl;
+ return EXIT_FAILURE;
+ }
+ return EXIT_SUCCESS;
+}
+```
+
+We would like to acknowledge Ivo Jimenez, Jeff LeFevre, Michael Sevilla, and
Noah Watkins for their contributions to this project.
+
+This work was supported in part by the National Science Foundation under
Cooperative Agreement OAC-1836650, the US Department of Energy ASCR
DE-NA0003525 (FWP 20-023266), and the Center for Research in Open Source
Software ([cross.ucsc.edu][cross]).
+
+For more information, see these papers and articles:
+
+- [SkyhookDM: Data Processing in Ceph with Programmable Storage.][usenix]
(USENIX _;login:_ issue Summer 2020, Vol. 45, No. 2)
+- [SkyhookDM is now a part of Apache Arrow!][medium] (Medium)
+- [Towards an Arrow-native Storage System.][arxiv] (arXiv.org)
+
+[arrow-build]: https://arrow.apache.org/docs/developers/cpp/building.html
+[arxiv]: https://arxiv.org/abs/2105.09894
+[ceph]: https://ceph.io/en/
+[cross]: https://cross.ucsc.edu/
+[duckdb]: {{ site.baseurl }}/{% link _posts/2021-12-3-arrow-duckdb.md %}
+[dataset]: https://arrow.apache.org/docs/cpp/dataset.html
+[medium]:
https://jayjeetc.medium.com/skyhookdm-is-now-a-part-of-apache-arrow-e5d7b9a810ba
+[usenix]: https://www.usenix.org/publications/login/summer2020/lefevre
diff --git a/css/main.scss b/css/main.scss
index 345069b..202f977 100644
--- a/css/main.scss
+++ b/css/main.scss
@@ -98,6 +98,15 @@ p a code {
color: inherit;
}
+figure {
+ figcaption {
+ font-style: italic;
+ margin: 0 auto;
+ text-align: center;
+ width: 90%;
+ }
+}
+
.social-badges iframe {
vertical-align: middle;
}
diff --git a/img/20220131-skyhook-architecture.png
b/img/20220131-skyhook-architecture.png
new file mode 100644
index 0000000..117dea3
Binary files /dev/null and b/img/20220131-skyhook-architecture.png differ
diff --git a/img/20220131-skyhook-cpu.png b/img/20220131-skyhook-cpu.png
new file mode 100644
index 0000000..f0c898c
Binary files /dev/null and b/img/20220131-skyhook-cpu.png differ