drabastomek commented on a change in pull request #179:
URL: https://github.com/apache/arrow-site/pull/179#discussion_r791146692
##########
File path:
_posts/2022-01-28-skyhook-bringing-computation-to-storage-with-apache-arrow.md
##########
@@ -0,0 +1,219 @@
+---
+layout: post
+title: "Skyhook: Bringing Computation to Storage with Apache Arrow"
+date: "2022-01-28 00:00:00"
+author: Jayjeet Chakraborty, Carlos Maltzahn, David Li, Tom Drabas
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+CPUs, memory, storage, and network bandwidth get better every year, but
increasingly, they’re improving in different dimensions.
+Processors are faster, but their memory bandwidth hasn’t kept up; meanwhile,
cloud computing has led to storage being separated from applications across a
network link.
+This divergent evolution means we need to rethink where and when we perform
computation to best make use of the resources available to us.
+
+For example, when querying a dataset on a storage system like Ceph or Amazon
S3, all the work of filtering data gets done by the client.
+Data has to be transferred over the network, and then the client has to spend
precious CPU cycles decoding it, only to throw it away in the end due to a
filter.
+While formats like Apache Parquet enable some optimizations here,
fundamentally, the responsibility is all on the client.
Review comment:
I'd drop 'here': "While formats like Apache Parquet enable some
optimizations, fundamentally, the responsibility is all on the client."
##########
File path:
_posts/2022-01-28-skyhook-bringing-computation-to-storage-with-apache-arrow.md
##########
@@ -0,0 +1,219 @@
+Meanwhile, even though the storage system has its own compute capabilities,
it’s relegated to just serving “dumb bytes”.
+
+Thanks to the [Center for Research in Open Source Software][cross] (CROSS) at
the University of California, Santa Cruz, Apache Arrow 7.0.0 includes Skyhook,
an [Arrow Datasets][dataset] extension that solves this problem by using the
storage layer to reduce client resource utilization.
+We’ll examine the developments surrounding Skyhook as well as how Skyhook
works.
+
+## Introducing Programmable Storage
+
+Skyhook is an example of programmable storage: exposing higher-level
functionality from storage systems for clients to build upon.
+This allows us to make better use of existing resources (both hardware and
development effort) in such systems, reduces the implementation burden of
common operations for each client, and enables such operations to scale with
the storage layer.
+
+Historically, big data systems like Apache Hadoop have tried to colocate
computation and storage for efficiency.
+More recently, cloud and distributed computing have disaggregated computation
and storage for flexibility and scalability, but at a performance cost.
+Programmable storage strikes a balance between these goals, allowing some
operations to be run right next to the data while still keeping data and
compute separate at a higher level.
+
+In particular, Skyhook builds on [Ceph][ceph], a distributed storage system
that scales to exabytes of data while being reliable and flexible.
+With its Object Class SDK, Ceph enables programmable storage by allowing
extensions that define new object types with custom functionality.
+
+## Skyhook Architecture
+
+Let’s look at how Skyhook applies these ideas.
+Overall, the idea is simple: the client should be able to ask Ceph to perform
basic operations like decoding files, filtering the data, and selecting columns.
+That way, the work gets done using existing storage cluster resources, which
means it’s both adjacent to the data and can scale with the cluster size.
+Also, this reduces the data transferred over the network, and of course
reduces the client workload.
+
+On the storage system side, Skyhook uses the Ceph Object Class SDK to define
scan operations on data stored in Parquet or Feather format.
+To implement these operations, Skyhook first implements a file system shim in
Ceph’s object storage layer, then uses the existing filtering and projection
capabilities of the Arrow Datasets library on top of that shim.
+
+Then, Skyhook defines a custom “file format” in the Arrow Datasets layer.
+Queries against such files get translated to direct requests to Ceph using
those new operations, bypassing the traditional POSIX file system layer.
+After decoding, filtering, and projection, Ceph directly sends the client
Arrow record batches, minimizing CPU overhead for encoding/decoding—another
optimization Arrow makes possible.
Review comment:
Rewrite? "After decoding, filtering, and projecting, Ceph sends the
Arrow record batches directly to the client,..."
##########
File path:
_posts/2022-01-28-skyhook-bringing-computation-to-storage-with-apache-arrow.md
##########
@@ -0,0 +1,219 @@
+The record batches use Arrow’s compression support to further save bandwidth.
+
+<figure>
+ <img src="{{ site.baseurl }}/img/20220121-skyhook-architecture.png"
+ alt="Skyhook Architecture"
+ width="100%" class="img-responsive">
+ <figcaption markdown="1">
+Skyhook extends Ceph and Arrow Datasets to push queries down to Ceph, reducing
the client workload and network traffic.
+(Figure sourced from [“SkyhookDM is now a part of Apache Arrow!”][medium].)
+ </figcaption>
+</figure>
+
+Skyhook also optimizes how Parquet files in particular are stored.
+Parquet files consist of a series of row groups, which each contain a chunk of
the rows in a file.
+When storing such files, Skyhook either pads or splits them so that each row
group is stored as its own Ceph object.
+By striping or splitting the file in this way, we can parallelize scanning at
sub-file granularity across the Ceph nodes for further performance improvements.
+
+## Applications
+
+In benchmarks, Skyhook has minimal storage-side CPU overhead and virtually
eliminates client-side CPU usage.
+And scaling the storage cluster decreases query latency commensurately.
+For systems like Dask that use the Arrow Datasets API, this means that just by
switching to the Skyhook file format, we can speed up dataset scans, reduce the
amount of data that needs to be transferred, and free up CPU resources for
computations.
+
+<figure>
+ <img src="{{ site.baseurl }}/img/20220121-skyhook-cpu.png"
+ alt="In benchmarks, Skyhook reduces client CPU usage while minimally
impacting storage cluster CPU usage."
+ width="100%" class="img-responsive">
+ <figcaption>
+ Skyhook frees the client CPU to do useful work, while minimally impacting
the work done by the storage machines.
+ The client still does some work in decompressing the LZ4-compressed record
batches sent by Skyhook.
+ (Note that the storage cluster plot is cumulative.)
+ </figcaption>
+</figure>
+
+Of course, the ideas behind Skyhook apply to other systems adjacent to and
beyond Apache Arrow.
+For example, “lakehouse” systems like Apache Iceberg and Delta Lake also build
on distributed storage systems, and can naturally benefit from Skyhook to
offload computation.
+Additionally, in-memory SQL-based query engines like [DuckDB][duckdb], which
integrate seamlessly with Apache Arrow, can benefit from Skyhook by offloading
portions of SQL queries.
+
+## Summary and Acknowledgements
+
+Skyhook, available in Arrow 7.0, builds on research into programmable storage
systems.
Review comment:
Should we use the full version number? 7.0.0?
##########
File path:
_posts/2022-01-28-skyhook-bringing-computation-to-storage-with-apache-arrow.md
##########
@@ -0,0 +1,219 @@
+And scaling the storage cluster decreases query latency commensurately.
Review comment:
I'd cut the 'And' from the beginning and start the sentence as "Scaling
the storage cluster ..."
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]