wesm commented on code in PR #569:
URL: https://github.com/apache/arrow-site/pull/569#discussion_r1906078943
##########
_posts/2025-01-07-arrow-result-transfer.md:
##########
@@ -0,0 +1,124 @@
+---
+layout: post
+title: "How the Apache Arrow Format Accelerates Query Result Transfer"
+date: "2025-01-07 00:00:00"
+author: Ian Cook, David Li, Matt Topol
+categories: [application]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+_This is the first in a series of posts that aims to demystify the use of
Arrow as a data interchange format for databases and query engines._
+
+_________________
+
+“Why is this taking so long?”
+
+This is a question that data practitioners often ponder while waiting for
query results. It’s a question with many possible answers. Maybe your data
source is poorly partitioned. Maybe your SaaS data warehouse is undersized.
Maybe the query optimizer failed to translate your SQL statement into an
efficient execution plan.
+
+But surprisingly often, the answer is that you are using an inefficient
protocol to transfer query results to the client. In a [2017
paper](https://www.vldb.org/pvldb/vol10/p1022-muehleisen.pdf), Mark Raasveldt
and Hannes Mühleisen observed that query result transfer time often dominates
query execution time, especially for larger results. However, the bottleneck is
not where you might expect.
+
+Transferring a query result from a source to a destination involves three
steps:
+
+1. At the source, serialize the result from its original format into a
transfer format.
+2. Transmit the data over the network in the transfer format.[^1]
+3. At the destination, deserialize the transfer format into the target format.
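+
+To make these steps concrete, here is a schematic sketch in Python, with a
+simple row-oriented text format standing in for the transfer format and the
+network transmission itself elided. (CSV here is just an illustration; legacy
+protocols use their own row-oriented wire formats, but the per-value work is
+similar.)
+
+```python
+import csv
+import io
+
+# The query result at the source, in the engine's native format.
+rows = [(1, "Nashville"), (2, "New York"), (3, "Chicago")]
+
+# Step 1: serialize the result, value by value, into the transfer format.
+out = io.StringIO()
+csv.writer(out).writerows(rows)
+payload = out.getvalue()
+
+# Step 2: transmit `payload` over the network (elided here).
+
+# Step 3: deserialize the transfer format into the target format,
+# again touching every value individually.
+received = [(int(a), b) for a, b in csv.reader(io.StringIO(payload))]
+assert received == rows
+```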
+
+In the era of slower networks, the transmission step was usually the
bottleneck, so there was little incentive to speed up the serialization and
deserialization steps. Instead, the emphasis was on making the transferred data
smaller, typically using compression, to reduce the transmission time. It was
during this era that the most widely used database connectivity APIs (ODBC and
JDBC) and database client protocols (such as the MySQL client/server protocol
and the PostgreSQL frontend/backend protocol) were designed. But as networks
have become faster and transmission times have dropped, the bottleneck has
shifted to the serialization and deserialization steps.[^2] This is especially
true for queries that produce the larger result sizes characteristic of many
data engineering and data analytics pipelines.
+
+Yet many query results today continue to flow through legacy APIs and
protocols that add massive serialization and deserialization (“ser/de”)
overheads by forcing data into inefficient transfer formats. In a [2021
paper](https://www.vldb.org/pvldb/vol14/p534-li.pdf), Tianyu Li et al.
presented an example using ODBC and the PostgreSQL protocol in which 99.996% of
total query time was spent on ser/de. That is arguably an extreme case, but we
have observed 90% or higher in many real-world cases. Today, for data
engineering and data analytics queries, there is a strong incentive to choose a
transfer format that speeds up ser/de.
+
+Enter Arrow.
+
+The Apache Arrow open source project defines a [data
format](https://arrow.apache.org/docs/format/Columnar.html) that is designed to
speed up—and in some cases eliminate—ser/de in query result transfer. Since its
creation in 2016, the Arrow format and the multi-language toolbox built around
it have gained widespread use, but the technical details of how Arrow is able
to slash ser/de overheads remain poorly understood. To help address this, we
outline five key attributes of the Arrow format that make this possible.
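+
+For contrast, here is a minimal sketch of the same three transfer steps using
+the Arrow IPC stream format via the `pyarrow` library. The sections below
+explain the attributes of the format that make this path so much cheaper:
+
+```python
+import pyarrow as pa
+
+# The query result at the source, already in Arrow format.
+result = pa.table({
+    "id": [1, 2, 3],
+    "city": ["Nashville", "New York", "Chicago"],
+})
+
+# Step 1: serialize the result into the transfer format (Arrow IPC stream).
+sink = pa.BufferOutputStream()
+with pa.ipc.new_stream(sink, result.schema) as writer:
+    writer.write_table(result)
+payload = sink.getvalue()
+
+# Step 2: transmit `payload` over the network (elided here).
+
+# Step 3: deserialize; Arrow reconstructs the table from the received
+# bytes without converting each value individually.
+received = pa.ipc.open_stream(payload).read_all()
+assert received.equals(result)
+```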
+
+### 1. The Arrow format is columnar.
+
+In a columnar (column-oriented) data format, the values for each column in the
data are held in contiguous blocks of memory. This is in contrast to
row-oriented formats, in which the values for each row are held in contiguous
blocks of memory.
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/arrow-result-transfer/part-1-figure-1-row-vs-column-layout.png"
width="100%" class="img-responsive" alt="Figure 1: An illustration of
row-oriented and column-oriented physical memory layouts of a logical table
containing three rows and five columns.">
+ <figcaption>Figure 1: An illustration of row-oriented and column-oriented
physical memory layouts of a logical table containing three rows and five
columns.</figcaption>
+</figure>
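+
+You can see the columnar layout directly in Arrow. In this small `pyarrow`
+sketch, each column of a table is backed by its own contiguous buffer(s) of
+memory, independent of the other columns:
+
+```python
+import pyarrow as pa
+
+table = pa.table({
+    "id": [1, 2, 3],
+    "price": [9.99, 14.50, 3.25],
+})
+
+# Print the sizes of the contiguous buffers backing each column.
+# (A column may also carry a validity buffer when it contains nulls.)
+for name in table.column_names:
+    buffers = table.column(name).chunk(0).buffers()
+    print(name, [buf.size for buf in buffers if buf is not None])
+```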
+
+High-performance analytic databases, data warehouses, query engines, and
storage systems have converged on columnar architecture because it speeds up
the most common types of analytic queries. Examples of modern columnar query
systems include Amazon Redshift, ClickHouse, Databricks Photon Engine, DuckDB,
Google BigQuery, Microsoft Azure Synapse Analytics, Snowflake, and Voltron Data
Theseus.
Review Comment:
That's interesting. It is the "OG" modern columnar analytic database that
kicked off a wave of other columnar analytics DBMSs, so I think it's fine to
leave things as they are.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]