This is an automated email from the ASF dual-hosted git repository.
lidavidm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/main by this push:
new 6d0aa845484 [Website] Add "Data wants to be free" (#586)
6d0aa845484 is described below
commit 6d0aa845484e66c013087d75e6440de942194af8
Author: David Li <[email protected]>
AuthorDate: Fri Feb 28 00:58:31 2025 -0500
[Website] Add "Data wants to be free" (#586)
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: Ian Cook <[email protected]>
---
_includes/arrow_result_transfer_series.md | 23 +++
_posts/2025-01-10-arrow-result-transfer.md | 2 +
_posts/2025-02-28-data-wants-to-be-free.md | 225 +++++++++++++++++++++++++++++
assets/data_wants_to_be_free/flowchart.png | Bin 0 -> 74155 bytes
4 files changed, 250 insertions(+)
diff --git a/_includes/arrow_result_transfer_series.md
b/_includes/arrow_result_transfer_series.md
new file mode 100644
index 00000000000..dddd4b4dfca
--- /dev/null
+++ b/_includes/arrow_result_transfer_series.md
@@ -0,0 +1,23 @@
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Posts in this series:
+
+1. [How the Apache Arrow Format Accelerates Query Result Transfer]({% link
_posts/2025-01-10-arrow-result-transfer.md %})
+1. [Data Wants to Be Free: Fast Data Exchange with Apache Arrow]({% link
_posts/2025-02-28-data-wants-to-be-free.md %})
diff --git a/_posts/2025-01-10-arrow-result-transfer.md
b/_posts/2025-01-10-arrow-result-transfer.md
index 52ed9610c71..fbadd73c47a 100644
--- a/_posts/2025-01-10-arrow-result-transfer.md
+++ b/_posts/2025-01-10-arrow-result-transfer.md
@@ -35,6 +35,8 @@ limitations under the License.
_This is the first in a series of posts that aims to demystify the use of
Arrow as a data interchange format for databases and query engines._
+{% include arrow_result_transfer_series.md %}
+
<img src="{{ site.baseurl }}/img/arrow-result-transfer/part-1-banner.png"
width="100%" class="img-responsive" alt="" aria-hidden="true">
“Why is this taking so long?”
diff --git a/_posts/2025-02-28-data-wants-to-be-free.md
b/_posts/2025-02-28-data-wants-to-be-free.md
new file mode 100644
index 00000000000..0d64be85ef0
--- /dev/null
+++ b/_posts/2025-02-28-data-wants-to-be-free.md
@@ -0,0 +1,225 @@
+---
+layout: post
+title: "Data Wants to Be Free: Fast Data Exchange with Apache Arrow"
+description: ""
+date: "2025-02-28 00:00:00"
+author: David Li, Ian Cook, Matt Topol
+categories: [application]
+image:
+ path: /img/arrow-result-transfer/part-1-share-image.png
+ height: 705
+ width: 1200
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<style>
+.a-header {
+ color: #984EA3;
+ font-weight: bold;
+}
+.a-data {
+ color: #377EB8;
+ font-weight: bold;
+}
+.a-length {
+ color: #FF7F00;
+ font-weight: bold;
+}
+.a-padding {
+ color: #E41A1C;
+ font-weight: bold;
+}
+</style>
+
+_This is the second in a series of posts that aims to demystify the use of
+Arrow as a data interchange format for databases and query engines._
+
+{% include arrow_result_transfer_series.md %}
+
+As data practitioners, we often find our data “held hostage”. Instead of being
+able to use data as soon as we get it, we have to spend time—time to parse and
+clean up inefficient and messy CSV files, time to wait for an outdated query
+engine to struggle with a few gigabytes of data, and time to wait for the data
+to make it across a socket. It’s that last point we’ll focus on today. In an
+age of multi-gigabit networks, why is it even a problem in the first place?
+And make no mistake, it is a problem—research by Mark Raasveldt and Hannes
+Mühleisen in their [2017
paper](https://doi.org/10.14778/3115404.3115408)[^freepdf]
+found that some systems take over **ten minutes** to transfer a dataset that
+should only take ten *seconds*[^ten].
+
+[^freepdf]: The paper is freely available from
[VLDB](https://www.vldb.org/pvldb/vol10/p1022-muehleisen.pdf).
+[^ten]: Figure 1 in the paper shows Hive and MongoDB taking over 600 seconds
vs the baseline of 10 seconds for netcat to transfer the CSV file. Of course,
that means the comparison is not entirely fair since the CSV file is not being
parsed, but it gives an idea of the magnitudes involved.
+
+Why are we waiting 60 times as long as we need to? [As we've argued before,
+serialization overheads plague our
+tools](https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/)—and
+Arrow can help us here. So let’s make that more concrete: we’ll compare how
+PostgreSQL and Arrow encode the same data to illustrate the impact of the data
+serialization format. Then we’ll tour various ways to build protocols with
+Arrow, like Arrow HTTP and Arrow Flight, and how you might use each of them.
+
+## PostgreSQL vs Arrow: Data Serialization
+
+Let’s compare the [PostgreSQL binary
+format](https://www.postgresql.org/docs/current/sql-copy.html#id-1.9.3.55.9.4)
+and [Arrow
+IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc)
+on the same dataset, and show how Arrow (with all the benefit of hindsight)
+makes better trade-offs than its predecessors.
+
+When you execute a query with PostgreSQL, the client/driver uses the
+PostgreSQL wire protocol to send the query and get back the result. Inside
+that protocol, the result set is encoded with the PostgreSQL binary
format[^textbinary].
+
+[^textbinary]: There is a text format too, and that is often the default used
by many clients. We won't discuss it here.
+
+First, we’ll create a table and fill it with data:
+
+```
+postgres=# CREATE TABLE demo (id BIGINT, val TEXT, val2 BIGINT);
+CREATE TABLE
+postgres=# INSERT INTO demo VALUES (1, 'foo', 64), (2, 'a longer string',
128), (3, 'yet another string', 10);
+INSERT 0 3
+```
+
+We can then use the COPY command to dump the raw binary data from PostgreSQL
into a file:
+
+```
+postgres=# COPY demo TO '/tmp/demo.bin' WITH BINARY;
+COPY 3
+```
+
+Then we can dump the file’s bytes (e.g. with `xxd`) and annotate them based on the
[documentation](https://www.postgresql.org/docs/current/sql-copy.html#id-1.9.3.55.9.4):
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre
class="highlight"><code>00000000: <span class="a-header">50 47 43 4f 50 59 0a
ff PGCOPY..</span> <span class="a-header">COPY signature, flags,</span>
+00000008: <span class="a-header">0d 0a 00 00 00 00 00 00 ........</span>
<span class="a-header">and extension</span>
+00000010: <span class="a-header">00 00 00</span> <span class="a-padding">00
03</span> <span class="a-length">00 00 00</span> <span
class="a-header">...</span><span class="a-padding">..</span><span
class="a-length">...</span> <span class="a-padding">Values in row</span>
+00000018: <span class="a-length">08</span> <span class="a-data">00 00 00 00 00
00 00</span> <span class="a-length">.</span><span
class="a-data">.......</span> <span class="a-length">Length of value</span>
+00000020: <span class="a-data">01</span> <span class="a-length">00 00 00
03</span> <span class="a-data">66 6f 6f</span> <span
class="a-data">.</span><span class="a-length">....</span><span
class="a-data">foo</span> <span class="a-data">Data</span>
+00000028: <span class="a-length">00 00 00 08</span> <span class="a-data">00 00
00 00</span> <span class="a-length">....</span><span class="a-data">....</span>
+00000030: <span class="a-data">00 00 00 40</span> <span class="a-padding">00
03</span> <span class="a-length">00 00</span> <span
class="a-data">...@</span><span class="a-padding">..</span><span
class="a-length">..</span>
+00000038: <span class="a-length">00 08</span> <span class="a-data">00 00 00 00
00 00</span> <span class="a-length">..</span><span class="a-data">......</span>
+00000040: <span class="a-data">00 02</span> <span class="a-length">00 00 00
0f</span> <span class="a-data">61 20</span> <span
class="a-data">..</span><span class="a-length">....</span><span
class="a-data">a </span>
+00000048: <span class="a-data">6c 6f 6e 67 65 72 20 73 longer s</span>
+00000050: <span class="a-data">74 72 69 6e 67</span> <span class="a-length">00
00 00</span> <span class="a-data">tring</span><span class="a-length">...</span>
+00000058: <span class="a-length">08</span> <span class="a-data">00 00 00 00 00
00 00</span> <span class="a-length">.</span><span class="a-data">.......</span>
+00000060: <span class="a-data">80</span> <span class="a-padding">00 03</span>
<span class="a-length">00 00 00 08</span> <span class="a-data">00</span> <span
class="a-data">.</span><span class="a-padding">..</span><span
class="a-length">....</span><span class="a-data">.</span>
+00000068: <span class="a-data">00 00 00 00 00 00 03</span> <span
class="a-length">00</span> <span class="a-data">.......</span><span
class="a-length">.</span>
+00000070: <span class="a-length">00 00 12</span> <span class="a-data">79 65 74
20 61</span> <span class="a-length">...</span><span class="a-data">yet a</span>
+00000078: <span class="a-data">6e 6f 74 68 65 72 20 73 nother s</span>
+00000080: <span class="a-data">74 72 69 6e 67</span> <span class="a-length">00
00 00</span> <span class="a-data">tring</span><span class="a-length">...</span>
+00000088: <span class="a-length">08</span> <span class="a-data">00 00 00 00 00
00 00</span> <span class="a-length">.</span><span class="a-data">.......</span>
+00000090: <span class="a-data">0a</span> <span class="a-padding">ff ff</span>
<span class="a-data">.</span><span class="a-padding">..</span>
<span class="a-padding">End of stream</span></code></pre></div></div>
+
+Honestly, PostgreSQL’s binary format is quite understandable and looks compact at
first glance. It’s just a series of length-prefixed fields. But a closer look
isn’t so favorable. **PostgreSQL has overhead proportional to the number of
rows and columns**:
+
+* Every row has a 2-byte prefix for the number of values in the row. *But the
data is tabular—we already know this info, and it doesn’t change!*
+* Every value of every row has a 4-byte prefix for the length of the following
data, or -1 if the value is NULL. *But we know the data types, and those don’t
change—plus, values of most types have a fixed, known length!*
+* All values are big-endian. *But most of our devices are little-endian, so
the data has to be converted.*
+
+For example, a single column of int32 values would have 4 bytes of data and 6
bytes of overhead per row—**60% is “wasted!”**[^1] The ratio gets a little
better with more columns (but not with more rows); in the limit we approach
“only” 50% overhead. And then each of the values has to be converted (even if
endian-swapping is trivial). To PostgreSQL’s credit, its format is at least
cheap and easy to parse—[other
formats](https://protobuf.dev/programming-guides/encoding/) get fancy with
[...]
+
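+To make the accounting concrete, here’s a quick back-of-the-envelope
+calculation (ours, not from the paper) for that single int32 column:
+
+```python
+# Per row: a 2-byte field count plus a 4-byte length prefix on the value,
+# framing just 4 bytes of actual int32 data.
+rows = 1_000_000
+data_bytes = 4 * rows
+overhead_bytes = (2 + 4) * rows
+print(overhead_bytes / (data_bytes + overhead_bytes))  # 0.6, i.e. 60% overhead
+```
+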
+How does Arrow compare? We can use
[ADBC](https://arrow.apache.org/adbc/current/driver/postgresql.html) to pull
the PostgreSQL table into an Arrow table, then annotate it like before:
+
+```console
+>>> import adbc_driver_postgresql.dbapi
+>>> import pyarrow.ipc
+>>> conn = adbc_driver_postgresql.dbapi.connect("...")
+>>> cur = conn.cursor()
+>>> cur.execute("SELECT * FROM demo")
+>>> data = cur.fetchallarrow()
+>>> writer = pyarrow.ipc.new_stream("demo.arrows", data.schema)
+>>> writer.write_table(data)
+>>> writer.close()
+```
+
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre
class="highlight"><code>00000000: <span class="a-length">ff ff ff ff d8 00 00
00 ........ IPC message length</span>
+00000008: <span class="a-header">10 00 00 00 00 00 0a 00 ........ IPC
schema</span>
+⋮ <span class="a-header">(208 bytes)</span>
+000000e0: <span class="a-length">ff ff ff ff f8 00 00 00 ........ IPC
message length</span>
+000000e8: <span class="a-header">14 00 00 00 00 00 00 00 ........ IPC record
batch</span>
+⋮ <span class="a-header">(240 bytes)</span>
+000001e0: <span class="a-data">01 00 00 00 00 00 00 00 ........ Data for
column #1</span>
+000001e8: <span class="a-data">02 00 00 00 00 00 00 00 ........</span>
+000001f0: <span class="a-data">03 00 00 00 00 00 00 00 ........</span>
+000001f8: <span class="a-length">00 00 00 00 03 00 00 00 ........ String
offsets</span>
+00000200: <span class="a-length">12 00 00 00 24 00 00 00 ....$...</span>
+00000208: <span class="a-data">66 6f 6f 61 20 6c 6f 6e fooa lon Data for
column #2</span>
+00000210: <span class="a-data">67 65 72 20 73 74 72 69 ger stri</span>
+00000218: <span class="a-data">6e 67 79 65 74 20 61 6e ngyet an</span>
+00000220: <span class="a-data">6f 74 68 65 72 20 73 74 other st</span>
+00000228: <span class="a-data">72 69 6e 67</span> <span class="a-padding">00
00 00 00</span> <span class="a-data">ring</span><span class="a-padding">....
Alignment padding</span>
+00000230: <span class="a-data">40 00 00 00 00 00 00 00 @....... Data for
column #3</span>
+00000238: <span class="a-data">80 00 00 00 00 00 00 00 ........</span>
+00000240: <span class="a-data">0a 00 00 00 00 00 00 00 ........</span>
+00000248: <span class="a-length">ff ff ff ff 00 00 00 00 ........ IPC
end-of-stream</span></code></pre></div></div>
+
+Arrow looks quite…intimidating…at first glance. There’s a giant header that
doesn’t seem related to our dataset at all, plus mysterious padding that seems to
exist solely to take up space. But the important thing is that **the overhead
is fixed**. Whether you’re transferring one row or a billion, the overhead
doesn’t change. And unlike PostgreSQL, **no per-value parsing is required**.
+
+Instead of putting lengths of values everywhere, Arrow groups values of the
same column (and hence the same type) together, so it just needs the length of the
buffer[^header]. Overhead isn’t added where it isn’t needed: strings still
have a length per value, but fixed-width values don’t. Nullability is instead
stored in a bitmap, which is omitted if there aren’t any NULL values (as is the
case here). Because of that, adding more rows of data doesn’t increase the
overhead; instead, the more data you have, the less you pay.
+
+[^header]: That's what's being stored in that ginormous header (among other
things)—the lengths of all the buffers.
+
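+You can see this layout directly from Python (a small illustration, assuming
+pyarrow is installed):
+
+```python
+import pyarrow as pa
+
+arr = pa.array([1, 2, 3], type=pa.int64())
+# buffers() returns [validity bitmap, data]. The bitmap is None because
+# there are no NULLs; the data buffer is exactly 3 * 8 = 24 bytes.
+print([buf.size if buf else None for buf in arr.buffers()])  # [None, 24]
+```
+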
+Even the header isn’t actually the disadvantage it looks like. The header
contains the schema, which makes the data stream self-describing. With
PostgreSQL, you need to get the schema from somewhere else. So we aren’t making
an apples-to-apples comparison in the first place: PostgreSQL still has to
transfer the schema; it’s just not part of the “binary format” that we’re
looking at here[^binaryheader].
+
+[^binaryheader]: And conversely, the PGCOPY header is specific to the COPY
command we executed to get a bulk response.
+
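+To see the self-description in action, here’s a sketch that reads just the
+schema back out of the `demo.arrows` file we wrote above:
+
+```python
+import pyarrow.ipc
+
+# The schema travels in the stream header, so no out-of-band metadata
+# is needed to interpret the bytes that follow.
+with open("demo.arrows", "rb") as f:
+    reader = pyarrow.ipc.open_stream(f)
+    print(reader.schema)
+```
+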
+There’s actually another problem with PostgreSQL: alignment. The 2-byte field
count at the start of every row means all the 8-byte integers after it are
unaligned. And that requires extra effort to handle properly (e.g. explicit
unaligned load idioms), lest you suffer [undefined
behavior](https://port70.net/~nsz/c/c11/n1570.html#6.3.2.3p7), a performance
penalty, or even a runtime error. Arrow, on the other hand, strategically adds
some padding to keep data aligned, and lets you use litt [...]
+
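+As an illustration, pyarrow’s default allocator goes further and keeps
+buffers 64-byte aligned (a stricter guarantee than the values themselves need):
+
+```python
+import pyarrow as pa
+
+arr = pa.array([64, 128, 10], type=pa.int64())
+data_buffer = arr.buffers()[1]  # buffers() = [validity bitmap, data]
+# Aligned buffers mean fixed-width values can be read with plain loads.
+print(data_buffer.address % 64)  # 0
+```
+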
+That’s the benefit of Arrow being a standardized data format. By using Arrow
for serialization, data coming off the wire is already in Arrow format, and can
furthermore be directly passed on to [DuckDB](https://duckdb.org),
[pandas](https://pandas.pydata.org), [polars](https://pola.rs),
[cuDF](https://docs.rapids.ai/api/cudf/stable/),
[DataFusion](https://datafusion.apache.org), or any number of systems.
Meanwhile, even if the PostgreSQL format addressed these problems—adding
padding to [...]
+
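+For instance, assuming the `data` table from the ADBC example above and that
+DuckDB is installed, handing the data off requires no conversion step at all:
+
+```python
+import duckdb
+
+# DuckDB's replacement scans let SQL reference the local pyarrow Table
+# `data` directly: no copying or re-serialization required.
+print(duckdb.sql("SELECT count(*) FROM data").fetchall())  # [(3,)]
+```
+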
+Even if you really did want to use the PostgreSQL binary format
everywhere[^3], the documentation rather unfortunately points you to the C
source code as the reference. Arrow, on the other hand, has a
[specification](https://github.com/apache/arrow/tree/main/format),
[documentation](https://arrow.apache.org/docs/format/Columnar.html), and
multiple [implementations](https://arrow.apache.org/docs/#implementations)
(including third-party ones) across a dozen languages for you to pick up [...]
+
+Now, we don’t mean to pick on PostgreSQL here. Obviously, PostgreSQL is a
full-featured database with a storied history, a different set of goals and
constraints than Arrow, and many happy users. Arrow isn’t trying to compete in
that space. But their domains do intersect. PostgreSQL’s wire protocol has
[become a de facto
standard](https://datastation.multiprocess.io/blog/2022-02-08-the-world-of-postgresql-wire-compatibility.html),
with even brand new products like Google’s AlloyDB using [...]
+
+## A quiver of Arrow (projects)
+
+So if Arrow is so great, how can we actually use it to build our own
protocols? Luckily, Arrow comes with a variety of building blocks for different
situations.
+
+* We just talked about [**Arrow
IPC**](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc)
before. Where Arrow is the in-memory format defining how arrays of data are
laid out, er, in memory, Arrow IPC defines how to serialize and deserialize
Arrow data so it can be sent somewhere else—whether that means being written to
a file, to a socket, into a shared buffer, or otherwise. Arrow IPC organizes
data as a sequence of messages, making it e [...]
+* [**Arrow HTTP**](https://github.com/apache/arrow-experiments/tree/main/http)
is “just” streaming Arrow IPC over HTTP. The Arrow community is working on
standardizing it, so that different clients agree on how exactly to do this.
There are examples of clients and servers across several languages, of how to use
HTTP Range requests, and of using multipart/mixed requests to send combined JSON and
Arrow responses (a minimal server sketch follows this list). While not a full
protocol in and of itself, it’ll fit right in when building RE [...]
+* [**Dissociated
IPC**](https://arrow.apache.org/docs/format/DissociatedIPC.html) combines Arrow
IPC with advanced network transports like [UCX](https://openucx.org/) and
[libfabric](https://ofiwg.github.io/libfabric/). For those who require the
absolute best performance and have the specialized hardware to boot, this
allows you to send Arrow data at full throttle, taking advantage of
scatter-gather, InfiniBand, and more.
+* [**Arrow Flight SQL**](https://arrow.apache.org/docs/format/FlightSql.html)
is a fully defined protocol for accessing relational databases. Think of it as
an alternative to the full PostgreSQL wire protocol: it defines how to connect
to a database, execute queries, fetch results, view the catalog, and so on. For
database developers, Flight SQL provides a fully Arrow-native protocol with
clients for several programming languages and drivers for ADBC, JDBC, and
ODBC—all of which you don’ [...]
+* And finally, [**ADBC**](https://arrow.apache.org/docs/format/ADBC.html)
actually isn’t a protocol. Instead, it’s an API abstraction layer for working
with databases (like JDBC and ODBC—bet you didn’t see that coming) that’s
Arrow-native and doesn’t require transposing or converting columnar data back
and forth. ADBC gives you a single API to access data from multiple databases,
whether they use Flight SQL or something else under the hood, and if a
conversion is absolutely necessary, A [...]
+
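+To give a flavor of these building blocks, here’s a minimal Arrow-over-HTTP
+server sketch of our own (not one of the official examples), using only
+Python’s standard library and pyarrow:
+
+```python
+import io
+from http.server import BaseHTTPRequestHandler, HTTPServer
+
+import pyarrow as pa
+import pyarrow.ipc
+
+TABLE = pa.table({"id": [1, 2, 3], "val": ["foo", "bar", "baz"]})
+
+class ArrowHandler(BaseHTTPRequestHandler):
+    def do_GET(self):
+        # Serialize the table as an Arrow IPC stream.
+        sink = io.BytesIO()
+        with pyarrow.ipc.new_stream(sink, TABLE.schema) as writer:
+            writer.write_table(TABLE)
+        body = sink.getvalue()
+        self.send_response(200)
+        # Media type registered for Arrow IPC streams.
+        self.send_header("Content-Type", "application/vnd.apache.arrow.stream")
+        self.send_header("Content-Length", str(len(body)))
+        self.end_headers()
+        self.wfile.write(body)
+
+HTTPServer(("localhost", 8008), ArrowHandler).serve_forever()
+```
+
+A client can fetch this response and hand the body straight to
+`pyarrow.ipc.open_stream`; no custom parsing needed.
+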
+So to summarize:
+
+* If you’re *using* a database or other data system, you want
[**ADBC**](https://arrow.apache.org/adbc/) (client sketch below).
+* If you’re *building* a database, you want [**Arrow Flight
SQL**](https://arrow.apache.org/docs/format/FlightSql.html).
+* If you’re working with specialized networking hardware (you’ll know if you
are—that stuff doesn’t come cheap), you want the [**Dissociated IPC
Protocol**](https://arrow.apache.org/docs/format/DissociatedIPC.html).
+* If you’re *designing* a REST-ish API, you want [**Arrow
HTTP**](https://github.com/apache/arrow-experiments/tree/main/http).
+* And otherwise, you can roll your own with [**Arrow
IPC**](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc).
+
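+And to show the *using* side, here’s a client sketch against a hypothetical
+Flight SQL endpoint (the URI is made up; assumes the `adbc_driver_flightsql`
+package is installed):
+
+```python
+import adbc_driver_flightsql.dbapi
+
+# Hypothetical endpoint; substitute your server's URI.
+with adbc_driver_flightsql.dbapi.connect("grpc://localhost:31337") as conn:
+    with conn.cursor() as cur:
+        cur.execute("SELECT * FROM demo")
+        print(cur.fetchallarrow())  # results arrive as an Arrow table
+```
+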
+![Flowchart of which Arrow project to use]({{ site.baseurl }}/assets/data_wants_to_be_free/flowchart.png){:class="img-responsive"
width="100%"}
+
+## Conclusion
+
+Existing client protocols can be wasteful. Arrow offers better efficiency and
avoids design pitfalls of the past. And Arrow makes it easy to build and
consume data APIs with a variety of standards like Arrow IPC, Arrow HTTP, and
ADBC. By using Arrow serialization in protocols, everyone benefits from
faster and simpler data access, and we can avoid accidentally holding data
captive behind slow and inefficient interfaces.
+
+---
+
+[^1]: Of course, it’s not fully wasted, as null/not-null is data as well. But
for accounting purposes, we’ll be consistent and call lengths, padding,
bitmaps, etc. “overhead”.
+
+[^2]: And if your data really benefits from heavy compression, you can always
use something like Apache Parquet, which implements lots of fancy encodings to
save space and can still be decoded to Arrow data reasonably quickly.
+
+[^3]: [And people
do…](https://datastation.multiprocess.io/blog/2022-02-08-the-world-of-postgresql-wire-compatibility.html)
+
+[^4]: [We have some experience with the PostgreSQL wire protocol,
too.](https://github.com/apache/arrow-adbc/blob/ed18b8b221af23c7b32312411da10f6532eb3488/c/driver/postgresql/copy/reader.h)
diff --git a/assets/data_wants_to_be_free/flowchart.png
b/assets/data_wants_to_be_free/flowchart.png
new file mode 100644
index 00000000000..93a96722d72
Binary files /dev/null and b/assets/data_wants_to_be_free/flowchart.png differ