Re: [PR] Blog post: Fast Streaming Inserts in DuckDB with ADBC [arrow-site]

via GitHub Thu, 06 Mar 2025 13:31:34 -0800


ianmcook commented on code in PR #609:
URL: https://github.com/apache/arrow-site/pull/609#discussion_r1984080948



##########
_posts/2025-03-04-fast-streaming-inserts-in-duckdb-with-adbc.md:
##########
@@ -0,0 +1,201 @@
+---
+layout: post
+title: "Fast Streaming Inserts in DuckDB with ADBC"
+description: "ADBC enables high throughput insertion into DuckDB"
+date: "2025-03-04 00:00:00"
+author: loicalleyne
+categories: [application]
+image:
+  path: /img/adbc-duckdb/adbc-duckdb.png
+  height: 560
+  width: 1200
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Fast Streaming Inserts in DuckDB with ADBC
+
+<img src="{{ site.baseurl }}/img/adbc-duckdb/adbc-duckdb.png" width="100%" 
class="img-responsive" alt="" aria-hidden="true"> 
+# TL;DR
+
+DuckDB is rapidly becoming an essential part of data practitioners' toolbox, 
finding use cases in data engineering, machine learning, and local analytics. 
In many cases DuckDB has been used to query and process data that has already 
been saved to storage (file-based or external database) by another process. 
Arrow Database Connectivity APIs enable high-throughput data processing using 
DuckDB as the engine.
+
+# How it started
+
+The company I work for is the leading digital out-of-home marketing platform, 
including a programmatic ad tech stack. For several years, my technical 
operations team was making use of logs emitted by the real-time programmatic 
auction system in the [Apache Avro](http://avro.apache.org/) format. Over time 
we've built an entire operations and analytics back end using this data. Avro 
files are row-based which is less than ideal for analytics at scale, in fact 
it's downright painful. So much so that I developed and contributed an Avro 
reader feature to the [Apache Arrow  Go](https://github.com/apache/arrow-go) 
library to be able to convert Avro files to parquet. This data pipeline is now 
humming along transforming hundreds of GB/day from Avro to Parquet.
+
+Since "any problem in computer science can be solved with another layer of 
indirection", the original system has grown layers (like an onion) and started 
to emit other logs, this time in [Apache Parquet](https://parquet.apache.org/) 
format...  
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/adbc-duckdb/muchrejoicing.gif" width="80%" 
class="img-responsive" alt="Figure 1: And there was much rejoicing">
+  <figcaption>Figure 1: A pseudo-medieval tapestry displaying intrepid data 
practitioners rejoicing due to a columnar data storage format.</figcaption>
+</figure> 
+As we learned in Shrek, onions are like ogres: they're green, they have layers 
and they make you cry, so this rejoicing was rather short-lived, as the 
mechanism chosen to emit the parquet files was rather inefficient:
+
+* the new onion-layer (ahem...system component) sends Protobuf encoded 
messages to Kafka topics  
+* a Kafka Connect cluster with the S3 sink connector consumes topics and saves 
the parquet files to object storage
+
+Due to the firehose of data the cluster size over time grew to \> 25 nodes and 
was producing thousands of small parquet files (13 MB or smaller) an hour. This 
led to ever-increasing query latency, in some cases breaking our tools due to 
query timeouts (aka [the Small Files 
Problem](https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/#h-the-small-files-problem)).
 Not to mention that running aggregations on the raw data in our data warehouse 
wasn't fast or cheap.
+
+# DuckDB to the rescue... I think
+
+I'd used DuckDB to process and analyse Parquet data so I knew it could do that 
very quickly. Then I came across this post on LinkedIn ([Real-Time Analytics 
using Kafka and 
DuckDB](https://www.linkedin.com/posts/shubham-dhal-349626ba_real-time-analytics-with-kafka-and-duckdb-activity-7258424841538555904-xfU6)),
 where someone has built a system for near-realtime analytics in Go using 
DuckDB.
+
+The slides listed DuckDB's limitations:  
+<img src="{{ site.baseurl }}/img/adbc-duckdb/duckdb.png" width="100%" 
class="img-responsive" alt="DuckDB limitations: Single Pod, *Data should fit in 
memory, *Low Query Concurrency, *Low Ingest Rate - *Solvable with some efforts" 
aria-hidden="true"> 
+The poster's solution batches data at the application layer managing to scale 
up ingestion 100x to \~20k inserts/second, noting that they thought that using 
the DuckDB Appender API could possibly increase this 10x. So, potentially 
\~200k inserts/second. Yayyyyy...  
+
+<figure style="text-align: center;">

Review Comment:
   Oh, in that case it's probably find. How about just removing "Figure 2" from 
the alt text (since this isn't a "real" figure)?



##########
_posts/2025-03-04-fast-streaming-inserts-in-duckdb-with-adbc.md:
##########
@@ -0,0 +1,201 @@
+---
+layout: post
+title: "Fast Streaming Inserts in DuckDB with ADBC"
+description: "ADBC enables high throughput insertion into DuckDB"
+date: "2025-03-04 00:00:00"
+author: loicalleyne
+categories: [application]
+image:
+  path: /img/adbc-duckdb/adbc-duckdb.png
+  height: 560
+  width: 1200
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Fast Streaming Inserts in DuckDB with ADBC
+
+<img src="{{ site.baseurl }}/img/adbc-duckdb/adbc-duckdb.png" width="100%" 
class="img-responsive" alt="" aria-hidden="true"> 
+# TL;DR
+
+DuckDB is rapidly becoming an essential part of data practitioners' toolbox, 
finding use cases in data engineering, machine learning, and local analytics. 
In many cases DuckDB has been used to query and process data that has already 
been saved to storage (file-based or external database) by another process. 
Arrow Database Connectivity APIs enable high-throughput data processing using 
DuckDB as the engine.
+
+# How it started
+
+The company I work for is the leading digital out-of-home marketing platform, 
including a programmatic ad tech stack. For several years, my technical 
operations team was making use of logs emitted by the real-time programmatic 
auction system in the [Apache Avro](http://avro.apache.org/) format. Over time 
we've built an entire operations and analytics back end using this data. Avro 
files are row-based which is less than ideal for analytics at scale, in fact 
it's downright painful. So much so that I developed and contributed an Avro 
reader feature to the [Apache Arrow  Go](https://github.com/apache/arrow-go) 
library to be able to convert Avro files to parquet. This data pipeline is now 
humming along transforming hundreds of GB/day from Avro to Parquet.
+
+Since "any problem in computer science can be solved with another layer of 
indirection", the original system has grown layers (like an onion) and started 
to emit other logs, this time in [Apache Parquet](https://parquet.apache.org/) 
format...  
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/adbc-duckdb/muchrejoicing.gif" width="80%" 
class="img-responsive" alt="Figure 1: And there was much rejoicing">
+  <figcaption>Figure 1: A pseudo-medieval tapestry displaying intrepid data 
practitioners rejoicing due to a columnar data storage format.</figcaption>
+</figure> 
+As we learned in Shrek, onions are like ogres: they're green, they have layers 
and they make you cry, so this rejoicing was rather short-lived, as the 
mechanism chosen to emit the parquet files was rather inefficient:
+
+* the new onion-layer (ahem...system component) sends Protobuf encoded 
messages to Kafka topics  
+* a Kafka Connect cluster with the S3 sink connector consumes topics and saves 
the parquet files to object storage
+
+Due to the firehose of data the cluster size over time grew to \> 25 nodes and 
was producing thousands of small parquet files (13 MB or smaller) an hour. This 
led to ever-increasing query latency, in some cases breaking our tools due to 
query timeouts (aka [the Small Files 
Problem](https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/#h-the-small-files-problem)).
 Not to mention that running aggregations on the raw data in our data warehouse 
wasn't fast or cheap.
+
+# DuckDB to the rescue... I think
+
+I'd used DuckDB to process and analyse Parquet data so I knew it could do that 
very quickly. Then I came across this post on LinkedIn ([Real-Time Analytics 
using Kafka and 
DuckDB](https://www.linkedin.com/posts/shubham-dhal-349626ba_real-time-analytics-with-kafka-and-duckdb-activity-7258424841538555904-xfU6)),
 where someone has built a system for near-realtime analytics in Go using 
DuckDB.
+
+The slides listed DuckDB's limitations:  
+<img src="{{ site.baseurl }}/img/adbc-duckdb/duckdb.png" width="100%" 
class="img-responsive" alt="DuckDB limitations: Single Pod, *Data should fit in 
memory, *Low Query Concurrency, *Low Ingest Rate - *Solvable with some efforts" 
aria-hidden="true"> 
+The poster's solution batches data at the application layer managing to scale 
up ingestion 100x to \~20k inserts/second, noting that they thought that using 
the DuckDB Appender API could possibly increase this 10x. So, potentially 
\~200k inserts/second. Yayyyyy...  
+
+<figure style="text-align: center;">

Review Comment:
   Oh, in that case it's probably fine. How about just removing "Figure 2" from 
the alt text (since this isn't a "real" figure)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] Blog post: Fast Streaming Inserts in DuckDB with ADBC [arrow-site]

Reply via email to