lidavidm commented on code in PR #248:
URL: https://github.com/apache/arrow-site/pull/248#discussion_r1055655147


##########
_posts/2022-12-31-arrow-adbc.md:
##########
@@ -0,0 +1,217 @@
+---
+layout: post
+title: "Introducing ADBC: Database Access for Apache Arrow"
+date: "2022-12-31 00:00:00"
+author: pmc
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Arrow community would like to introduce version 1.0.0 of the [Arrow Database Connectivity (ADBC)][adbc] specification.
+ADBC is a columnar, minimal-overhead alternative to JDBC/ODBC for analytical applications.
+Or in other words: **ADBC is a single API for getting Arrow data in and out of different databases**.
+
+## Motivation
+
+Applications often use API standards like [JDBC][jdbc] and [ODBC][odbc] to work with databases.
+That way, they can code to the same API regardless of the underlying database, saving on development time.
+Roughly speaking, when an application executes a query with these APIs:
+
+1. The application submits a SQL query via the JDBC/ODBC API.
+2. The query is passed on to the driver.
+3. The driver translates the query to a database-specific protocol and sends it to the database.
+4. The database executes the query and returns the result set in a database-specific format.
+5. The driver translates the result format into the JDBC/ODBC API.
+6. The application iterates over the result rows using the JDBC/ODBC API.
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/ADBCFlow1.svg" width="90%" class="img-responsive" alt="A diagram showing the query execution flow.">
+  <figcaption>The query execution flow.</figcaption>
+</figure>
+
+When columnar data comes into play, however, problems arise.
+JDBC is a row-oriented API, and while ODBC can support columnar data, its type system and data representation are not a perfect match with Arrow.
+So generally, columnar data must be converted to rows between steps 5 and 6, spending resources without performing "useful" work.
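
As a rough sketch of that row-oriented hand-off (using Python's built-in `sqlite3` module purely for illustration, since its DB-API interface is row-oriented in the same spirit as JDBC/ODBC), note the final pivot back into columns — exactly the conversion that spends resources without doing "useful" work:

```python
import sqlite3

# A row-oriented API: the driver hands the application one tuple (row) at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

rows = conn.execute("SELECT id, name FROM t").fetchall()

# A columnar consumer then has to pivot the rows back into columns itself.
ids, names = map(list, zip(*rows))
conn.close()
print(ids, names)  # [1, 2, 3] ['a', 'b', 'c']
```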
+
+This mismatch is problematic for columnar database systems, such as ClickHouse, Dremio, DuckDB, and Google BigQuery.
+On the client side, tools such as Apache Spark and pandas would be better off getting columnar data directly, skipping that conversion.
+Otherwise, they're leaving performance on the table.
+At the same time, that conversion isn't always avoidable.
+Row-oriented database systems like PostgreSQL aren't going away, and these clients will still want to consume data from them.
+
+Developers have a few options:
+
+- *Just use JDBC/ODBC*.
+  These standards are here to stay, and it makes sense for databases to support them for applications that want them.
+  But when both the database and the application are columnar, that means converting data into rows for JDBC/ODBC, only for the client to convert them right back into columns!
+  Performance suffers, and developers have to spend time implementing the conversions.
+- *Use JDBC/ODBC to Arrow conversion libraries*.
+  Libraries like [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc] handle row-to-columnar conversions for clients.
+  But this doesn't fundamentally solve the problem.
+  Unnecessary data conversions are still required.
+- *Use vendor-specific protocols*.
+  For some databases, applications can use a database-specific protocol or SDK to directly get Arrow data.
+  For example, applications could use Dremio via [Arrow Flight SQL][flight-sql].
+  But client applications that want to use multiple database vendors would need to integrate with each of them.
+  (Look at all the [connectors](https://trino.io/docs/current/connector.html) that Trino implements.)
+  And databases like PostgreSQL don't offer an option supporting Arrow in the first place.
+
+So in the status quo, clients must choose between tedious integration work and leaving performance on the table.
+
+## Introducing ADBC
+
+ADBC is an Arrow-based, vendor-neutral API for interacting with databases.
+Applications that use ADBC just get Arrow data.
+They don't have to do any conversions themselves, and they don't have to integrate each database's specific SDK.
+
+Just like JDBC/ODBC, underneath the ADBC API are drivers that translate the API for specific databases.
+
+* A driver for an Arrow-native database just passes Arrow data through without conversion.
+* A driver for a non-Arrow-native database must convert the data to Arrow.
+  This saves the application from doing that, and the driver can optimize the conversion for its database.
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/ADBCFlow2.svg" alt="A diagram showing the query execution flow with ADBC." width="90%" class="img-responsive">
+  <figcaption>The query execution flow with two different ADBC drivers.</figcaption>
+</figure>
+

Review Comment:
   Updated (diagram precedes list in both places)


