Repository: arrow-site
Updated Branches:
  refs/heads/asf-site 2316712e9 -> 22ab633ab


Add turbodbc guest blog post


Project: http://git-wip-us.apache.org/repos/asf/arrow-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow-site/commit/22ab633a
Tree: http://git-wip-us.apache.org/repos/asf/arrow-site/tree/22ab633a
Diff: http://git-wip-us.apache.org/repos/asf/arrow-site/diff/22ab633a

Branch: refs/heads/asf-site
Commit: 22ab633ab6352c8b38026cfe0e73fb8289d56a5e
Parents: 2316712
Author: Wes McKinney <wes.mckin...@twosigma.com>
Authored: Fri Jun 16 04:42:10 2017 -0400
Committer: Wes McKinney <wes.mckin...@twosigma.com>
Committed: Fri Jun 16 04:42:10 2017 -0400

----------------------------------------------------------------------
 blog/2017/06/16/turbodbc-arrow/index.html | 219 +++++++++++++++++++++++++
 blog/index.html                           | 111 +++++++++++++
 docs/ipc.html                             |  29 +---
 docs/memory_layout.html                   |  18 +-
 feed.xml                                  |  81 ++++++++-
 img/turbodbc_arrow.png                    | Bin 0 -> 75697 bytes
 6 files changed, 425 insertions(+), 33 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/blog/2017/06/16/turbodbc-arrow/index.html
----------------------------------------------------------------------
diff --git a/blog/2017/06/16/turbodbc-arrow/index.html 
b/blog/2017/06/16/turbodbc-arrow/index.html
new file mode 100644
index 0000000..84e3d43
--- /dev/null
+++ b/blog/2017/06/16/turbodbc-arrow/index.html
@@ -0,0 +1,219 @@
+<!DOCTYPE html>
+<html lang="en-US">
+  <head>
+    <meta charset="UTF-8">
+    <title>Connecting Relational Databases to the Apache Arrow World with 
turbodbc</title>
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <meta name="generator" content="Jekyll v3.4.3">
+    <!-- The above 3 meta tags *must* come first in the head; any other head 
content must come *after* these tags -->
+    <link rel="icon" type="image/x-icon" href="/favicon.ico">
+
+    <link rel="stylesheet" 
href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+    <link href="/css/main.css" rel="stylesheet">
+    <link href="/css/syntax.css" rel="stylesheet">
+    <script src="https://code.jquery.com/jquery-3.2.1.min.js";
+            integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4="
+            crossorigin="anonymous"></script>
+    <script src="/assets/javascripts/bootstrap.min.js"></script>
+  </head>
+
+
+
+<body class="wrap">
+  <div class="container">
+    <nav class="navbar navbar-default">
+  <div class="container-fluid">
+    <div class="navbar-header">
+      <button type="button" class="navbar-toggle" data-toggle="collapse" 
data-target="#arrow-navbar">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+      <a class="navbar-brand" href="/">Apache 
Arrow&#8482;&nbsp;&nbsp;&nbsp;</a>
+    </div>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="arrow-navbar">
+      <ul class="nav navbar-nav">
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Project Links<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/install/">Install</a></li>
+            <li><a href="/blog/">Blog</a></li>
+            <li><a href="/release/">Releases</a></li>
+            <li><a href="https://issues.apache.org/jira/browse/ARROW";>Issue 
Tracker</a></li>
+            <li><a href="https://github.com/apache/arrow";>Source Code</a></li>
+            <li><a 
href="http://mail-archives.apache.org/mod_mbox/arrow-dev/";>Mailing List</a></li>
+            <li><a href="https://apachearrowslackin.herokuapp.com";>Slack 
Channel</a></li>
+            <li><a href="/committers/">Committers</a></li>
+          </ul>
+        </li>
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Specification<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/memory_layout.html">Memory Layout</a></li>
+            <li><a href="/docs/metadata.html">Metadata</a></li>
+            <li><a href="/docs/ipc.html">Messaging / IPC</a></li>
+          </ul>
+        </li>
+
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Documentation<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/python">Python</a></li>
+            <li><a href="/docs/cpp">C++ API</a></li>
+            <li><a href="/docs/java">Java API</a></li>
+            <li><a href="/docs/c_glib">C GLib API</a></li>
+          </ul>
+        </li>
+        <!-- <li><a href="/blog">Blog</a></li> -->
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">ASF Links<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="http://www.apache.org/";>ASF Website</a></li>
+            <li><a href="http://www.apache.org/licenses/";>License</a></li>
+            <li><a 
href="http://www.apache.org/foundation/sponsorship.html";>Donate</a></li>
+            <li><a 
href="http://www.apache.org/foundation/thanks.html";>Thanks</a></li>
+            <li><a href="http://www.apache.org/security/";>Security</a></li>
+          </ul>
+        </li>
+      </ul>
+      <a href="http://www.apache.org/";>
+        <img style="float:right;" src="/img/asf_logo.svg" width="120px"/>
+      </a>
+      </div><!-- /.navbar-collapse -->
+    </div>
+  </nav>
+
+
+    <h2>
+      Connecting Relational Databases to the Apache Arrow World with turbodbc
+      <a href="/blog/2017/06/16/turbodbc-arrow/" class="permalink" 
title="Permalink">∞</a>
+    </h2>
+
+    
+
+    <div class="panel">
+      <div class="panel-body">
+        <div>
+          <span class="label label-default">Published</span>
+          <span class="published">
+            <i class="fa fa-calendar"></i>
+            16 Jun 2017
+          </span>
+        </div>
+        <div>
+          <span class="label label-default">By</span>
+          <a href="http://github.com/MathMagique";><i class="fa fa-user"></i> 
Michael König (MathMagique)</a>
+        </div>
+      </div>
+    </div>
+
+    <!--
+
+-->
+
+<p><em><a href="https://github.com/mathmagique";>Michael König</a> is the lead 
developer of the <a href="https://github.com/blue-yonder/turbodbc";>turbodbc 
project</a></em></p>
+
+<p>The <a href="https://arrow.apache.org/";>Apache Arrow</a> project set out to 
become the universal data layer for
+column-oriented data processing systems without incurring serialization costs
+or compromising on performance on a more general level. While relational
+databases still lag behind in Apache Arrow adoption, the Python database module
+<a href="https://github.com/blue-yonder/turbodbc";>turbodbc</a> brings Apache 
Arrow support to these databases using a much
+older, more specialized data exchange layer: <a 
href="https://en.wikipedia.org/wiki/Open_Database_Connectivity";>ODBC</a>.</p>
+
+<p>ODBC is a database interface that offers developers the option to transfer 
data
+either in row-wise or column-wise fashion. Previous Python ODBC modules 
typically
+use the row-wise approach, and often trade repeated database roundtrips for 
simplified
+buffer handling. This makes them less suited for data-intensive applications,
+particularly when interfacing with modern columnar analytical databases.</p>
+
+<p>In contrast, turbodbc was designed to leverage columnar data processing 
from day
+one. Naturally, this implies using the columnar portion of the ODBC API. 
Equally
+important, however, is to find new ways of providing columnar data to Python 
users
+that exceed the capabilities of the row-wise API mandated by Python’s <a 
href="https://www.python.org/dev/peps/pep-0249/";>PEP 249</a>.
+Turbodbc has adopted Apache Arrow for this very task with the recently released
+version 2.0.0:</p>
+
+<div class="language-python highlighter-rouge"><pre 
class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span 
class="kn">from</span> <span class="nn">turbodbc</span> <span 
class="kn">import</span> <span class="n">connect</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">connection</span> <span 
class="o">=</span> <span class="n">connect</span><span class="p">(</span><span 
class="n">dsn</span><span class="o">=</span><span class="s">"My columnar 
database"</span><span class="p">)</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span> <span 
class="o">=</span> <span class="n">connection</span><span 
class="o">.</span><span class="n">cursor</span><span class="p">()</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span><span 
class="o">.</span><span class="n">execute</span><span class="p">(</span><span 
class="s">"SELECT some_integers, some_strings FROM my_table"</span><span 
class="p">)</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span><span 
class="o">.</span><span class="n">fetchallarrow</span><span class="p">()</span>
+<span class="n">pyarrow</span><span class="o">.</span><span 
class="n">Table</span>
+<span class="n">some_integers</span><span class="p">:</span> <span 
class="n">int64</span>
+<span class="n">some_strings</span><span class="p">:</span> <span 
class="n">string</span>
+</code></pre>
+</div>
+
+<p>With this new addition, the data flow for a result set of a typical SELECT 
query
+is like this:</p>
+<ul>
+  <li>The database prepares the result set and exposes it to the ODBC driver 
using
+either row-wise or column-wise storage.</li>
+  <li>Turbodbc has the ODBC driver write chunks of the result set into 
columnar buffers.</li>
+  <li>These buffers are exposed to turbodbc’s Apache Arrow frontend. This 
frontend
+will create an Arrow table and fill in the buffered values.</li>
+  <li>The previous steps are repeated until the entire result set is 
retrieved.</li>
+</ul>
+
+<p><img src="/img/turbodbc_arrow.png" alt="Data flow from relational databases 
to Python with turbodbc and the Apache Arrow frontend" class="img-responsive" 
width="75%" /></p>
+
+<p>In practice, it is possible to achieve the following ideal situation: A 
64-bit integer
+column is stored as one contiguous block of memory in a columnar database. A 
huge chunk
+of 64-bit integers is transferred over the network and the ODBC driver 
directly writes
+it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates 
these values
+by copying the entire 64-bit buffer into a free portion of an Arrow table’s 
64-bit
+integer column.</p>
+
+<p>Moving data from the database to an Arrow table and, thus, providing it to 
the Python
+user can be as simple as copying memory blocks around, megabytes equivalent to 
hundreds of
+thousands of rows at a time. The absence of serialization and conversion logic 
renders
+the process extremely efficient.</p>
+
+<p>Once the data is stored in an Arrow table, Python users can continue to do 
some
+actual work. They can convert it into a <a 
href="https://arrow.apache.org/docs/python/pandas.html";>Pandas DataFrame</a> 
for data analysis
+(using a quick <code class="highlighter-rouge">table.to_pandas()</code>), pass 
it on to other data processing
+systems such as <a href="http://spark.apache.org/";>Apache Spark</a> or <a 
href="http://impala.apache.org/";>Apache Impala (incubating)</a>, or store
+it in the <a href="http://parquet.apache.org/";>Apache Parquet</a> file format. 
This way, non-Python systems are
+efficiently connected with relational databases.</p>
+
+<p>In the future, turbodbc’s Arrow support will be extended to use more
+sophisticated features such as <a 
href="https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding";>dictionary-encoded</a>
 string fields. We also
+plan to pick smaller than 64-bit <a 
href="https://arrow.apache.org/docs/metadata.html#integers";>data types</a> 
where possible. Last but not
+least, Arrow support will be extended to cover the reverse direction of data
+flow, so that Python users can quickly insert Arrow tables into relational
+databases.</p>
+
+<p>If you would like to learn more about turbodbc, check out the <a 
href="https://github.com/blue-yonder/turbodbc";>GitHub project</a> and the
+<a href="http://turbodbc.readthedocs.io/";>project documentation</a>. If you 
want to learn more about how turbodbc implements the
+nitty-gritty details, check out parts <a 
href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/";>one</a>
 and <a 
href="https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/";>two</a>
 of the
+<a 
href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/";>“Making
 of turbodbc”</a> series at <a href="https://tech.blue-yonder.com/";>Blue 
Yonder’s technology blog</a>.</p>
+
+
+
+    <hr/>
+<footer class="footer">
+  <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache 
Arrow project logo are either registered trademarks or trademarks of The Apache 
Software Foundation in the United States and other countries.</p>
+  <p>&copy; 2017 Apache Software Foundation</p>
+</footer>
+
+  </div>
+</body>
+</html>

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/blog/index.html
----------------------------------------------------------------------
diff --git a/blog/index.html b/blog/index.html
index 2102219..9b2c972 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -111,6 +111,117 @@
     
   <div class="container">
     <h2>
+      Connecting Relational Databases to the Apache Arrow World with turbodbc
+      <a href="/blog/2017/06/16/turbodbc-arrow/" class="permalink" 
title="Permalink">∞</a>
+    </h2>
+
+    
+
+    <div class="panel">
+      <div class="panel-body">
+        <div>
+          <span class="label label-default">Published</span>
+          <span class="published">
+            <i class="fa fa-calendar"></i>
+            16 Jun 2017
+          </span>
+        </div>
+        <div>
+          <span class="label label-default">By</span>
+          <a href="http://github.com/MathMagique";><i class="fa fa-user"></i> 
Michael König (MathMagique)</a>
+        </div>
+      </div>
+    </div>
+    <!--
+
+-->
+
+<p><em><a href="https://github.com/mathmagique";>Michael König</a> is the lead 
developer of the <a href="https://github.com/blue-yonder/turbodbc";>turbodbc 
project</a></em></p>
+
+<p>The <a href="https://arrow.apache.org/";>Apache Arrow</a> project set out to 
become the universal data layer for
+column-oriented data processing systems without incurring serialization costs
+or compromising on performance on a more general level. While relational
+databases still lag behind in Apache Arrow adoption, the Python database module
+<a href="https://github.com/blue-yonder/turbodbc";>turbodbc</a> brings Apache 
Arrow support to these databases using a much
+older, more specialized data exchange layer: <a 
href="https://en.wikipedia.org/wiki/Open_Database_Connectivity";>ODBC</a>.</p>
+
+<p>ODBC is a database interface that offers developers the option to transfer 
data
+either in row-wise or column-wise fashion. Previous Python ODBC modules 
typically
+use the row-wise approach, and often trade repeated database roundtrips for 
simplified
+buffer handling. This makes them less suited for data-intensive applications,
+particularly when interfacing with modern columnar analytical databases.</p>
+
+<p>In contrast, turbodbc was designed to leverage columnar data processing 
from day
+one. Naturally, this implies using the columnar portion of the ODBC API. 
Equally
+important, however, is to find new ways of providing columnar data to Python 
users
+that exceed the capabilities of the row-wise API mandated by Python’s <a 
href="https://www.python.org/dev/peps/pep-0249/";>PEP 249</a>.
+Turbodbc has adopted Apache Arrow for this very task with the recently released
+version 2.0.0:</p>
+
+<div class="language-python highlighter-rouge"><pre 
class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span 
class="kn">from</span> <span class="nn">turbodbc</span> <span 
class="kn">import</span> <span class="n">connect</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">connection</span> <span 
class="o">=</span> <span class="n">connect</span><span class="p">(</span><span 
class="n">dsn</span><span class="o">=</span><span class="s">"My columnar 
database"</span><span class="p">)</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span> <span 
class="o">=</span> <span class="n">connection</span><span 
class="o">.</span><span class="n">cursor</span><span class="p">()</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span><span 
class="o">.</span><span class="n">execute</span><span class="p">(</span><span 
class="s">"SELECT some_integers, some_strings FROM my_table"</span><span 
class="p">)</span>
+<span class="o">&gt;&gt;&gt;</span> <span class="n">cursor</span><span 
class="o">.</span><span class="n">fetchallarrow</span><span class="p">()</span>
+<span class="n">pyarrow</span><span class="o">.</span><span 
class="n">Table</span>
+<span class="n">some_integers</span><span class="p">:</span> <span 
class="n">int64</span>
+<span class="n">some_strings</span><span class="p">:</span> <span 
class="n">string</span>
+</code></pre>
+</div>
+
+<p>With this new addition, the data flow for a result set of a typical SELECT 
query
+is like this:</p>
+<ul>
+  <li>The database prepares the result set and exposes it to the ODBC driver 
using
+either row-wise or column-wise storage.</li>
+  <li>Turbodbc has the ODBC driver write chunks of the result set into 
columnar buffers.</li>
+  <li>These buffers are exposed to turbodbc’s Apache Arrow frontend. This 
frontend
+will create an Arrow table and fill in the buffered values.</li>
+  <li>The previous steps are repeated until the entire result set is 
retrieved.</li>
+</ul>
+
+<p><img src="/img/turbodbc_arrow.png" alt="Data flow from relational databases 
to Python with turbodbc and the Apache Arrow frontend" class="img-responsive" 
width="75%" /></p>
+
+<p>In practice, it is possible to achieve the following ideal situation: A 
64-bit integer
+column is stored as one contiguous block of memory in a columnar database. A 
huge chunk
+of 64-bit integers is transferred over the network and the ODBC driver 
directly writes
+it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates 
these values
+by copying the entire 64-bit buffer into a free portion of an Arrow table’s 
64-bit
+integer column.</p>
+
+<p>Moving data from the database to an Arrow table and, thus, providing it to 
the Python
+user can be as simple as copying memory blocks around, megabytes equivalent to 
hundreds of
+thousands of rows at a time. The absence of serialization and conversion logic 
renders
+the process extremely efficient.</p>
+
+<p>Once the data is stored in an Arrow table, Python users can continue to do 
some
+actual work. They can convert it into a <a 
href="https://arrow.apache.org/docs/python/pandas.html";>Pandas DataFrame</a> 
for data analysis
+(using a quick <code class="highlighter-rouge">table.to_pandas()</code>), pass 
it on to other data processing
+systems such as <a href="http://spark.apache.org/";>Apache Spark</a> or <a 
href="http://impala.apache.org/";>Apache Impala (incubating)</a>, or store
+it in the <a href="http://parquet.apache.org/";>Apache Parquet</a> file format. 
This way, non-Python systems are
+efficiently connected with relational databases.</p>
+
+<p>In the future, turbodbc’s Arrow support will be extended to use more
+sophisticated features such as <a 
href="https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding";>dictionary-encoded</a>
 string fields. We also
+plan to pick smaller than 64-bit <a 
href="https://arrow.apache.org/docs/metadata.html#integers";>data types</a> 
where possible. Last but not
+least, Arrow support will be extended to cover the reverse direction of data
+flow, so that Python users can quickly insert Arrow tables into relational
+databases.</p>
+
+<p>If you would like to learn more about turbodbc, check out the <a 
href="https://github.com/blue-yonder/turbodbc";>GitHub project</a> and the
+<a href="http://turbodbc.readthedocs.io/";>project documentation</a>. If you 
want to learn more about how turbodbc implements the
+nitty-gritty details, check out parts <a 
href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/";>one</a>
 and <a 
href="https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/";>two</a>
 of the
+<a 
href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/";>“Making
 of turbodbc”</a> series at <a href="https://tech.blue-yonder.com/";>Blue 
Yonder’s technology blog</a>.</p>
+
+
+  </div>
+
+  
+
+  
+    
+  <div class="container">
+    <h2>
       Apache Arrow 0.4.1 Release
       <a href="/blog/2017/06/14/0.4.1-release/" class="permalink" 
title="Permalink">∞</a>
     </h2>

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/docs/ipc.html
----------------------------------------------------------------------
diff --git a/docs/ipc.html b/docs/ipc.html
index ffbe491..9a0a246 100644
--- a/docs/ipc.html
+++ b/docs/ipc.html
@@ -194,11 +194,9 @@ as an <code class="highlighter-rouge">int32</code> or 
simply closing the stream
 <p>We define a “file format” supporting random access in a very similar 
format to
 the streaming format. The file starts and ends with a magic string <code 
class="highlighter-rouge">ARROW1</code>
 (plus padding). What follows in the file is identical to the stream format. At
-the end of the file, we write a <em>footer</em> containing a redundant copy of 
the
-schema (which is a part of the streaming format) plus memory offsets and sizes
-for each of the data blocks in the file. This enables random access any record
-batch in the file. See <a 
href="https://github.com/apache/arrow/blob/master/format/File.fbs";>format/File.fbs</a>
 for the precise details of the file
-footer.</p>
+the end of the file, we write a <em>footer</em> including offsets and sizes 
for each
+of the data blocks in the file, so that random access is possible. See
+<a 
href="https://github.com/apache/arrow/blob/master/format/File.fbs";>format/File.fbs</a>
 for the precise details of the file footer.</p>
 
 <p>Schematically we have:</p>
 
@@ -270,24 +268,9 @@ flatbuffer, and any padding bytes</li>
 
 <h3 id="dictionary-batches">Dictionary Batches</h3>
 
-<p>Dictionaries are written in the stream and file formats as a sequence of 
record
-batches, each having a single field. The complete semantic schema for a
-sequence of record batches, therefore, consists of the schema along with all of
-the dictionaries. The dictionary types are found in the schema, so it is
-necessary to read the schema to first determine the dictionary types so that
-the dictionaries can be properly interpreted.</p>
-
-<div class="highlighter-rouge"><pre class="highlight"><code>table 
DictionaryBatch {
-  id: long;
-  data: RecordBatch;
-}
-</code></pre>
-</div>
-
-<p>The dictionary <code class="highlighter-rouge">id</code> in the message 
metadata can be referenced one or more times
-in the schema, so that dictionaries can even be used for multiple fields. See
-the <a 
href="https://github.com/apache/arrow/blob/master/format/Layout.md";>Physical 
Layout</a> document for more about the semantics of
-dictionary-encoded data.</p>
+<p>Dictionary batches have not yet been implemented, though they are provided 
for
+in the metadata. For the time being, the <code 
class="highlighter-rouge">DICTIONARY</code> segments shown above in
+the file do not appear in any of the file implementations.</p>
 
 <h3 id="tensor-multi-dimensional-array-message-format">Tensor 
(Multi-dimensional Array) Message Format</h3>
 

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/docs/memory_layout.html
----------------------------------------------------------------------
diff --git a/docs/memory_layout.html b/docs/memory_layout.html
index 32c5b92..0f7f819 100644
--- a/docs/memory_layout.html
+++ b/docs/memory_layout.html
@@ -161,7 +161,7 @@ array of some array with a nested type.</li>
 <ul>
   <li>A physical memory layout enabling zero-deserialization data interchange
 amongst a variety of systems handling flat and nested columnar data, including
-such systems as Spark, Drill, Impala, Kudu, Ibis, ODBC protocols, and
+such systems as Spark, Drill, Impala, Kudu, Ibis, ODBC protocols, and
 proprietary systems that utilize the open source components.</li>
   <li>All array slots are accessible in constant time, with complexity growing
 linearly in the nesting level</li>
@@ -231,7 +231,7 @@ data-structures over 64 bytes (which will be a common case 
for Arrow Arrays).</l
 
 <p>Requiring padding to a multiple of 64 bytes allows for using <a 
href="https://software.intel.com/en-us/node/600110";>SIMD</a> instructions
 consistently in loops without additional conditional checks.
-This should allow for simpler and more efficient code.
+This should allow for simpler and more efficient code.<br />
 The specific padding length was chosen because it matches the largest known
 SIMD instruction registers available as of April 2016 (Intel AVX-512).
 Guaranteed padding can also allow certain compilers
@@ -265,7 +265,7 @@ signed integer, as it may be as large as the array 
length.</p>
 <p>Any relative type can have null value slots, whether primitive or nested 
type.</p>
 
 <p>An array with nulls must have a contiguous memory buffer, known as the null 
(or
-validity) bitmap, whose length is a multiple of 64 bytes (as discussed above)
+validity) bitmap, whose length is a multiple of 64 bytes (as discussed 
above)<br />
 and large enough to have at least 1 bit for each array
 slot.</p>
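A minimal sketch of the bitmap sizing and least-significant-bit addressing described here, in plain Python (illustrative only):

```python
# Validity bitmap: 1 bit per slot (LSB bit order), buffer padded to 64 bytes.
length = 100
bitmap_bytes = (length + 7) // 8            # 13 bytes hold 100 validity bits
padded = ((bitmap_bytes + 63) // 64) * 64   # padded up to a 64-byte multiple

def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if slot i is non-null according to the validity bitmap."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1
```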
 
@@ -322,7 +322,7 @@ does not need to be adjacent in memory to the values 
buffer.</p>
 
   |Byte 0 (validity bitmap) | Bytes 1-63            |
   |-------------------------|-----------------------|
-  | 00011011                | 0 (padding)           |
+  |00011011                 | 0 (padding)           |
 
 * Value Buffer:
 
@@ -497,16 +497,16 @@ primitive value array having Int32 logical 
type.</char></p>
 <div class="highlighter-rouge"><pre class="highlight"><code>* Length: 4, Null 
count: 1
 * Null bitmap buffer:
 
-  |Byte 0 (validity bitmap) | Bytes 1-63            |
-  |-------------------------|-----------------------|
-  | 00001011                | 0 (padding)           |
+  | Byte 0 (validity bitmap) | Bytes 1-7   | Bytes 8-63  |
+  |--------------------------|-------------|-------------|
+  | 00001011                 | 0 (padding) | unspecified |
 
 * Children arrays:
   * field-0 array (`List&lt;char&gt;`):
     * Length: 4, Null count: 1
     * Null bitmap buffer:
 
-      | Byte 0 (validity bitmap) | Bytes 1-63            |
+      | Byte 0 (validity bitmap) | Bytes 1-7             |
       |--------------------------|-----------------------|
       | 00001101                 | 0 (padding)           |
 
@@ -678,7 +678,7 @@ union, it has some advantages that may be desirable in 
certain use cases:</p>
 
       |Byte 0 (validity bitmap) | Bytes 1-63            |
       |-------------------------|-----------------------|
-      | 00001010                | 0 (padding)           |
+      |00001010                 | 0 (padding)           |
 
     * Value buffer:
 

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/feed.xml
----------------------------------------------------------------------
diff --git a/feed.xml b/feed.xml
index d3e09bd..d4b7a37 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,83 @@
-<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom"; ><generator uri="https://jekyllrb.com/"; 
version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" 
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" 
/><updated>2017-06-14T10:33:53-04:00</updated><id>/</id><entry><title 
type="html">Apache Arrow 0.4.1 Release</title><link 
href="/blog/2017/06/14/0.4.1-release/" rel="alternate" type="text/html" 
title="Apache Arrow 0.4.1 Release" 
/><published>2017-06-14T10:00:00-04:00</published><updated>2017-06-14T10:00:00-04:00</updated><id>/blog/2017/06/14/0.4.1-release</id><content
 type="html" xml:base="/blog/2017/06/14/0.4.1-release/">&lt;!--
+<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom"; ><generator uri="https://jekyllrb.com/"; 
version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" 
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" 
/><updated>2017-06-16T04:40:53-04:00</updated><id>/</id><entry><title 
type="html">Connecting Relational Databases to the Apache Arrow World with 
turbodbc</title><link href="/blog/2017/06/16/turbodbc-arrow/" rel="alternate" 
type="text/html" title="Connecting Relational Databases to the Apache Arrow 
World with turbodbc" 
/><published>2017-06-16T04:00:00-04:00</published><updated>2017-06-16T04:00:00-04:00</updated><id>/blog/2017/06/16/turbodbc-arrow</id><content
 type="html" xml:base="/blog/2017/06/16/turbodbc-arrow/">&lt;!--
+
+--&gt;
+
+&lt;p&gt;&lt;em&gt;&lt;a 
href=&quot;https://github.com/mathmagique&quot;&gt;Michael König&lt;/a&gt; is 
the lead developer of the &lt;a 
href=&quot;https://github.com/blue-yonder/turbodbc&quot;&gt;turbodbc 
project&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;The &lt;a href=&quot;https://arrow.apache.org/&quot;&gt;Apache 
Arrow&lt;/a&gt; project set out to become the universal data layer for
+column-oriented data processing systems without incurring serialization costs
+or compromising on performance on a more general level. While relational
+databases still lag behind in Apache Arrow adoption, the Python database module
+&lt;a 
href=&quot;https://github.com/blue-yonder/turbodbc&quot;&gt;turbodbc&lt;/a&gt; 
brings Apache Arrow support to these databases using a much
+older, more specialized data exchange layer: &lt;a 
href=&quot;https://en.wikipedia.org/wiki/Open_Database_Connectivity&quot;&gt;ODBC&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;ODBC is a database interface that offers developers the option to 
transfer data
+either in row-wise or column-wise fashion. Previous Python ODBC modules 
typically
+use the row-wise approach, and often trade repeated database roundtrips for 
simplified
+buffer handling. This makes them less suited for data-intensive applications,
+particularly when interfacing with modern columnar analytical 
databases.&lt;/p&gt;
+
+&lt;p&gt;In contrast, turbodbc was designed to leverage columnar data 
processing from day
+one. Naturally, this implies using the columnar portion of the ODBC API. 
Equally
+important, however, is to find new ways of providing columnar data to Python 
users
+that exceed the capabilities of the row-wise API mandated by Python’s &lt;a 
href=&quot;https://www.python.org/dev/peps/pep-0249/&quot;&gt;PEP 249&lt;/a&gt;.
+Turbodbc has adopted Apache Arrow for this very task with the recently released
+version 2.0.0:&lt;/p&gt;
+
+&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span 
class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span 
class=&quot;nn&quot;&gt;turbodbc&lt;/span&gt; &lt;span 
class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;connect&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;connection&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;dsn&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;My 
columnar database&quot;&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;cursor&lt;/span&gt; &lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;connection&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;cursor&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;()&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;cursor&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;execute&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;s&quot;&gt;&quot;SELECT some_integers, some_strings FROM 
my_table&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;cursor&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;fetchallarrow&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;()&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;pyarrow&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;Table&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;some_integers&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;int64&lt;/span&gt;
+&lt;span class=&quot;n&quot;&gt;some_strings&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;string&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;With this new addition, the data flow for a result set of a typical 
SELECT query
+looks like this:&lt;/p&gt;
+&lt;ul&gt;
+  &lt;li&gt;The database prepares the result set and exposes it to the ODBC 
driver using
+either row-wise or column-wise storage.&lt;/li&gt;
+  &lt;li&gt;Turbodbc has the ODBC driver write chunks of the result set into 
columnar buffers.&lt;/li&gt;
+  &lt;li&gt;These buffers are exposed to turbodbc’s Apache Arrow frontend. 
This frontend
+will create an Arrow table and fill in the buffered values.&lt;/li&gt;
+  &lt;li&gt;The previous steps are repeated until the entire result set is 
retrieved.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img src=&quot;/img/turbodbc_arrow.png&quot; alt=&quot;Data flow 
from relational databases to Python with turbodbc and the Apache Arrow 
frontend&quot; class=&quot;img-responsive&quot; width=&quot;75%&quot; 
/&gt;&lt;/p&gt;
+
+&lt;p&gt;In practice, it is possible to achieve the following ideal situation: 
A 64-bit integer
+column is stored as one contiguous block of memory in a columnar database. A 
huge chunk
+of 64-bit integers is transferred over the network and the ODBC driver 
directly writes
+it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates 
these values
+by copying the entire 64-bit buffer into a free portion of an Arrow table’s 
64-bit
+integer column.&lt;/p&gt;
+
+&lt;p&gt;Moving data from the database to an Arrow table and, thus, providing 
it to the Python
+user can be as simple as copying memory blocks around, megabytes equivalent to 
hundreds
+of thousands of rows at a time. The absence of serialization and conversion logic 
renders
+the process extremely efficient.&lt;/p&gt;
+
+&lt;p&gt;Once the data is stored in an Arrow table, Python users can continue 
to do some
+actual work. They can convert it into a &lt;a 
href=&quot;https://arrow.apache.org/docs/python/pandas.html&quot;&gt;Pandas 
DataFrame&lt;/a&gt; for data analysis
+(using a quick &lt;code 
class=&quot;highlighter-rouge&quot;&gt;table.to_pandas()&lt;/code&gt;), pass it 
on to other data processing
+systems such as &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache 
Spark&lt;/a&gt; or &lt;a href=&quot;http://impala.apache.org/&quot;&gt;Apache 
Impala (incubating)&lt;/a&gt;, or store
+it in the &lt;a href=&quot;http://parquet.apache.org/&quot;&gt;Apache 
Parquet&lt;/a&gt; file format. This way, non-Python systems are
+efficiently connected with relational databases.&lt;/p&gt;
+
+&lt;p&gt;In the future, turbodbc’s Arrow support will be extended to use more
+sophisticated features such as &lt;a 
href=&quot;https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding&quot;&gt;dictionary-encoded&lt;/a&gt;
 string fields. We also
+plan to pick smaller-than-64-bit &lt;a 
href=&quot;https://arrow.apache.org/docs/metadata.html#integers&quot;&gt;data 
types&lt;/a&gt; where possible. Last but not
+least, Arrow support will be extended to cover the reverse direction of data
+flow, so that Python users can quickly insert Arrow tables into relational
+databases.&lt;/p&gt;
+
+&lt;p&gt;If you would like to learn more about turbodbc, check out the &lt;a 
href=&quot;https://github.com/blue-yonder/turbodbc&quot;&gt;GitHub 
project&lt;/a&gt; and the
+&lt;a href=&quot;http://turbodbc.readthedocs.io/&quot;&gt;project 
documentation&lt;/a&gt;. If you want to learn more about how turbodbc 
implements the
+nitty-gritty details, check out parts &lt;a 
href=&quot;https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/&quot;&gt;one&lt;/a&gt;
 and &lt;a 
href=&quot;https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/&quot;&gt;two&lt;/a&gt;
 of the
+&lt;a 
href=&quot;https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/&quot;&gt;“Making
 of turbodbc”&lt;/a&gt; series at &lt;a 
href=&quot;https://tech.blue-yonder.com/&quot;&gt;Blue Yonder’s technology 
blog&lt;/a&gt;.&lt;/p&gt;</content><author><name>MathMagique</name></author></entry><entry><title
 type="html">Apache Arrow 0.4.1 Release</title><link 
href="/blog/2017/06/14/0.4.1-release/" rel="alternate" type="text/html" 
title="Apache Arrow 0.4.1 Release" 
/><published>2017-06-14T10:00:00-04:00</published><updated>2017-06-14T10:00:00-04:00</updated><id>/blog/2017/06/14/0.4.1-release</id><content
 type="html" xml:base="/blog/2017/06/14/0.4.1-release/">&lt;!--
 
 --&gt;
 

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/22ab633a/img/turbodbc_arrow.png
----------------------------------------------------------------------
diff --git a/img/turbodbc_arrow.png b/img/turbodbc_arrow.png
new file mode 100644
index 0000000..b534bf9
Binary files /dev/null and b/img/turbodbc_arrow.png differ
