This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

commit 13902323b710225388f2dc325a21b4ac3c552365
Author: Wes McKinney <[email protected]>
AuthorDate: Mon Sep 9 11:19:37 2019 -0500

    Deploy
---
 ...-manifest-1ad164066846f8f7cae0c5a8aa968bdc.json |   1 +
 blog/2019/08/08/r-package-on-cran/index.html       |  22 +-
 .../09/05/faster-strings-cpp-parquet/index.html    | 366 +++++++++++++++++++++
 blog/index.html                                    | 254 +++++++++++++-
 feed.xml                                           | 265 ++++++++++++---
 img/20190903-parquet-dictionary-column-chunk.png   | Bin 0 -> 40781 bytes
 img/20190903_parquet_read_perf.png                 | Bin 0 -> 37197 bytes
 img/20190903_parquet_write_perf.png                | Bin 0 -> 11605 bytes
 8 files changed, 836 insertions(+), 72 deletions(-)

diff --git a/assets/.sprockets-manifest-1ad164066846f8f7cae0c5a8aa968bdc.json 
b/assets/.sprockets-manifest-1ad164066846f8f7cae0c5a8aa968bdc.json
new file mode 100644
index 0000000..377f1e3
--- /dev/null
+++ b/assets/.sprockets-manifest-1ad164066846f8f7cae0c5a8aa968bdc.json
@@ -0,0 +1 @@
+{"files":{"main-8d2a359fd27a888246eb638b36a4e8b68ac65b9f11c48b9fac601fa0c9a7d796.js":{"logical_path":"main.js","mtime":"2019-08-13T09:48:49-04:00","size":124533,"digest":"8d2a359fd27a888246eb638b36a4e8b68ac65b9f11c48b9fac601fa0c9a7d796","integrity":"sha256-jSo1n9J6iIJG62OLNqTotorGW58RxIufrGAfoMmn15Y="}},"assets":{"main.js":"main-8d2a359fd27a888246eb638b36a4e8b68ac65b9f11c48b9fac601fa0c9a7d796.js"}}
\ No newline at end of file
diff --git a/blog/2019/08/08/r-package-on-cran/index.html 
b/blog/2019/08/08/r-package-on-cran/index.html
index e583201..0852f65 100644
--- a/blog/2019/08/08/r-package-on-cran/index.html
+++ b/blog/2019/08/08/r-package-on-cran/index.html
@@ -195,12 +195,12 @@ library.</p>
 
 <h2 id="parquet-files">Parquet files</h2>
 
-<p>This release introduces basic read and write support for the <a href="https://parquet.apache.org/">Apache
-Parquet</a> columnar data file format. Prior to this
-release, options for accessing Parquet data in R were limited; the most common
-recommendation was to use Apache Spark. The <code class="highlighter-rouge">arrow</code> package greatly simplifies
-this access and lets you go from a Parquet file to a <code class="highlighter-rouge">data.frame</code> and back
-easily, without having to set up a database.</p>
+<p>This package introduces basic read and write support for the <a href="https://parquet.apache.org/">Apache
+Parquet</a> columnar data file format. Prior to its
+availability, options for accessing Parquet data in R were limited; the most
+common recommendation was to use Apache Spark. The <code class="highlighter-rouge">arrow</code> package greatly
+simplifies this access and lets you go from a Parquet file to a <code class="highlighter-rouge">data.frame</code>
+and back easily, without having to set up a database.</p>
 
 <div class="language-r highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">library</span><span 
class="p">(</span><span class="n">arrow</span><span class="p">)</span><span 
class="w">
 </span><span class="n">df</span><span class="w"> </span><span 
class="o">&lt;-</span><span class="w"> </span><span 
class="n">read_parquet</span><span class="p">(</span><span 
class="s2">"path/to/file.parquet"</span><span class="p">)</span><span class="w">
@@ -236,7 +236,7 @@ future.</p>
 
 <h2 id="feather-files">Feather files</h2>
 
-<p>This release also includes a faster and more robust implementation of the
+<p>This package also includes a faster and more robust implementation of the
 Feather file format, providing <code 
class="highlighter-rouge">read_feather()</code> and
 <code class="highlighter-rouge">write_feather()</code>. <a href="https://github.com/wesm/feather">Feather</a> was one of the
 initial applications of Apache Arrow for Python and R, providing an efficient,
@@ -249,10 +249,10 @@ years, the Python implementation of Feather has just been 
a wrapper around
 <code class="highlighter-rouge">pyarrow</code>. This meant that as Arrow 
progressed and bugs were fixed, the Python
 version of Feather got the improvements but sadly R did not.</p>
 
-<p>With this release, the R implementation of Feather catches up and now depends
-on the same underlying C++ library as the Python version does. This should
-result in more reliable and consistent behavior across the two languages, as
-well as <a href="https://wesmckinney.com/blog/feather-arrow-future/">improved
+<p>With the <code class="highlighter-rouge">arrow</code> package, the R implementation of Feather catches up and now
+depends on the same underlying C++ library as the Python version does. This
+should result in more reliable and consistent behavior across the two
+languages, as well as <a href="https://wesmckinney.com/blog/feather-arrow-future/">improved
+performance</a>.</p>
 
 <p>We encourage all R users of <code class="highlighter-rouge">feather</code> 
to switch to using
diff --git a/blog/2019/09/05/faster-strings-cpp-parquet/index.html 
b/blog/2019/09/05/faster-strings-cpp-parquet/index.html
new file mode 100644
index 0000000..ca47598
--- /dev/null
+++ b/blog/2019/09/05/faster-strings-cpp-parquet/index.html
@@ -0,0 +1,366 @@
+<!DOCTYPE html>
+<html lang="en-US">
+  <head>
+    <meta charset="UTF-8">
+    <title>Apache Arrow Homepage</title>
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <meta name="generator" content="Jekyll v3.8.4">
+    <!-- The above 3 meta tags *must* come first in the head; any other head 
content must come *after* these tags -->
+    <link rel="icon" type="image/x-icon" href="/favicon.ico">
+
+    <link rel="stylesheet" 
href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+    <link href="/css/main.css" rel="stylesheet">
+    <link href="/css/syntax.css" rel="stylesheet">
+    <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js" integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49" crossorigin="anonymous"></script>
+    
+    <!-- Global Site Tag (gtag.js) - Google Analytics -->
+<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script>
+<script>
+  window.dataLayer = window.dataLayer || [];
+  function gtag(){dataLayer.push(arguments)};
+  gtag('js', new Date());
+
+  gtag('config', 'UA-107500873-1');
+</script>
+
+    
+  </head>
+
+
+<body class="wrap">
+  <header>
+    <nav class="navbar navbar-expand-md navbar-dark bg-dark">
+  <a class="navbar-brand" href="/"><img src="/img/arrow-inverse-300px.png" 
height="60px"/></a>
+  <button class="navbar-toggler" type="button" data-toggle="collapse" 
data-target="#arrow-navbar" aria-controls="arrow-navbar" aria-expanded="false" 
aria-label="Toggle navigation">
+    <span class="navbar-toggler-icon"></span>
+  </button>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="arrow-navbar">
+      <ul class="nav navbar-nav">
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownProjectLinks" role="button" 
data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Project Links
+          </a>
+          <div class="dropdown-menu" 
aria-labelledby="navbarDropdownProjectLinks">
+            <a class="dropdown-item" href="/install/">Installation</a>
+            <a class="dropdown-item" href="/release/">Releases</a>
+            <a class="dropdown-item" href="/faq/">FAQ</a>
+            <a class="dropdown-item" href="/blog/">Blog</a>
+            <a class="dropdown-item" href="https://github.com/apache/arrow">Source Code</a>
+            <a class="dropdown-item" href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a>
+          </div>
+        </li>
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownCommunity" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Community
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownCommunity">
+            <a class="dropdown-item" href="http://mail-archives.apache.org/mod_mbox/arrow-user/">User Mailing List</a>
+            <a class="dropdown-item" href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Dev Mailing List</a>
+            <a class="dropdown-item" href="https://cwiki.apache.org/confluence/display/ARROW">Developer Wiki</a>
+            <a class="dropdown-item" href="/committers/">Committers</a>
+            <a class="dropdown-item" href="/powered_by/">Powered By</a>
+          </div>
+        </li>
+        <li class="nav-item">
+          <a class="nav-link" href="/docs/format/README.html"
+             role="button" aria-haspopup="true" aria-expanded="false">
+             Specification
+          </a>
+        </li>
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownDocumentation" role="button" 
data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             Documentation
+          </a>
+          <div class="dropdown-menu" 
aria-labelledby="navbarDropdownDocumentation">
+            <a class="dropdown-item" href="/docs">Project Docs</a>
+            <a class="dropdown-item" href="/docs/python">Python</a>
+            <a class="dropdown-item" href="/docs/cpp">C++</a>
+            <a class="dropdown-item" href="/docs/java">Java</a>
+            <a class="dropdown-item" href="/docs/c_glib">C GLib</a>
+            <a class="dropdown-item" href="/docs/js">JavaScript</a>
+            <a class="dropdown-item" href="/docs/r">R</a>
+          </div>
+        </li>
+        <!-- <li><a href="/blog">Blog</a></li> -->
+        <li class="nav-item dropdown">
+          <a class="nav-link dropdown-toggle" href="#"
+             id="navbarDropdownASF" role="button" data-toggle="dropdown"
+             aria-haspopup="true" aria-expanded="false">
+             ASF Links
+          </a>
+          <div class="dropdown-menu" aria-labelledby="navbarDropdownASF">
+            <a class="dropdown-item" href="http://www.apache.org/">ASF Website</a>
+            <a class="dropdown-item" href="http://www.apache.org/licenses/">License</a>
+            <a class="dropdown-item" href="http://www.apache.org/foundation/sponsorship.html">Donate</a>
+            <a class="dropdown-item" href="http://www.apache.org/foundation/thanks.html">Thanks</a>
+            <a class="dropdown-item" href="http://www.apache.org/security/">Security</a>
+          </div>
+        </li>
+      </ul>
+      <div class="flex-row justify-content-end ml-md-auto">
+        <a class="d-sm-none d-md-inline pr-2" href="https://www.apache.org/events/current-event.html">
+          <img src="https://www.apache.org/events/current-event-234x60.png"/>
+        </a>
+        <a href="http://www.apache.org/";>
+          <img src="/img/asf_logo.svg" width="120px"/>
+        </a>
+      </div>
+      </div><!-- /.navbar-collapse -->
+    </div>
+  </nav>
+
+  </header>
+
+  <div class="container p-lg-4">
+    <main role="main">
+    
+    
+    
+<h1>
+  Faster C++ Apache Parquet performance on dictionary-encoded string data 
coming in Apache Arrow 0.15
+  <a href="/blog/2019/09/05/faster-strings-cpp-parquet/" class="permalink" 
title="Permalink">∞</a>
+</h1>
+
+
+
+<p>
+  <span class="badge badge-secondary">Published</span>
+  <span class="published">
+    05 Sep 2019
+  </span>
+  <br />
+  <span class="badge badge-secondary">By</span>
+  
+    Wes McKinney
+  
+</p>
+
+
+    <!--
+
+-->
+
+<p>We have been implementing a series of optimizations in the Apache Parquet 
C++
+internals to improve read and write efficiency (both performance and memory
+use) for Arrow columnar binary and string data, with new “native” support for
+Arrow’s dictionary types. This should have a big impact on users of the C++,
+MATLAB, Python, R, and Ruby interfaces to Parquet files.</p>
+
+<p>This post reviews work that was done and shows benchmarks comparing Arrow
+0.12.1 with the current development version (to be released soon as Arrow
+0.15.0).</p>
+
+<h1 id="summary-of-work">Summary of work</h1>
+
+<p>One of the largest and most complex optimizations involves encoding and
+decoding Parquet files’ internal dictionary-encoded data streams to and from
+Arrow’s in-memory dictionary-encoded <code 
class="highlighter-rouge">DictionaryArray</code>
+representation. Dictionary encoding is a compression strategy in Parquet, and
+there is no formal “dictionary” or “categorical” type. I will go into more
+detail about this below.</p>
+
+<p>Some of the particular JIRA issues related to this work include:</p>
+
+<ul>
+  <li>Vectorize comparators for computing statistics (<a href="https://issues.apache.org/jira/browse/PARQUET-1523">PARQUET-1523</a>)</li>
+  <li>Read binary data directly into dictionary builder (<a href="https://issues.apache.org/jira/browse/ARROW-3769">ARROW-3769</a>)</li>
+  <li>Write Parquet’s dictionary indices directly into dictionary builder (<a href="https://issues.apache.org/jira/browse/ARROW-3772">ARROW-3772</a>)</li>
+  <li>Write dense (non-dictionary) Arrow arrays directly into Parquet data encoders (<a href="https://issues.apache.org/jira/browse/ARROW-6152">ARROW-6152</a>)</li>
+  <li>Write <code class="highlighter-rouge">arrow::DictionaryArray</code> directly to Parquet column writers (<a href="https://issues.apache.org/jira/browse/ARROW-3246">ARROW-3246</a>)</li>
+  <li>Support changing dictionaries (<a href="https://issues.apache.org/jira/browse/ARROW-3144">ARROW-3144</a>)</li>
+  <li>Internal IO optimizations and improved raw <code class="highlighter-rouge">BYTE_ARRAY</code> encoding performance (<a href="https://issues.apache.org/jira/browse/ARROW-4398">ARROW-4398</a>)</li>
+</ul>
+
+<p>One of the challenges of developing the Parquet C++ library is that we 
maintain
+low-level read and write APIs that do not involve the Arrow columnar data
+structures. So we have had to take care to implement Arrow-related
+optimizations without impacting non-Arrow Parquet users, which include
+database systems like ClickHouse and Vertica.</p>
+
+<h1 id="background-how-parquet-files-do-dictionary-encoding">Background: how 
Parquet files do dictionary encoding</h1>
+
+<p>Many direct and indirect users of Apache Arrow use dictionary encoding to
+improve performance and memory use on binary or string data types that include
+many repeated values. MATLAB or pandas users will know this as the Categorical
+type (see <a href="https://www.mathworks.com/help/matlab/categorical-arrays.html">MATLAB docs</a> or <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html">pandas docs</a>), while in R such encoding is
+known as <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html"><code class="highlighter-rouge">factor</code></a>. In the Arrow C++ library and various bindings we have
+the <code class="highlighter-rouge">DictionaryArray</code> object for representing such data in memory.</p>
+
+<p>For example, an array such as</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>['apple', 'orange', 'apple', NULL, 'orange', 'orange']
+</code></pre></div></div>
+
+<p>has dictionary-encoded form</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>dictionary: ['apple', 'orange']
+indices: [0, 1, 0, NULL, 1, 1]
+</code></pre></div></div>
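As a minimal illustration (not Arrow's implementation; the helper name is invented here), the mapping from raw values to a dictionary plus indices can be sketched in a few lines of Python:

```python
def dictionary_encode(values):
    """Illustrative sketch: map values to a dictionary of unique values plus
    integer indices; None stays None, mirroring Arrow's null handling."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v is None:
            indices.append(None)
        else:
            if v not in index_of:
                index_of[v] = len(dictionary)
                dictionary.append(v)
            indices.append(index_of[v])
    return dictionary, indices

dictionary, indices = dictionary_encode(
    ['apple', 'orange', 'apple', None, 'orange', 'orange'])
# dictionary -> ['apple', 'orange']
# indices    -> [0, 1, 0, None, 1, 1]
```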
+
+<p>The <a href="https://github.com/apache/parquet-format/blob/master/Encodings.md">Parquet format uses dictionary encoding</a> to compress data, and it is
+used for all Parquet data types, not just binary or string data. Parquet
+further uses bit-packing and run-length encoding (RLE) to compress the
+dictionary indices, so if you had data like</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>['apple', 'apple', 'apple', 'apple', 'apple', 'apple', 
'orange']
+</code></pre></div></div>
+
+<p>the indices would be encoded like</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>[rle-run=(6, 0),
+ bit-packed-run=[1]]
+</code></pre></div></div>
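For illustration only, a pure-RLE sketch of how runs of identical indices collapse; Parquet's actual hybrid encoder chooses between RLE runs and bit-packed runs, whereas this simplification emits only run counts:

```python
def rle_runs(indices):
    """Collapse consecutive equal dictionary indices into (count, value) runs.
    Simplified: real Parquet mixes RLE runs with bit-packed runs."""
    runs = []
    for idx in indices:
        if runs and runs[-1][1] == idx:
            runs[-1] = (runs[-1][0] + 1, idx)
        else:
            runs.append((1, idx))
    return runs

# Indices for ['apple'] * 6 + ['orange'] with dictionary ['apple', 'orange']:
rle_runs([0, 0, 0, 0, 0, 0, 1])  # -> [(6, 0), (1, 1)]
```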
+
+<p>The full details of the RLE and bit-packing encodings are found in the <a href="https://github.com/apache/parquet-format/blob/master/Encodings.md">Parquet
+specification</a>.</p>
+
+<p>When writing a Parquet file, most implementations will use dictionary 
encoding
+to compress a column until the dictionary itself reaches a certain size
+threshold, usually around 1 megabyte. At this point, the column writer will
+“fall back” to <code class="highlighter-rouge">PLAIN</code> encoding where 
values are written end-to-end in “data
+pages” and then usually compressed with Snappy or Gzip. See the following rough
+diagram:</p>
+
+<div align="center">
+<img src="/img/20190903-parquet-dictionary-column-chunk.png" alt="Internal 
ColumnChunk structure" width="80%" class="img-responsive" />
+</div>
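The fall-back behavior can be sketched as follows. This is a deliberately simplified model with hypothetical names, not a real writer; actual implementations track the encoded dictionary-page size rather than raw string lengths:

```python
def encode_column_chunk(values, limit=1 << 20):
    """Simplified model of dictionary fall-back when writing a ColumnChunk:
    dictionary-encode values until the dictionary exceeds `limit` bytes,
    then emit all remaining values as dense (PLAIN) data."""
    dictionary, index_of = [], {}
    indices, plain = [], []
    dict_bytes, fell_back = 0, False
    for v in values:
        if not fell_back and v not in index_of:
            if dict_bytes + len(v) > limit:
                fell_back = True  # dictionary too big: switch to PLAIN
            else:
                index_of[v] = len(dictionary)
                dictionary.append(v)
                dict_bytes += len(v)
        if fell_back:
            plain.append(v)
        else:
            indices.append(index_of[v])
    return dictionary, indices, plain
```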
+
+<h1 id="faster-reading-and-writing-of-dictionary-encoded-data">Faster reading 
and writing of dictionary-encoded data</h1>
+
+<p>When reading a Parquet file, the dictionary-encoded portions are usually
+materialized to their non-dictionary-encoded form, causing binary or string
+values to be duplicated in memory. So an obvious (but not trivial) optimization
+is to skip this “dense” materialization. There are several issues to deal 
with:</p>
+
+<ul>
+  <li>A Parquet file often contains multiple ColumnChunks for each semantic 
column,
+and the dictionary values may be different in each ColumnChunk</li>
+  <li>We must gracefully handle the “fall back” portion which is not
+dictionary-encoded</li>
+</ul>
+
+<p>We pursued several avenues to help with this:</p>
+
+<ul>
+  <li>Allowing each <code class="highlighter-rouge">DictionaryArray</code> to 
have a different dictionary (before, the
+dictionary was part of the <code 
class="highlighter-rouge">DictionaryType</code>, which caused problems)</li>
+  <li>We enabled the Parquet dictionary indices to be directly written into an
+Arrow <code class="highlighter-rouge">DictionaryBuilder</code> without 
rehashing the data</li>
+  <li>When decoding a ColumnChunk, we first append the dictionary values and
+indices into an Arrow <code 
class="highlighter-rouge">DictionaryBuilder</code>, and when we encounter the 
“fall
+back” portion we use a hash table to convert those values to
+dictionary-encoded form</li>
+  <li>We override the “fall back” logic when writing a ColumnChunk from a
+<code class="highlighter-rouge">DictionaryArray</code> so that reading such data back is more efficient</li>
+</ul>
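Putting the read-side pieces together, a rough Python model of decoding a ColumnChunk directly to dictionary form (a hypothetical function, not the C++ API):

```python
def decode_chunk_to_dictionary(chunk_dictionary, chunk_indices, fallback_values):
    """Rough model of the strategy above: reuse the chunk's dictionary and
    indices as-is, then hash the non-dictionary 'fall back' values into the
    same dictionary instead of materializing everything densely."""
    dictionary = list(chunk_dictionary)
    index_of = {v: i for i, v in enumerate(dictionary)}
    indices = list(chunk_indices)
    for v in fallback_values:  # PLAIN-encoded tail of the ColumnChunk
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices
```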
+
+<p>All of these things together have produced some excellent performance 
results
+that we will detail below.</p>
+
+<p>The other class of optimizations we implemented was removing an abstraction
+layer between the low-level Parquet column data encoder and decoder classes and
+the Arrow columnar data structures. This involves:</p>
+
+<ul>
+  <li>Adding <code class="highlighter-rouge">ColumnWriter::WriteArrow</code> 
and <code class="highlighter-rouge">Encoder::Put</code> methods that accept
+<code class="highlighter-rouge">arrow::Array</code> objects directly</li>
+  <li>Adding a <code class="highlighter-rouge">ByteArrayDecoder::DecodeArrow</code> method to decode binary data directly
+into an <code class="highlighter-rouge">arrow::BinaryBuilder</code></li>
+</ul>
+
+<p>While the performance improvements from this work are less dramatic than for
+dictionary-encoded data, they are still meaningful in real-world 
applications.</p>
+
+<h1 id="performance-benchmarks">Performance Benchmarks</h1>
+
+<p>We ran some benchmarks comparing Arrow 0.12.1 with the current master
+branch. We construct two kinds of Arrow tables with 10 columns each:</p>
+
+<ul>
+  <li>“Low cardinality” and “high cardinality” variants. The “low cardinality” case
+has 1,000 unique string values of 32 bytes each; the “high cardinality” case has
+100,000 unique values</li>
+  <li>“Dense” (non-dictionary) and “Dictionary” variants</li>
+</ul>
+
+<p><a href="https://gist.github.com/wesm/b4554e2d6028243a30eeed2c644a9066">See the full benchmark script.</a></p>
+
+<p>We show both single-threaded and multithreaded read performance. The test
+machine is an Intel i9-9960X using gcc 8.3.0 (on Ubuntu 18.04) with 16 physical
+cores and 32 virtual cores. All time measurements are reported in seconds, but
+we are most interested in showing the relative performance.</p>
+
+<p>First, the writing benchmarks:</p>
+
+<div align="center">
+<img src="/img/20190903_parquet_write_perf.png" alt="Parquet write benchmarks" 
width="80%" class="img-responsive" />
+</div>
+
+<p>Writing <code class="highlighter-rouge">DictionaryArray</code> is 
dramatically faster due to the optimizations
+described above. We have achieved a small improvement in writing dense
+(non-dictionary) binary arrays.</p>
+
+<p>Then, the reading benchmarks:</p>
+
+<div align="center">
+<img src="/img/20190903_parquet_read_perf.png" alt="Parquet read benchmarks" 
width="80%" class="img-responsive" />
+</div>
+
+<p>Here, similarly, reading <code class="highlighter-rouge">DictionaryArray</code> directly is many times faster.</p>
+
+<p>These benchmarks show that parallel reads of dense binary data may be slightly
+slower, though single-threaded reads are now faster. We may want to do some
+profiling and see what we can do to bring read performance back in
+line. Optimizing the dense read path has not been too much of a priority
+relative to the dictionary read path in this work.</p>
+
+<h1 id="memory-use-improvements">Memory Use Improvements</h1>
+
+<p>In addition to faster performance, reading columns as dictionary-encoded can
+yield significantly less memory use.</p>
+
+<p>In the <code class="highlighter-rouge">dict-random</code> case above, we 
found that the master branch uses 405 MB of
+RAM at peak while loading a 152 MB dataset. In v0.12.1, loading the same
+Parquet file without the accelerated dictionary support uses 1.94 GB of peak
+memory while the resulting non-dictionary table occupies 1.01 GB.</p>
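For scale, those figures work out to roughly a 4.9x reduction in peak memory, a back-of-the-envelope check treating 1.94 GB as 1.94 * 1024 MB:

```python
# Peak RSS while loading the same 152 MB Parquet file (figures from the text):
peak_master_mb = 405          # master branch, dictionary-preserving read
peak_v0121_mb = 1.94 * 1024   # v0.12.1, dense materialization
ratio = peak_v0121_mb / peak_master_mb
print(round(ratio, 1))  # -> 4.9
```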
+
+<p>Note that versions 0.14.0 and 0.14.1 had a memory overuse bug, fixed in
+ARROW-6060; if you are hitting this bug, you will want to upgrade to 0.15.0
+as soon as it comes out.</p>
+
+<h1 id="conclusion">Conclusion</h1>
+
+<p>There are still many Parquet-related optimizations that we may pursue in the
+future, but the ones here can be very helpful to people working with
+string-heavy datasets, both in performance and memory use. If you’d like to
+discuss this development work, we’d be glad to hear from you on our developer
+mailing list [email protected].</p>
+
+
+    </main>
+
+    <hr/>
+<footer class="footer">
+  <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache 
Arrow project logo are either registered trademarks or trademarks of The Apache 
Software Foundation in the United States and other countries.</p>
+  <p>&copy; 2016-2019 The Apache Software Foundation</p>
+  <script type="text/javascript" 
src="/assets/main-8d2a359fd27a888246eb638b36a4e8b68ac65b9f11c48b9fac601fa0c9a7d796.js"
 integrity="sha256-jSo1n9J6iIJG62OLNqTotorGW58RxIufrGAfoMmn15Y=" 
crossorigin="anonymous"></script>
+</footer>
+
+  </div>
+</body>
+</html>
diff --git a/blog/index.html b/blog/index.html
index 022bd6e..e5cae10 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -135,6 +135,238 @@
   <div class="blog-post" style="margin-bottom: 4rem">
     
 <h1>
+  Faster C++ Apache Parquet performance on dictionary-encoded string data 
coming in Apache Arrow 0.15
+  <a href="/blog/2019/09/05/faster-strings-cpp-parquet/" class="permalink" 
title="Permalink">∞</a>
+</h1>
+
+
+
+<p>
+  <span class="badge badge-secondary">Published</span>
+  <span class="published">
+    05 Sep 2019
+  </span>
+  <br />
+  <span class="badge badge-secondary">By</span>
+  
+    Wes McKinney
+  
+</p>
+
+    <!--
+
+-->
+
+<p>We have been implementing a series of optimizations in the Apache Parquet 
C++
+internals to improve read and write efficiency (both performance and memory
+use) for Arrow columnar binary and string data, with new “native” support for
+Arrow’s dictionary types. This should have a big impact on users of the C++,
+MATLAB, Python, R, and Ruby interfaces to Parquet files.</p>
+
+<p>This post reviews work that was done and shows benchmarks comparing Arrow
+0.12.1 with the current development version (to be released soon as Arrow
+0.15.0).</p>
+
+<h1 id="summary-of-work">Summary of work</h1>
+
+<p>One of the largest and most complex optimizations involves encoding and
+decoding Parquet files’ internal dictionary-encoded data streams to and from
+Arrow’s in-memory dictionary-encoded <code 
class="highlighter-rouge">DictionaryArray</code>
+representation. Dictionary encoding is a compression strategy in Parquet, and
+there is no formal “dictionary” or “categorical” type. I will go into more
+detail about this below.</p>
+
+<p>Some of the particular JIRA issues related to this work include:</p>
+
+<ul>
+  <li>Vectorize comparators for computing statistics (<a href="https://issues.apache.org/jira/browse/PARQUET-1523">PARQUET-1523</a>)</li>
+  <li>Read binary data directly into dictionary builder (<a href="https://issues.apache.org/jira/browse/ARROW-3769">ARROW-3769</a>)</li>
+  <li>Write Parquet’s dictionary indices directly into dictionary builder (<a href="https://issues.apache.org/jira/browse/ARROW-3772">ARROW-3772</a>)</li>
+  <li>Write dense (non-dictionary) Arrow arrays directly into Parquet data encoders (<a href="https://issues.apache.org/jira/browse/ARROW-6152">ARROW-6152</a>)</li>
+  <li>Write <code class="highlighter-rouge">arrow::DictionaryArray</code> directly to Parquet column writers (<a href="https://issues.apache.org/jira/browse/ARROW-3246">ARROW-3246</a>)</li>
+  <li>Support changing dictionaries (<a href="https://issues.apache.org/jira/browse/ARROW-3144">ARROW-3144</a>)</li>
+  <li>Internal IO optimizations and improved raw <code class="highlighter-rouge">BYTE_ARRAY</code> encoding performance (<a href="https://issues.apache.org/jira/browse/ARROW-4398">ARROW-4398</a>)</li>
+</ul>
+
+<p>One of the challenges of developing the Parquet C++ library is that we 
maintain
+low-level read and write APIs that do not involve the Arrow columnar data
+structures. So we have had to take care to implement Arrow-related
+optimizations without impacting non-Arrow Parquet users, which include
+database systems like ClickHouse and Vertica.</p>
+
+<h1 id="background-how-parquet-files-do-dictionary-encoding">Background: how 
Parquet files do dictionary encoding</h1>
+
+<p>Many direct and indirect users of Apache Arrow use dictionary encoding to
+improve performance and memory use on binary or string data types that include
+many repeated values. MATLAB or pandas users will know this as the Categorical
+type (see <a href="https://www.mathworks.com/help/matlab/categorical-arrays.html">MATLAB docs</a> or <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html">pandas docs</a>), while in R such encoding is
+known as <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html"><code class="highlighter-rouge">factor</code></a>. In the Arrow C++ library and various bindings we have
+the <code class="highlighter-rouge">DictionaryArray</code> object for representing such data in memory.</p>
+
+<p>For example, an array such as</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>['apple', 'orange', 'apple', NULL, 'orange', 'orange']
+</code></pre></div></div>
+
+<p>has dictionary-encoded form</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>dictionary: ['apple', 'orange']
+indices: [0, 1, 0, NULL, 1, 1]
+</code></pre></div></div>
+
+<p>The <a href="https://github.com/apache/parquet-format/blob/master/Encodings.md">Parquet format uses dictionary encoding</a> to compress data, and it is
+used for all Parquet data types, not just binary or string data. Parquet
+further uses bit-packing and run-length encoding (RLE) to compress the
+dictionary indices, so if you had data like</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>['apple', 'apple', 'apple', 'apple', 'apple', 'apple', 
'orange']
+</code></pre></div></div>
+
+<p>the indices would be encoded like</p>
+
+<div class="highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>[rle-run=(6, 0),
+ bit-packed-run=[1]]
+</code></pre></div></div>
+
+<p>The full details of the RLE and bit-packing encodings are found in the <a href="https://github.com/apache/parquet-format/blob/master/Encodings.md">Parquet
+specification</a>.</p>
+
+<p>When writing a Parquet file, most implementations will use dictionary 
encoding
+to compress a column until the dictionary itself reaches a certain size
+threshold, usually around 1 megabyte. At this point, the column writer will
+“fall back” to <code class="highlighter-rouge">PLAIN</code> encoding where 
values are written end-to-end in “data
+pages” and then usually compressed with Snappy or Gzip. See the following rough
+diagram:</p>
+
+<div align="center">
+<img src="/img/20190903-parquet-dictionary-column-chunk.png" alt="Internal 
ColumnChunk structure" width="80%" class="img-responsive" />
+</div>
+
+<h1 id="faster-reading-and-writing-of-dictionary-encoded-data">Faster reading 
and writing of dictionary-encoded data</h1>
+
+<p>When reading a Parquet file, the dictionary-encoded portions are usually
+materialized to their non-dictionary-encoded form, causing binary or string
+values to be duplicated in memory. So an obvious (but not trivial) optimization
+is to skip this “dense” materialization. There are several issues to deal 
with:</p>
+
+<ul>
+  <li>A Parquet file often contains multiple ColumnChunks for each semantic 
column,
+and the dictionary values may be different in each ColumnChunk</li>
+  <li>We must gracefully handle the “fall back” portion which is not
+dictionary-encoded</li>
+</ul>
+
+<p>We pursued several avenues to help with this:</p>
+
+<ul>
+  <li>Allowing each <code class="highlighter-rouge">DictionaryArray</code> to 
have a different dictionary (before, the
+dictionary was part of the <code 
class="highlighter-rouge">DictionaryType</code>, which caused problems)</li>
+  <li>We enabled the Parquet dictionary indices to be directly written into an
+Arrow <code class="highlighter-rouge">DictionaryBuilder</code> without 
rehashing the data</li>
+  <li>When decoding a ColumnChunk, we first append the dictionary values and
+indices into an Arrow <code 
class="highlighter-rouge">DictionaryBuilder</code>, and when we encounter the 
“fall
+back” portion we use a hash table to convert those values to
+dictionary-encoded form</li>
+  <li>We override the “fall back” logic when writing a ColumnChunk from a
+<code class="highlighter-rouge">DictionaryArray</code> so that reading such data back is more efficient</li>
+</ul>
+
+<p>All of these things together have produced some excellent performance 
results
+that we will detail below.</p>
+
+<p>The other class of optimizations we implemented was removing an abstraction
+layer between the low-level Parquet column data encoder and decoder classes and
+the Arrow columnar data structures. This involves:</p>
+
+<ul>
+  <li>Adding <code class="highlighter-rouge">ColumnWriter::WriteArrow</code> 
and <code class="highlighter-rouge">Encoder::Put</code> methods that accept
+<code class="highlighter-rouge">arrow::Array</code> objects directly</li>
+  <li>Adding <code 
class="highlighter-rouge">ByteArrayDecoder::DecodeArrow</code> method to decode 
binary data directly
+into an <code class="highlighter-rouge">arrow::BinaryBuilder</code>.</li>
+</ul>
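To give intuition for what a `DecodeArrow`-style path avoids: Arrow binary arrays store all values in one contiguous data buffer plus an offsets buffer, so a decoder can append each Parquet byte array straight into those buffers rather than materializing per-value intermediate objects. A simplified Python sketch (the real classes are C++; Parquet's PLAIN `BYTE_ARRAY` encoding prefixes each value with a 4-byte little-endian length):

```python
import struct

def decode_byte_arrays(encoded: bytes):
    """Decode length-prefixed byte arrays (4-byte little-endian length +
    payload) directly into Arrow-style offsets + data buffers."""
    offsets = [0]
    data = bytearray()
    pos = 0
    while pos < len(encoded):
        (length,) = struct.unpack_from('<I', encoded, pos)
        pos += 4
        data += encoded[pos:pos + length]   # append payload in place
        pos += length
        offsets.append(len(data))
    return offsets, bytes(data)

offsets, data = decode_byte_arrays(
    b'\x05\x00\x00\x00apple\x06\x00\x00\x00orange')
# offsets == [0, 5, 11], data == b'appleorange'
```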
+
+<p>While the performance improvements from this work are less dramatic than for
+dictionary-encoded data, they are still meaningful in real-world 
applications.</p>
+
+<h1 id="performance-benchmarks">Performance Benchmarks</h1>
+
+<p>We ran some benchmarks comparing Arrow 0.12.1 with the current master
+branch. We construct two kinds of Arrow tables with 10 columns each:</p>
+
+<ul>
+  <li>“Low cardinality” and “high cardinality” variants. The “low cardinality” case
+has 1,000 unique string values of 32 bytes each; the “high cardinality” case has
+100,000 unique values</li>
+  <li>“Dense” (non-dictionary) and “Dictionary” variants</li>
+</ul>
+
+<p><a href="https://gist.github.com/wesm/b4554e2d6028243a30eeed2c644a9066";>See 
the full benchmark script.</a></p>
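For intuition, test columns of this shape can be generated with a sketch like the following (illustrative only; the linked gist contains the actual benchmark code, and the value counts are scaled down here):

```python
import random
import string

def make_column(num_values=10_000, num_unique=1000, value_size=32, seed=0):
    """Build a string column by sampling from a pool of unique values,
    as in the 'low cardinality' (1,000 unique) and 'high cardinality'
    (100,000 unique) cases."""
    rng = random.Random(seed)
    pool = [''.join(rng.choices(string.ascii_letters, k=value_size))
            for _ in range(num_unique)]
    return [rng.choice(pool) for _ in range(num_values)]

low_card = make_column(num_values=10_000, num_unique=1000)
# at most 1,000 distinct 32-byte strings repeated across 10,000 rows
```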
+
+<p>We show both single-threaded and multithreaded read performance. The test
+machine is an Intel i9-9960X using gcc 8.3.0 (on Ubuntu 18.04) with 16 physical
+cores and 32 virtual cores. All time measurements are reported in seconds, but
+we are most interested in showing the relative performance.</p>
+
+<p>First, the writing benchmarks:</p>
+
+<div align="center">
+<img src="/img/20190903_parquet_write_perf.png" alt="Parquet write benchmarks" 
width="80%" class="img-responsive" />
+</div>
+
+<p>Writing <code class="highlighter-rouge">DictionaryArray</code> is 
dramatically faster due to the optimizations
+described above. We have achieved a small improvement in writing dense
+(non-dictionary) binary arrays.</p>
+
+<p>Then, the reading benchmarks:</p>
+
+<div align="center">
+<img src="/img/20190903_parquet_read_perf.png" alt="Parquet read benchmarks" 
width="80%" class="img-responsive" />
+</div>
+
+<p>Here, similarly, reading <code class="highlighter-rouge">DictionaryArray</code> directly is many times faster.</p>
+
+<p>These benchmarks show that parallel reads of dense binary data may be slightly
+slower, though single-threaded reads are now faster. We may want to do some
+profiling to see how to bring parallel read performance back in line. Optimizing
+the dense read path has not been as high a priority as the dictionary read path
+in this work.</p>
+
+<h1 id="memory-use-improvements">Memory Use Improvements</h1>
+
+<p>In addition to faster performance, reading columns as dictionary-encoded can
+yield significantly less memory use.</p>
+
+<p>In the <code class="highlighter-rouge">dict-random</code> case above, we 
found that the master branch uses 405 MB of
+RAM at peak while loading a 152 MB dataset. In v0.12.1, loading the same
+Parquet file without the accelerated dictionary support uses 1.94 GB of peak
+memory while the resulting non-dictionary table occupies 1.01 GB.</p>
+
+<p>Note that versions 0.14.0 and 0.14.1 had a memory overuse bug, fixed in
+ARROW-6060; if you are hitting it, you will want to upgrade to 0.15.0 as soon
+as it is released.</p>
+
+<h1 id="conclusion">Conclusion</h1>
+
+<p>There are still many Parquet-related optimizations that we may pursue in the
+future, but the ones described here can be very helpful to people working with
+string-heavy datasets, in both performance and memory use. If you’d like to
+discuss this development work, we’d be glad to hear from you on our developer
+mailing list [email protected].</p>
+
+
+  </div>
+
+  
+
+  
+    
+  <div class="blog-post" style="margin-bottom: 4rem">
+    
+<h1>
   Apache Arrow R Package On CRAN
   <a href="/blog/2019/08/08/r-package-on-cran/" class="permalink" 
title="Permalink">∞</a>
 </h1>
@@ -201,12 +433,12 @@ library.</p>
 
 <h2 id="parquet-files">Parquet files</h2>
 
-<p>This release introduces basic read and write support for the <a 
href="https://parquet.apache.org/";>Apache
-Parquet</a> columnar data file format. Prior to this
-release, options for accessing Parquet data in R were limited; the most common
-recommendation was to use Apache Spark. The <code 
class="highlighter-rouge">arrow</code> package greatly simplifies
-this access and lets you go from a Parquet file to a <code 
class="highlighter-rouge">data.frame</code> and back
-easily, without having to set up a database.</p>
+<p>This package introduces basic read and write support for the <a 
href="https://parquet.apache.org/";>Apache
+Parquet</a> columnar data file format. Prior to its
+availability, options for accessing Parquet data in R were limited; the most
+common recommendation was to use Apache Spark. The <code 
class="highlighter-rouge">arrow</code> package greatly
+simplifies this access and lets you go from a Parquet file to a <code 
class="highlighter-rouge">data.frame</code>
+and back easily, without having to set up a database.</p>
 
 <div class="language-r highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="n">library</span><span 
class="p">(</span><span class="n">arrow</span><span class="p">)</span><span 
class="w">
 </span><span class="n">df</span><span class="w"> </span><span 
class="o">&lt;-</span><span class="w"> </span><span 
class="n">read_parquet</span><span class="p">(</span><span 
class="s2">"path/to/file.parquet"</span><span class="p">)</span><span class="w">
@@ -242,7 +474,7 @@ future.</p>
 
 <h2 id="feather-files">Feather files</h2>
 
-<p>This release also includes a faster and more robust implementation of the
+<p>This package also includes a faster and more robust implementation of the
 Feather file format, providing <code 
class="highlighter-rouge">read_feather()</code> and
 <code class="highlighter-rouge">write_feather()</code>. <a 
href="https://github.com/wesm/feather";>Feather</a> was one of the
 initial applications of Apache Arrow for Python and R, providing an efficient,
@@ -255,10 +487,10 @@ years, the Python implementation of Feather has just been 
a wrapper around
 <code class="highlighter-rouge">pyarrow</code>. This meant that as Arrow 
progressed and bugs were fixed, the Python
 version of Feather got the improvements but sadly R did not.</p>
 
-<p>With this release, the R implementation of Feather catches up and now 
depends
-on the same underlying C++ library as the Python version does. This should
-result in more reliable and consistent behavior across the two languages, as
-well as <a href="https://wesmckinney.com/blog/feather-arrow-future/";>improved
+<p>With the <code class="highlighter-rouge">arrow</code> package, the R 
implementation of Feather catches up and now
+depends on the same underlying C++ library as the Python version does. This
+should result in more reliable and consistent behavior across the two
+languages, as well as <a 
href="https://wesmckinney.com/blog/feather-arrow-future/";>improved
 performance</a>.</p>
 
 <p>We encourage all R users of <code class="highlighter-rouge">feather</code> 
to switch to using
diff --git a/feed.xml b/feed.xml
index c3b66a5..951278d 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,206 @@
-<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom"; ><generator uri="https://jekyllrb.com/"; 
version="3.8.4">Jekyll</generator><link href="/feed.xml" rel="self" 
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" 
/><updated>2019-08-27T19:11:56-04:00</updated><id>/feed.xml</id><entry><title 
type="html">Apache Arrow R Package On CRAN</title><link 
href="/blog/2019/08/08/r-package-on-cran/" rel="alternate" type="text/html" 
title="Apache Ar [...]
+<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom"; ><generator uri="https://jekyllrb.com/"; 
version="3.8.4">Jekyll</generator><link href="/feed.xml" rel="self" 
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" 
/><updated>2019-09-09T11:48:28-04:00</updated><id>/feed.xml</id><entry><title 
type="html">Faster C++ Apache Parquet performance on dictionary-encoded string 
data coming in Apache Arrow 0.15</title><link href="/blog/2019/09/05/ [...]
+
+--&gt;
+
+&lt;p&gt;We have been implementing a series of optimizations in the Apache 
Parquet C++
+internals to improve read and write efficiency (both performance and memory
+use) for Arrow columnar binary and string data, with new “native” support for
+Arrow’s dictionary types. This should have a big impact on users of the C++,
+MATLAB, Python, R, and Ruby interfaces to Parquet files.&lt;/p&gt;
+
+&lt;p&gt;This post reviews work that was done and shows benchmarks comparing 
Arrow
+0.12.1 with the current development version (to be released soon as Arrow
+0.15.0).&lt;/p&gt;
+
+&lt;h1 id=&quot;summary-of-work&quot;&gt;Summary of work&lt;/h1&gt;
+
+&lt;p&gt;One of the largest and most complex optimizations involves encoding 
and
+decoding Parquet files’ internal dictionary-encoded data streams to and from
+Arrow’s in-memory dictionary-encoded &lt;code 
class=&quot;highlighter-rouge&quot;&gt;DictionaryArray&lt;/code&gt;
+representation. Dictionary encoding is a compression strategy in Parquet, and
+there is no formal “dictionary” or “categorical” type. I will go into more
+detail about this below.&lt;/p&gt;
+
+&lt;p&gt;Some of the particular JIRA issues related to this work 
include:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Vectorize comparators for computing statistics (&lt;a 
href=&quot;https://issues.apache.org/jira/browse/PARQUET-1523&quot;&gt;PARQUET-1523&lt;/a&gt;)&lt;/li&gt;
+  &lt;li&gt;Read binary data directly into dictionary builder
+(&lt;a href=&quot;https://issues.apache.org/jira/browse/ARROW-3769&quot;&gt;ARROW-3769&lt;/a&gt;)&lt;/li&gt;
+  &lt;li&gt;Writing Parquet’s dictionary indices directly into dictionary 
builder
+(&lt;a 
href=&quot;https://issues.apache.org/jira/browse/ARROW-3772&quot;&gt;ARROW-3772&lt;/a&gt;)&lt;/li&gt;
+  &lt;li&gt;Write dense (non-dictionary) Arrow arrays directly into Parquet 
data encoders
+(&lt;a 
href=&quot;https://issues.apache.org/jira/browse/ARROW-6152&quot;&gt;ARROW-6152&lt;/a&gt;)&lt;/li&gt;
+  &lt;li&gt;Direct writing of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;arrow::DictionaryArray&lt;/code&gt; to 
Parquet column writers (&lt;a 
href=&quot;https://issues.apache.org/jira/browse/ARROW-3246&quot;&gt;ARROW-3246&lt;/a&gt;)&lt;/li&gt;
+  &lt;li&gt;Supporting changing dictionaries (&lt;a 
href=&quot;https://issues.apache.org/jira/browse/ARROW-3144&quot;&gt;ARROW-3144&lt;/a&gt;)&lt;/li&gt;
+  &lt;li&gt;Internal IO optimizations and improved raw &lt;code 
class=&quot;highlighter-rouge&quot;&gt;BYTE_ARRAY&lt;/code&gt; encoding 
performance
+(&lt;a 
href=&quot;https://issues.apache.org/jira/browse/ARROW-4398&quot;&gt;ARROW-4398&lt;/a&gt;)&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;One of the challenges of developing the Parquet C++ library is that 
we maintain
+low-level read and write APIs that do not involve the Arrow columnar data
+structures. So we have had to take care to implement Arrow-related
+optimizations without impacting non-Arrow Parquet users, which includes
+database systems like Clickhouse and Vertica.&lt;/p&gt;
+
+&lt;h1 
id=&quot;background-how-parquet-files-do-dictionary-encoding&quot;&gt;Background:
 how Parquet files do dictionary encoding&lt;/h1&gt;
+
+&lt;p&gt;Many direct and indirect users of Apache Arrow use dictionary 
encoding to
+improve performance and memory use on binary or string data types that include
+many repeated values. MATLAB or pandas users will know this as the Categorical
+type (see &lt;a 
href=&quot;https://www.mathworks.com/help/matlab/categorical-arrays.html&quot;&gt;MATLAB
 docs&lt;/a&gt; or &lt;a 
href=&quot;https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html&quot;&gt;pandas
 docs&lt;/a&gt;) while in R such encoding is
+known as &lt;a 
href=&quot;https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html&quot;&gt;&lt;code
 class=&quot;highlighter-rouge&quot;&gt;factor&lt;/code&gt;&lt;/a&gt;. In the 
Arrow C++ library and various bindings we have
+the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;DictionaryArray&lt;/code&gt; object for 
representing such data in memory.&lt;/p&gt;
+
+&lt;p&gt;For example, an array such as&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;['apple', 'orange', 'apple', NULL, 
'orange', 'orange']
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;has dictionary-encoded form&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;dictionary: ['apple', 'orange']
+indices: [0, 1, 0, NULL, 1, 1]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The &lt;a 
href=&quot;https://github.com/apache/parquet-format/blob/master/Encodings.md&quot;&gt;Parquet
 format uses dictionary encoding&lt;/a&gt; to compress data, and it is
+used for all Parquet data types, not just binary or string data. Parquet
+further uses bit-packing and run-length encoding (RLE) to compress the
+dictionary indices, so if you had data like&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;['apple', 'apple', 'apple', 'apple', 
'apple', 'apple', 'orange']
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;the indices would be encoded like&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;[rle-run=(6, 0),
+ bit-packed-run=[1]]
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
+
+&lt;p&gt;The full details of the rle-bitpacking encoding are found in the 
&lt;a 
href=&quot;https://github.com/apache/parquet-format/blob/master/Encodings.md&quot;&gt;Parquet
+specification&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;When writing a Parquet file, most implementations will use dictionary 
encoding
+to compress a column until the dictionary itself reaches a certain size
+threshold, usually around 1 megabyte. At this point, the column writer will
+“fall back” to &lt;code 
class=&quot;highlighter-rouge&quot;&gt;PLAIN&lt;/code&gt; encoding where values 
are written end-to-end in “data
+pages” and then usually compressed with Snappy or Gzip. See the following rough
+diagram:&lt;/p&gt;
+
+&lt;div align=&quot;center&quot;&gt;
+&lt;img src=&quot;/img/20190903-parquet-dictionary-column-chunk.png&quot; 
alt=&quot;Internal ColumnChunk structure&quot; width=&quot;80%&quot; 
class=&quot;img-responsive&quot; /&gt;
+&lt;/div&gt;
+
+&lt;h1 
id=&quot;faster-reading-and-writing-of-dictionary-encoded-data&quot;&gt;Faster 
reading and writing of dictionary-encoded data&lt;/h1&gt;
+
+&lt;p&gt;When reading a Parquet file, the dictionary-encoded portions are 
usually
+materialized to their non-dictionary-encoded form, causing binary or string
+values to be duplicated in memory. So an obvious (but not trivial) optimization
+is to skip this “dense” materialization. There are several issues to deal 
with:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;A Parquet file often contains multiple ColumnChunks for each 
semantic column,
+and the dictionary values may be different in each ColumnChunk&lt;/li&gt;
+  &lt;li&gt;We must gracefully handle the “fall back” portion which is not
+dictionary-encoded&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;We pursued several avenues to help with this:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Allowing each &lt;code 
class=&quot;highlighter-rouge&quot;&gt;DictionaryArray&lt;/code&gt; to have a 
different dictionary (before, the
+dictionary was part of the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;DictionaryType&lt;/code&gt;, which 
caused problems)&lt;/li&gt;
+  &lt;li&gt;We enabled the Parquet dictionary indices to be directly written 
into an
+Arrow &lt;code 
class=&quot;highlighter-rouge&quot;&gt;DictionaryBuilder&lt;/code&gt; without 
rehashing the data&lt;/li&gt;
+  &lt;li&gt;When decoding a ColumnChunk, we first append the dictionary values 
and
+indices into an Arrow &lt;code 
class=&quot;highlighter-rouge&quot;&gt;DictionaryBuilder&lt;/code&gt;, and when 
we encounter the “fall
+back” portion we use a hash table to convert those values to
+dictionary-encoded form&lt;/li&gt;
+  &lt;li&gt;We override the “fall back” logic when writing a ColumnChunk from a
+&lt;code class=&quot;highlighter-rouge&quot;&gt;DictionaryArray&lt;/code&gt; so that reading such data back is more efficient&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;All of these things together have produced some excellent performance 
results
+that we will detail below.&lt;/p&gt;
+
+&lt;p&gt;The other class of optimizations we implemented was removing an 
abstraction
+layer between the low-level Parquet column data encoder and decoder classes and
+the Arrow columnar data structures. This involves:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Adding &lt;code 
class=&quot;highlighter-rouge&quot;&gt;ColumnWriter::WriteArrow&lt;/code&gt; 
and &lt;code class=&quot;highlighter-rouge&quot;&gt;Encoder::Put&lt;/code&gt; 
methods that accept
+&lt;code class=&quot;highlighter-rouge&quot;&gt;arrow::Array&lt;/code&gt; 
objects directly&lt;/li&gt;
+  &lt;li&gt;Adding &lt;code 
class=&quot;highlighter-rouge&quot;&gt;ByteArrayDecoder::DecodeArrow&lt;/code&gt;
 method to decode binary data directly
+into an &lt;code 
class=&quot;highlighter-rouge&quot;&gt;arrow::BinaryBuilder&lt;/code&gt;.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;While the performance improvements from this work are less dramatic 
than for
+dictionary-encoded data, they are still meaningful in real-world 
applications.&lt;/p&gt;
+
+&lt;h1 id=&quot;performance-benchmarks&quot;&gt;Performance 
Benchmarks&lt;/h1&gt;
+
+&lt;p&gt;We ran some benchmarks comparing Arrow 0.12.1 with the current master
+branch. We construct two kinds of Arrow tables with 10 columns each:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;“Low cardinality” and “high cardinality” variants. The “low cardinality” case
+has 1,000 unique string values of 32 bytes each; the “high cardinality” case has
+100,000 unique values&lt;/li&gt;
+  &lt;li&gt;“Dense” (non-dictionary) and “Dictionary” variants&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;a 
href=&quot;https://gist.github.com/wesm/b4554e2d6028243a30eeed2c644a9066&quot;&gt;See
 the full benchmark script.&lt;/a&gt;&lt;/p&gt;
+
+&lt;p&gt;We show both single-threaded and multithreaded read performance. The 
test
+machine is an Intel i9-9960X using gcc 8.3.0 (on Ubuntu 18.04) with 16 physical
+cores and 32 virtual cores. All time measurements are reported in seconds, but
+we are most interested in showing the relative performance.&lt;/p&gt;
+
+&lt;p&gt;First, the writing benchmarks:&lt;/p&gt;
+
+&lt;div align=&quot;center&quot;&gt;
+&lt;img src=&quot;/img/20190903_parquet_write_perf.png&quot; alt=&quot;Parquet 
write benchmarks&quot; width=&quot;80%&quot; class=&quot;img-responsive&quot; 
/&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Writing &lt;code 
class=&quot;highlighter-rouge&quot;&gt;DictionaryArray&lt;/code&gt; is 
dramatically faster due to the optimizations
+described above. We have achieved a small improvement in writing dense
+(non-dictionary) binary arrays.&lt;/p&gt;
+
+&lt;p&gt;Then, the reading benchmarks:&lt;/p&gt;
+
+&lt;div align=&quot;center&quot;&gt;
+&lt;img src=&quot;/img/20190903_parquet_read_perf.png&quot; alt=&quot;Parquet 
read benchmarks&quot; width=&quot;80%&quot; class=&quot;img-responsive&quot; 
/&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Here, similarly, reading &lt;code class=&quot;highlighter-rouge&quot;&gt;DictionaryArray&lt;/code&gt; directly is many times faster.&lt;/p&gt;
+
+&lt;p&gt;These benchmarks show that parallel reads of dense binary data may be slightly
+slower, though single-threaded reads are now faster. We may want to do some
+profiling to see how to bring parallel read performance back in line. Optimizing
+the dense read path has not been as high a priority as the dictionary read path
+in this work.&lt;/p&gt;
+
+&lt;h1 id=&quot;memory-use-improvements&quot;&gt;Memory Use 
Improvements&lt;/h1&gt;
+
+&lt;p&gt;In addition to faster performance, reading columns as 
dictionary-encoded can
+yield significantly less memory use.&lt;/p&gt;
+
+&lt;p&gt;In the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;dict-random&lt;/code&gt; case above, we 
found that the master branch uses 405 MB of
+RAM at peak while loading a 152 MB dataset. In v0.12.1, loading the same
+Parquet file without the accelerated dictionary support uses 1.94 GB of peak
+memory while the resulting non-dictionary table occupies 1.01 GB.&lt;/p&gt;
+
+&lt;p&gt;Note that versions 0.14.0 and 0.14.1 had a memory overuse bug, fixed in
+ARROW-6060; if you are hitting it, you will want to upgrade to 0.15.0 as soon
+as it is released.&lt;/p&gt;
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+
+&lt;p&gt;There are still many Parquet-related optimizations that we may pursue in the
+future, but the ones described here can be very helpful to people working with
+string-heavy datasets, in both performance and memory use. If you’d like to
+discuss this development work, we’d be glad to hear from you on our developer
+mailing list [email protected].&lt;/p&gt;</content><author><name>Wes 
McKinney</name></author></entry><entry><title type="html">Apache Arrow R 
Package On CRAN</title><link href="/blog/2019/08/08/r-package-on-cran/" 
rel="alternate" type="text/html" title="Apache Arrow R Package On CRAN" 
/><published>2019-08-08T08:00:00-04:00</published><updated>2019-08-08T08:00:00-04:00</updated><id>/blog/2019/08/08/r-package-on-cran</id><content
 type="html" xml:base="/blog/2019/08/08/r-package-on-cran/ [...]
 
 --&gt;
 
@@ -46,12 +248,12 @@ library.&lt;/p&gt;
 
 &lt;h2 id=&quot;parquet-files&quot;&gt;Parquet files&lt;/h2&gt;
 
-&lt;p&gt;This release introduces basic read and write support for the &lt;a 
href=&quot;https://parquet.apache.org/&quot;&gt;Apache
-Parquet&lt;/a&gt; columnar data file format. Prior to this
-release, options for accessing Parquet data in R were limited; the most common
-recommendation was to use Apache Spark. The &lt;code 
class=&quot;highlighter-rouge&quot;&gt;arrow&lt;/code&gt; package greatly 
simplifies
-this access and lets you go from a Parquet file to a &lt;code 
class=&quot;highlighter-rouge&quot;&gt;data.frame&lt;/code&gt; and back
-easily, without having to set up a database.&lt;/p&gt;
+&lt;p&gt;This package introduces basic read and write support for the &lt;a 
href=&quot;https://parquet.apache.org/&quot;&gt;Apache
+Parquet&lt;/a&gt; columnar data file format. Prior to its
+availability, options for accessing Parquet data in R were limited; the most
+common recommendation was to use Apache Spark. The &lt;code 
class=&quot;highlighter-rouge&quot;&gt;arrow&lt;/code&gt; package greatly
+simplifies this access and lets you go from a Parquet file to a &lt;code 
class=&quot;highlighter-rouge&quot;&gt;data.frame&lt;/code&gt;
+and back easily, without having to set up a database.&lt;/p&gt;
 
 &lt;div class=&quot;language-r highlighter-rouge&quot;&gt;&lt;div 
class=&quot;highlight&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;n&quot;&gt;library&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;arrow&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
 &lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span 
class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;&amp;lt;-&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_parquet&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;s2&quot;&gt;&quot;path/to/file.parquet&quot;&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
@@ -87,7 +289,7 @@ future.&lt;/p&gt;
 
 &lt;h2 id=&quot;feather-files&quot;&gt;Feather files&lt;/h2&gt;
 
-&lt;p&gt;This release also includes a faster and more robust implementation of 
the
+&lt;p&gt;This package also includes a faster and more robust implementation of 
the
 Feather file format, providing &lt;code 
class=&quot;highlighter-rouge&quot;&gt;read_feather()&lt;/code&gt; and
 &lt;code class=&quot;highlighter-rouge&quot;&gt;write_feather()&lt;/code&gt;. 
&lt;a href=&quot;https://github.com/wesm/feather&quot;&gt;Feather&lt;/a&gt; was 
one of the
 initial applications of Apache Arrow for Python and R, providing an efficient,
@@ -100,10 +302,10 @@ years, the Python implementation of Feather has just been 
a wrapper around
 &lt;code class=&quot;highlighter-rouge&quot;&gt;pyarrow&lt;/code&gt;. This 
meant that as Arrow progressed and bugs were fixed, the Python
 version of Feather got the improvements but sadly R did not.&lt;/p&gt;
 
-&lt;p&gt;With this release, the R implementation of Feather catches up and now 
depends
-on the same underlying C++ library as the Python version does. This should
-result in more reliable and consistent behavior across the two languages, as
-well as &lt;a 
href=&quot;https://wesmckinney.com/blog/feather-arrow-future/&quot;&gt;improved
+&lt;p&gt;With the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;arrow&lt;/code&gt; package, the R 
implementation of Feather catches up and now
+depends on the same underlying C++ library as the Python version does. This
+should result in more reliable and consistent behavior across the two
+languages, as well as &lt;a 
href=&quot;https://wesmckinney.com/blog/feather-arrow-future/&quot;&gt;improved
 performance&lt;/a&gt;.&lt;/p&gt;
 
 &lt;p&gt;We encourage all R users of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;feather&lt;/code&gt; to switch to using
@@ -1355,41 +1557,4 @@ Open Analytics Initiative&lt;/a&gt;.&lt;/p&gt;
 
 &lt;p&gt;In the coming months, we will continue to make progress on many 
fronts, with
 Gandiva packaging, expanded language support (especially in R), and improved
-data access (e.g. CSV, Parquet files) in 
focus.&lt;/p&gt;</content><author><name>wesm</name></author></entry><entry><title
 type="html">Apache Arrow 0.10.0 Release</title><link 
href="/blog/2018/08/07/0.10.0-release/" rel="alternate" type="text/html" 
title="Apache Arrow 0.10.0 Release" 
/><published>2018-08-07T00:00:00-04:00</published><updated>2018-08-07T00:00:00-04:00</updated><id>/blog/2018/08/07/0.10.0-release</id><content
 type="html" xml:base="/blog/2018/08/07/0.10.0-release/">&lt;!--
-
---&gt;
-
-&lt;p&gt;The Apache Arrow team is pleased to announce the 0.10.0 release. It 
is the
-product of over 4 months of development and includes &lt;a 
href=&quot;https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.10.0&quot;&gt;&lt;strong&gt;470
 resolved
-issues&lt;/strong&gt;&lt;/a&gt;. It is the largest release so far in the 
project’s history. 90
-individuals contributed to this release.&lt;/p&gt;
-
-&lt;p&gt;See the &lt;a 
href=&quot;https://arrow.apache.org/install&quot;&gt;Install Page&lt;/a&gt; to 
learn how to get the libraries for your
-platform. The &lt;a 
href=&quot;https://arrow.apache.org/release/0.10.0.html&quot;&gt;complete 
changelog&lt;/a&gt; is also available.&lt;/p&gt;
-
-&lt;p&gt;We discuss some highlights from the release and other project news in 
this
-post.&lt;/p&gt;
-
-&lt;h2 
id=&quot;offical-binary-packages-and-packaging-automation&quot;&gt;Offical 
Binary Packages and Packaging Automation&lt;/h2&gt;
-
-&lt;p&gt;One of the largest projects in this release cycle was automating our 
build and
-packaging tooling to be able to easily and reproducibly create a &lt;a 
href=&quot;https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.10.0/binaries&quot;&gt;comprehensive
-set of binary artifacts&lt;/a&gt; which have been approved and released by the 
Arrow
-PMC. We developed a tool called &lt;strong&gt;Crossbow&lt;/strong&gt; which 
uses Appveyor and Travis CI
-to build each of the different supported packages on all 3 platforms (Linux,
-macOS, and Windows). As a result of our efforts, we should be able to make more
-frequent Arrow releases. This work was led by Phillip Cloud, Kouhei Sutou, and
-Krisztián Szűcs. Bravo!&lt;/p&gt;
-
-&lt;h2 id=&quot;new-programming-languages-go-ruby-rust&quot;&gt;New 
Programming Languages: Go, Ruby, Rust&lt;/h2&gt;
-
-&lt;p&gt;This release also adds 3 new programming languages to the project: 
Go, Ruby,
-and Rust. Together with C, C++, Java, JavaScript, and Python, &lt;strong&gt;we 
now have
-some level of support for 8 programming languages&lt;/strong&gt;.&lt;/p&gt;
-
-&lt;h2 id=&quot;upcoming-roadmap&quot;&gt;Upcoming Roadmap&lt;/h2&gt;
-
-&lt;p&gt;In the coming months, we will be working to move Apache Arrow closer 
to a 1.0.0
-release. We will continue to grow new features, improve performance and
-stability, and expand support for currently supported and new programming
-languages.&lt;/p&gt;</content><author><name>wesm</name></author></entry></feed>
\ No newline at end of file
+data access (e.g. CSV, Parquet files) in 
focus.&lt;/p&gt;</content><author><name>wesm</name></author></entry></feed>
\ No newline at end of file
diff --git a/img/20190903-parquet-dictionary-column-chunk.png 
b/img/20190903-parquet-dictionary-column-chunk.png
new file mode 100644
index 0000000..38a4c14
Binary files /dev/null and b/img/20190903-parquet-dictionary-column-chunk.png 
differ
diff --git a/img/20190903_parquet_read_perf.png 
b/img/20190903_parquet_read_perf.png
new file mode 100644
index 0000000..fa4e4f5
Binary files /dev/null and b/img/20190903_parquet_read_perf.png differ
diff --git a/img/20190903_parquet_write_perf.png 
b/img/20190903_parquet_write_perf.png
new file mode 100644
index 0000000..2c91baf
Binary files /dev/null and b/img/20190903_parquet_write_perf.png differ
