This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new e9e47f1 Updating built site (build
f45dd485ccb491ea2681d369a0258f44cc560ac6)
e9e47f1 is described below
commit e9e47f1c43a9c5c60934c496a9d4e333f65958f2
Author: Neal Richardson <[email protected]>
AuthorDate: Wed Apr 22 23:55:32 2020 +0000
Updating built site (build f45dd485ccb491ea2681d369a0258f44cc560ac6)
---
...manifest-07b3643e10d26ac1b64aff61ff62464d.json} | 2 +-
blog/2020/04/21/0.17.0-release/index.html | 496 +++++++++++++++++++++
blog/index.html | 15 +
feed.xml | 453 ++++++++++---------
4 files changed, 757 insertions(+), 209 deletions(-)
diff --git a/assets/.sprockets-manifest-9f55fef5b0b2da26929349fe08192161.json
b/assets/.sprockets-manifest-07b3643e10d26ac1b64aff61ff62464d.json
similarity index 79%
rename from assets/.sprockets-manifest-9f55fef5b0b2da26929349fe08192161.json
rename to assets/.sprockets-manifest-07b3643e10d26ac1b64aff61ff62464d.json
index ec8cb97..2bf02f9 100644
--- a/assets/.sprockets-manifest-9f55fef5b0b2da26929349fe08192161.json
+++ b/assets/.sprockets-manifest-07b3643e10d26ac1b64aff61ff62464d.json
@@ -1 +1 @@
-{"files":{"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js":{"logical_path":"main.js","mtime":"2020-04-21T08:23:41-04:00","size":124531,"digest":"18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33","integrity":"sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM="}},"assets":{"main.js":"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"}}
\ No newline at end of file
+{"files":{"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js":{"logical_path":"main.js","mtime":"2020-04-22T19:55:24-04:00","size":124531,"digest":"18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33","integrity":"sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM="}},"assets":{"main.js":"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"}}
\ No newline at end of file
diff --git a/blog/2020/04/21/0.17.0-release/index.html
b/blog/2020/04/21/0.17.0-release/index.html
new file mode 100644
index 0000000..cd67589
--- /dev/null
+++ b/blog/2020/04/21/0.17.0-release/index.html
@@ -0,0 +1,496 @@
+<!DOCTYPE html>
+<html lang="en-US">
+ <head>
+ <meta charset="UTF-8">
+ <meta http-equiv="X-UA-Compatible" content="IE=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <!-- The above meta tags *must* come first in the head; any other head
content must come *after* these tags -->
+
+ <title>Apache Arrow 0.17.0 Release | Apache Arrow</title>
+
+
+ <!-- Begin Jekyll SEO tag v2.6.1 -->
+<meta name="generator" content="Jekyll v3.8.4" />
+<meta property="og:title" content="Apache Arrow 0.17.0 Release" />
+<meta name="author" content="pmc" />
+<meta property="og:locale" content="en_US" />
+<meta name="description" content="The Apache Arrow team is pleased to announce
the 0.17.0 release. This covers over 2 months of development work and includes
569 resolved issues from 79 distinct contributors. See the Install Page to
learn how to get the libraries for your platform. The release notes below are
not exhaustive and only expose selected highlights of the release. Many other
bugfixes and improvements have been made: we refer you to the complete
changelog. Community Since the 0 [...]
+<meta property="og:description" content="The Apache Arrow team is pleased to
announce the 0.17.0 release. This covers over 2 months of development work and
includes 569 resolved issues from 79 distinct contributors. See the Install
Page to learn how to get the libraries for your platform. The release notes
below are not exhaustive and only expose selected highlights of the release.
Many other bugfixes and improvements have been made: we refer you to the
complete changelog. Community Sinc [...]
+<link rel="canonical"
href="https://arrow.apache.org/blog/2020/04/21/0.17.0-release/" />
+<meta property="og:url"
content="https://arrow.apache.org/blog/2020/04/21/0.17.0-release/" />
+<meta property="og:site_name" content="Apache Arrow" />
+<meta property="og:image" content="https://arrow.apache.org/img/arrow.png" />
+<meta property="og:type" content="article" />
+<meta property="article:published_time" content="2020-04-21T02:00:00-04:00" />
+<meta name="twitter:card" content="summary_large_image" />
+<meta property="twitter:image"
content="https://arrow.apache.org/img/arrow.png" />
+<meta property="twitter:title" content="Apache Arrow 0.17.0 Release" />
+<meta name="twitter:site" content="@ApacheArrow" />
+<meta name="twitter:creator" content="@pmc" />
+<script type="application/ld+json">
+{"headline":"Apache Arrow 0.17.0
Release","dateModified":"2020-04-21T02:00:00-04:00","datePublished":"2020-04-21T02:00:00-04:00","publisher":{"@type":"Organization","logo":{"@type":"ImageObject","url":"https://arrow.apache.org/img/logo.png"},"name":"pmc"},"@type":"BlogPosting","mainEntityOfPage":{"@type":"WebPage","@id":"https://arrow.apache.org/blog/2020/04/21/0.17.0-release/"},"description":"The
Apache Arrow team is pleased to announce the 0.17.0 release. This covers over
2 months of d [...]
+<!-- End Jekyll SEO tag -->
+
+
+ <!-- favicons -->
+ <link rel="icon" type="image/png" sizes="16x16"
href="/img/favicon-16x16.png" id="light1">
+ <link rel="icon" type="image/png" sizes="32x32"
href="/img/favicon-32x32.png" id="light2">
+ <link rel="apple-touch-icon" type="image/png" sizes="180x180"
href="/img/apple-touch-icon.png" id="light3">
+ <link rel="apple-touch-icon" type="image/png" sizes="120x120"
href="/img/apple-touch-icon-120x120.png" id="light4">
+ <link rel="apple-touch-icon" type="image/png" sizes="76x76"
href="/img/apple-touch-icon-76x76.png" id="light5">
+ <link rel="apple-touch-icon" type="image/png" sizes="60x60"
href="/img/apple-touch-icon-60x60.png" id="light6">
+ <!-- dark mode favicons -->
+ <link rel="icon" type="image/png" sizes="16x16"
href="/img/favicon-16x16-dark.png" id="dark1">
+ <link rel="icon" type="image/png" sizes="32x32"
href="/img/favicon-32x32-dark.png" id="dark2">
+ <link rel="apple-touch-icon" type="image/png" sizes="180x180"
href="/img/apple-touch-icon-dark.png" id="dark3">
+ <link rel="apple-touch-icon" type="image/png" sizes="120x120"
href="/img/apple-touch-icon-120x120-dark.png" id="dark4">
+ <link rel="apple-touch-icon" type="image/png" sizes="76x76"
href="/img/apple-touch-icon-76x76-dark.png" id="dark5">
+ <link rel="apple-touch-icon" type="image/png" sizes="60x60"
href="/img/apple-touch-icon-60x60-dark.png" id="dark6">
+
+  <script>
+    // Switch to the dark-mode favicons if prefers-color-scheme: dark.
+    // Cache both sets of <link> elements up front: re-querying inside
+    // onUpdate would return null for whichever set is currently
+    // detached from the document.
+    var lightIcons = [], darkIcons = [];
+    for (var i = 1; i <= 6; i++) {
+      lightIcons.push(document.querySelector('link#light' + i));
+      darkIcons.push(document.querySelector('link#dark' + i));
+    }
+    function onUpdate() {
+      var remove = matcher.matches ? lightIcons : darkIcons;
+      var append = matcher.matches ? darkIcons : lightIcons;
+      remove.forEach(function (el) { el.remove(); });
+      append.forEach(function (el) { document.head.append(el); });
+    }
+    var matcher = window.matchMedia('(prefers-color-scheme: dark)');
+    matcher.addListener(onUpdate);
+    onUpdate();
+  </script>
+
+ <link rel="stylesheet"
href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+ <link href="/css/main.css" rel="stylesheet">
+ <link href="/css/syntax.css" rel="stylesheet">
+ <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js"
integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo"
crossorigin="anonymous"></script>
+ <script
src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js"
integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49"
crossorigin="anonymous"></script>
+
+ <!-- Global Site Tag (gtag.js) - Google Analytics -->
+<script async
src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script>
+<script>
+ window.dataLayer = window.dataLayer || [];
+  function gtag(){dataLayer.push(arguments);}
+ gtag('js', new Date());
+
+ gtag('config', 'UA-107500873-1');
+</script>
+
+
+ </head>
+
+
+<body class="wrap">
+ <header>
+ <nav class="navbar navbar-expand-md navbar-dark bg-dark">
+ <a class="navbar-brand" href="/"><img src="/img/arrow-inverse-300px.png"
height="60px"/></a>
+ <button class="navbar-toggler" type="button" data-toggle="collapse"
data-target="#arrow-navbar" aria-controls="arrow-navbar" aria-expanded="false"
aria-label="Toggle navigation">
+ <span class="navbar-toggler-icon"></span>
+ </button>
+
+ <!-- Collect the nav links, forms, and other content for toggling -->
+ <div class="collapse navbar-collapse" id="arrow-navbar">
+ <ul class="nav navbar-nav">
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownProjectLinks" role="button"
data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ Project Links
+ </a>
+ <div class="dropdown-menu"
aria-labelledby="navbarDropdownProjectLinks">
+ <a class="dropdown-item" href="/install/">Installation</a>
+ <a class="dropdown-item" href="/release/">Releases</a>
+ <a class="dropdown-item" href="/faq/">FAQ</a>
+ <a class="dropdown-item" href="/blog/">Blog</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow">Source Code</a>
+ <a class="dropdown-item"
href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a>
+ </div>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownCommunity" role="button" data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ Community
+ </a>
+ <div class="dropdown-menu" aria-labelledby="navbarDropdownCommunity">
+ <a class="dropdown-item"
href="http://mail-archives.apache.org/mod_mbox/arrow-user/">User Mailing
List</a>
+ <a class="dropdown-item"
href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Dev Mailing List</a>
+ <a class="dropdown-item"
href="https://cwiki.apache.org/confluence/display/ARROW">Developer Wiki</a>
+ <a class="dropdown-item" href="/committers/">Committers</a>
+ <a class="dropdown-item" href="/powered_by/">Powered By</a>
+ </div>
+ </li>
+ <li class="nav-item">
+ <a class="nav-link" href="/docs/format/Columnar.html"
+ role="button" aria-haspopup="true" aria-expanded="false">
+ Specification
+ </a>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownDocumentation" role="button"
data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ Documentation
+ </a>
+ <div class="dropdown-menu"
aria-labelledby="navbarDropdownDocumentation">
+ <a class="dropdown-item" href="/docs">Project Docs</a>
+ <a class="dropdown-item" href="/docs/python">Python</a>
+ <a class="dropdown-item" href="/docs/cpp">C++</a>
+ <a class="dropdown-item" href="/docs/java">Java</a>
+ <a class="dropdown-item" href="/docs/c_glib">C GLib</a>
+ <a class="dropdown-item" href="/docs/js">JavaScript</a>
+ <a class="dropdown-item" href="/docs/r">R</a>
+ </div>
+ </li>
+ <!-- <li><a href="/blog">Blog</a></li> -->
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownASF" role="button" data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ ASF Links
+ </a>
+ <div class="dropdown-menu" aria-labelledby="navbarDropdownASF">
+ <a class="dropdown-item" href="http://www.apache.org/">ASF
Website</a>
+ <a class="dropdown-item"
href="http://www.apache.org/licenses/">License</a>
+ <a class="dropdown-item"
href="http://www.apache.org/foundation/sponsorship.html">Donate</a>
+ <a class="dropdown-item"
href="http://www.apache.org/foundation/thanks.html">Thanks</a>
+ <a class="dropdown-item"
href="http://www.apache.org/security/">Security</a>
+ </div>
+ </li>
+ </ul>
+ <div class="flex-row justify-content-end ml-md-auto">
+ <a class="d-sm-none d-md-inline pr-2"
href="https://www.apache.org/events/current-event.html">
+ <img src="https://www.apache.org/events/current-event-234x60.png"/>
+ </a>
+ <a href="http://www.apache.org/">
+ <img src="/img/asf_logo.svg" width="120px"/>
+ </a>
+ </div>
+ </div><!-- /.navbar-collapse -->
+ </div>
+ </nav>
+
+ </header>
+
+ <div class="container p-lg-4">
+ <main role="main">
+
+
+
+<h1>
+ Apache Arrow 0.17.0 Release
+</h1>
+
+
+
+<p>
+ <span class="badge badge-secondary">Published</span>
+ <span class="published">
+ 21 Apr 2020
+ </span>
+ <br />
+ <span class="badge badge-secondary">By</span>
+
+ <a href="https://arrow.apache.org">The Apache Arrow PMC (pmc) </a>
+
+
+
+</p>
+
+
+ <!--
+
+-->
+
+<p>The Apache Arrow team is pleased to announce the 0.17.0 release. This covers
+over 2 months of development work and includes <a
href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.17.0"><strong>569
resolved issues</strong></a>
+from <a
href="https://arrow.apache.org/release/0.17.0.html#contributors"><strong>79
distinct contributors</strong></a>. See the Install Page to learn how to
+get the libraries for your platform.</p>
+
+<p>The release notes below are not exhaustive and only expose selected
highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the <a href="https://arrow.apache.org/release/0.17.0.html">complete
changelog</a>.</p>
+
+<h2 id="community">Community</h2>
+
+<p>Since the 0.16.0 release, two committers have joined the Project Management
+Committee (PMC):</p>
+
+<ul>
+ <li><a href="https://github.com/nealrichardson">Neal Richardson</a></li>
+ <li><a href="https://github.com/fsaintjacques">François
Saint-Jacques</a></li>
+</ul>
+
+<p>Thank you for all your contributions!</p>
+
+<h2 id="columnar-format-notes">Columnar Format Notes</h2>
+
+<p>A <a
href="https://arrow.apache.org/docs/format/CDataInterface.html">C-level Data
Interface</a> was designed to ease data sharing inside a single
+process. It allows different runtimes or libraries to share Arrow data using a
+well-known binary layout and metadata representation, without any copies. Third-party
+libraries can use the C interface to import and export the Arrow columnar
+format in-process without requiring any new code dependencies.</p>
+
+<p>The C++ library now includes an implementation of the C Data Interface, and
+Python and R have bindings to that implementation.</p>
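For third-party libraries that want to consume the interface without linking against Arrow, the key structs can be declared locally. As a hypothetical illustration (not the official bindings), the ArrowSchema struct from the interface specification can be mirrored with Python's standard-library ctypes:

```python
import ctypes

class ArrowSchema(ctypes.Structure):
    """ctypes mirror of the ArrowSchema struct from the C Data Interface."""

# The fields reference ArrowSchema itself, so they are assigned after the class.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),    # type encoding, e.g. b"l" for int64
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),      # bit flags; 2 == ARROW_FLAG_NULLABLE
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),   # really void (*release)(struct ArrowSchema*)
    ("private_data", ctypes.c_void_p),
]

# A nullable int64 field named "ints"
schema = ArrowSchema(format=b"l", name=b"ints", flags=2)
```

A real producer would also populate the release callback so consumers can free the schema; it is left as an opaque pointer here for brevity.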
+
+<h2 id="arrow-flight-rpc-notes">Arrow Flight RPC notes</h2>
+
+<ul>
+ <li>Adopted new DoExchange bi-directional data RPC</li>
+ <li>ListFlights supports being passed a Criteria argument in
+Java/C++/Python. This allows applications to search for flights satisfying a
+given query.</li>
+ <li>Custom metadata can be attached to errors that the server sends to the
+client, which can be used to encode richer application-specific
information.</li>
+ <li>A number of minor bugs were fixed, including proper handling of empty
null
+arrays in Java and round-tripping of certain Arrow status codes in
+C++/Python.</li>
+</ul>
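Flight treats Criteria as opaque, application-defined bytes. A minimal sketch of the server-side filtering idea, with a hypothetical substring interpretation of the criteria payload:

```python
def list_flights(flights, criteria=None):
    """Sketch of ListFlights handling an optional Criteria payload.

    Flight passes Criteria through as application-defined bytes; here we
    (hypothetically) interpret them as a substring match on the flight
    descriptor path.
    """
    if criteria is None:
        return list(flights)
    needle = criteria.decode("utf-8")
    return [f for f in flights if needle in f]

flights = ["sales/2020", "inventory/2020", "sales/2019"]
assert list_flights(flights, b"sales") == ["sales/2020", "sales/2019"]
```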
+
+<h2 id="c-notes">C++ notes</h2>
+
+<h3 id="feather-v2">Feather V2</h3>
+
+<p>The “Feather V2” format, based on the Arrow IPC file format, was developed.
+Feather V2 features full support for all Arrow data types, and resolves the 2GB
+per-column limitation for large amounts of string data that the <a
href="https://github.com/wesm/feather">original
+Feather implementation</a> had. Feather V2 also introduces experimental IPC
+message compression using the LZ4 frame format or ZSTD. This will be formalized
+later in the Arrow format.</p>
+
+<h3 id="c-datasets">C++ Datasets</h3>
+
+<ul>
+  <li>Improve speed on high-latency file systems by relaxing discovery
+validation</li>
+ <li>Better performance with Arrow IPC files using column projection</li>
+ <li>Add the ability to list files in FileSystemDataset</li>
+ <li>Add support for Parquet file reader options</li>
+ <li>Support dictionary columns in partition expression</li>
+ <li>Fix various crashes and other issues</li>
+</ul>
+
+<h3 id="c-parquet-notes">C++ Parquet notes</h3>
+
+<ul>
+  <li>Support for writing nested types to Parquet format was completed. The
+legacy code path can still be enabled through a Parquet write option in C++
+and an environment variable in Python. Read support will come in a future
+release.</li>
+  <li>The BYTE_STREAM_SPLIT encoding was implemented for floating-point types.
+It can improve compression efficiency for high-entropy data.</li>
+ <li>Expose Parquet schema field_id as Arrow field metadata</li>
+ <li>Support for DataPageV2 data page format</li>
+</ul>
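The idea behind BYTE_STREAM_SPLIT can be sketched in plain Python (illustrative only, not the C++ implementation): byte k of every fixed-width value is gathered into stream k, so that similar bytes, such as float sign/exponent bytes, end up adjacent and compress better:

```python
import struct

def byte_stream_split_encode(values, width=4):
    """Gather byte j of every value into stream j, then concatenate
    the streams (the BYTE_STREAM_SPLIT layout)."""
    raw = b"".join(values)
    n = len(values)
    return b"".join(
        bytes(raw[k * width + j] for k in range(n)) for j in range(width)
    )

def byte_stream_split_decode(encoded, width=4):
    """Invert the encoding: value k is byte k of each of the streams."""
    n = len(encoded) // width
    return [bytes(encoded[j * n + k] for j in range(width)) for k in range(n)]

# Round-trip three little-endian float32 values
floats = [struct.pack("<f", x) for x in (1.5, -2.25, 3.125)]
encoded = byte_stream_split_encode(floats)
assert byte_stream_split_decode(encoded) == floats
```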
+
+<h3 id="c-build-notes">C++ build notes</h3>
+
+<ul>
+ <li>We continued to make the core C++ library build simpler and faster.
Among the
+improvements are the removal of the dependency on Thrift IDL compiler at
+build time; while Parquet still requires the Thrift runtime C++ library, its
+dependencies are much lighter. We also further reduced the number of build
+configurations that require Boost, and when Boost is needed to be built, we
+only download the components we need, reducing the size of the Boost bundle
+by 90%.</li>
+ <li>Improved support for building on ARM platforms</li>
+ <li>Upgraded LLVM version from 7 to 8</li>
+  <li>Simplified the SIMD build configuration with the ARROW_SIMD_LEVEL option,
+which selects among no SIMD, SSE4.2, AVX2, or AVX512.</li>
+ <li>Fixed a number of bugs affecting compilation on aarch64 platforms</li>
+</ul>
+
+<h3 id="other-c-notes">Other C++ notes</h3>
+
+<ul>
+ <li>Many crashes on invalid input detected by <a
href="https://google.github.io/oss-fuzz/">OSS-Fuzz</a> in the IPC reader and
+in Parquet-Arrow reading were fixed. See our recent <a
href="https://arrow.apache.org/blog/2020/03/31/fuzzing-arrow-ipc/">blog
post</a> for more
+details.</li>
+ <li>A “Device” abstraction was added to simplify buffer management and
movement
+across heterogeneous hardware configurations, e.g. CPUs and GPUs.</li>
+  <li>A streaming CSV reader was implemented, yielding individual RecordBatches
+and helping limit overall memory usage.</li>
+ <li>Array casting from Decimal128 to integer types and to Decimal128 with
+different scale/precision was added.</li>
+ <li>Sparse CSF tensors are now supported.</li>
+ <li>When creating an Array, the null bitmap is not kept if the null count is
known to be zero</li>
+ <li>Compressor support for the LZ4 frame format (LZ4_FRAME) was added</li>
+ <li>An event-driven interface for reading IPC streams was added.</li>
+  <li>Further core APIs that required passing an explicit out-parameter were
+migrated to <code class="highlighter-rouge">Result&lt;T&gt;</code>.</li>
+ <li>New analytics kernels for match, sort indices / argsort, top-k</li>
+</ul>
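The semantics of the new sort-indices and top-k kernels can be sketched in plain Python (the real kernels operate on Arrow arrays in C++; the names below are illustrative):

```python
def sort_indices(values):
    """Return the indices that would sort `values` ascending (argsort)."""
    return sorted(range(len(values)), key=values.__getitem__)

def top_k_indices(values, k):
    """Return the indices of the k largest values."""
    return sorted(range(len(values)), key=values.__getitem__, reverse=True)[:k]

v = [3, 1, 4, 1, 5]
assert sort_indices(v) == [1, 3, 0, 2, 4]
assert top_k_indices(v, 2) == [4, 2]
```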
+
+<h2 id="java-notes">Java notes</h2>
+
+<ul>
+  <li>Netty dependencies were removed from the BufferAllocator and ReferenceManager
+classes. In the future, we plan to move Netty-related classes to a separate
+module.</li>
+ <li>New features were provided to support efficiently appending vector/vector
+schema root values in batch.</li>
+  <li>Comparing a range of values in dense union vectors is now supported.</li>
+ <li>The quick sort algorithm was improved to avoid degenerating to the worst
case.</li>
+</ul>
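One standard way to keep quicksort from degenerating on already-sorted input is a median-of-three pivot; the sketch below illustrates the idea in Python and is not the Java implementation itself:

```python
def quicksort(a):
    """Quicksort with a median-of-three pivot, which avoids the O(n^2)
    degeneration that a first-element pivot hits on sorted input."""
    if len(a) <= 1:
        return list(a)
    # Take the median of the first, middle, and last elements as pivot
    pivot = sorted((a[0], a[len(a) // 2], a[-1]))[1]
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

assert quicksort([5, 1, 4, 2, 3]) == [1, 2, 3, 4, 5]
assert quicksort(list(range(100))) == list(range(100))
```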
+
+<h2 id="python-notes">Python notes</h2>
+
+<h3 id="datasets">Datasets</h3>
+
+<ul>
+ <li>Updated <code class="highlighter-rouge">pyarrow.dataset</code> module
following the changes in the C++ Datasets
+project. This release also adds <a
href="https://arrow.apache.org/docs/python/dataset.html">richer
documentation</a> on the datasets
+module.</li>
+ <li>Support for the improved dataset functionality in
+<code
class="highlighter-rouge">pyarrow.parquet.read_table/ParquetDataset</code>. To
enable, pass
+<code class="highlighter-rouge">use_legacy_dataset=False</code>. Among other
+things, this makes it possible to specify filters on all columns, not only the
+partition keys (using row group statistics), and enables different
+partitioning schemes. See the “note” in the
+<a
href="https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets"><code
class="highlighter-rouge">ParquetDataset</code> documentation</a>.</li>
+</ul>
+
+<h3 id="packaging">Packaging</h3>
+
+<ul>
+ <li>Wheels for Python 3.8 are now available</li>
+ <li>Support for Python 2.7 has been dropped as Python 2.x reached
end-of-life in
+January 2020.</li>
+ <li>Nightly wheels and conda packages are now available for testing or other
+development purposes. See the <a
href="https://arrow.apache.org/docs/python/install.html#installing-nightly-packages">installation
guide</a></li>
+</ul>
+
+<h3 id="other-improvements">Other improvements</h3>
+
+<ul>
+ <li>Conversion to numpy/pandas for FixedSizeList, LargeString,
LargeBinary</li>
+  <li>Support for sparse CSC matrices and sparse CSF tensors was added
+(ARROW-7419, ARROW-7427).</li>
+</ul>
+
+<h2 id="r-notes">R notes</h2>
+
+<p>Highlights include support for the Feather V2 format and the C Data
Interface,
+both described above. Along with low-level bindings for the C interface, this
+release adds tooling to work with Arrow data in Python using <code
class="highlighter-rouge">reticulate</code>. See
+<a href="https://arrow.apache.org/docs/r/articles/python.html"><code
class="highlighter-rouge">vignette("python", package = "arrow")</code></a> for
a guide to getting started.</p>
+
+<p>Installation on Linux now builds the C++ library from source by default.
For a
+faster, richer build, set the environment variable <code
class="highlighter-rouge">NOT_CRAN=true</code>. See
+<a href="https://arrow.apache.org/docs/r/articles/install.html"><code
class="highlighter-rouge">vignette("install", package = "arrow")</code></a> for
details and more options.</p>
+
+<p>For more on what’s in the 0.17 R package, see the <a
href="https://arrow.apache.org/docs/r/news/">R changelog</a>.</p>
+
+<h2 id="ruby-and-c-glib-notes">Ruby and C GLib notes</h2>
+
+<h3 id="ruby">Ruby</h3>
+
+<ul>
+ <li>Support Ruby 2.3 again</li>
+</ul>
+
+<h3 id="c-glib">C GLib</h3>
+
+<ul>
+ <li>Add GArrowRecordBatchIterator</li>
+ <li>Add support for GArrowFilterOptions</li>
+ <li>Add support for Peek() to GIOInputStream</li>
+ <li>Add some metadata bindings to GArrowSchema</li>
+ <li>Add LocalFileSystem support</li>
+  <li>Add support for Parquet writer properties</li>
+ <li>Add support for MapArray</li>
+ <li>Add support for BooleanNode</li>
+</ul>
+
+<h2 id="rust-notes">Rust notes</h2>
+
+<ul>
+  <li>DictionaryArray support.</li>
+ <li>Various improvements to code safety.</li>
+ <li>Filter kernel now supports temporal types.</li>
+</ul>
+
+<h3 id="rust-parquet-notes">Rust Parquet notes</h3>
+
+<ul>
+ <li>Array reader now supports temporal types.</li>
+  <li>Parquet writer now supports custom metadata key/value pairs.</li>
+</ul>
+
+<h3 id="rust-datafusion-notes">Rust DataFusion notes</h3>
+
+<ul>
+ <li>Logical plans can now reference columns by name (as well as by index)
using
+the new <code class="highlighter-rouge">UnresolvedColumn</code> expression.
There is a new optimizer rule to
+resolve these into column indices.</li>
+ <li>Scalar UDFs can now be registered with the execution context and used
from
+logical query plans as well as from SQL. A number of math scalar functions
+have been implemented using this feature (sqrt, cos, sin, tan, asin, acos,
+atan, floor, ceil, round, trunc, abs, signum, exp, log, log2, log10).</li>
+ <li>Various SQL improvements, including support for <code
class="highlighter-rouge">SELECT *</code> and <code
class="highlighter-rouge">SELECT
+COUNT(*)</code>, and improvements to parsing of aggregate queries.</li>
+ <li>Flight examples are provided, with a client that sends a SQL statement
to a
+Flight server and receives the results.</li>
+ <li>The interactive SQL command-line tool now has improved documentation and
+better formatting of query results.</li>
+</ul>
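The optimizer rule's job can be sketched in Python for brevity (DataFusion itself is Rust; the names and plan shapes here are hypothetical): walk the expression tree and rewrite each name-based reference into its schema index:

```python
def resolve_columns(expr, schema):
    """Rewrite name-based column references into index-based ones,
    mimicking an optimizer rule that resolves UnresolvedColumn
    expressions against the plan's input schema."""
    kind = expr[0]
    if kind == "unresolved_column":
        return ("column", schema.index(expr[1]))
    if kind == "binary":
        _, op, left, right = expr
        return ("binary", op,
                resolve_columns(left, schema),
                resolve_columns(right, schema))
    return expr  # literals and already-resolved nodes pass through

schema = ["id", "name", "amount"]
plan = ("binary", "+", ("unresolved_column", "amount"), ("literal", 1))
assert resolve_columns(plan, schema) == ("binary", "+", ("column", 2), ("literal", 1))
```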
+
+<h2 id="project-operations">Project Operations</h2>
+
+<p>We’ve continued our migration of general automation toward GitHub Actions.
The
+majority of our commit-by-commit continuous integration (CI) is now running on
+GitHub Actions. We are working on different solutions for using dedicated
+hardware as part of our CI. The <a href="https://buildkite.com/">Buildkite</a>
self-hosted CI/CD platform is
+now supported on Apache repositories, and GitHub Actions also supports
+self-hosted runners.</p>
+
+
+ </main>
+
+ <hr/>
+<footer class="footer">
+ <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache
Arrow project logo are either registered trademarks or trademarks of The Apache
Software Foundation in the United States and other countries.</p>
+ <p>© 2016-2019 The Apache Software Foundation</p>
+ <script integrity="sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM="
crossorigin="anonymous" type="text/javascript"
src="/assets/main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"></script>
+</footer>
+
+ </div>
+</body>
+</html>
diff --git a/blog/index.html b/blog/index.html
index 40ecefa..1056b3a 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -217,6 +217,21 @@
<p>
<h3>
+ <a href="/blog/2020/04/21/0.17.0-release/">Apache Arrow 0.17.0
Release</a>
+ </h3>
+
+ <p>
+ <span class="blog-list-date">
+ 21 April 2020
+ </span>
+ </p>
+ The Apache Arrow team is pleased to announce the 0.17.0 release. This
covers over 2 months of development work and includes 569 resolved issues from
79 distinct contributors. See the Install Page to learn how to get the
libraries for your platform. The release notes below are not exhaustive and...
+ </p>
+
+
+
+ <p>
+ <h3>
<a href="/blog/2020/03/31/fuzzing-arrow-ipc/">Fuzzing the Arrow C++ IPC
implementation</a>
</h3>
diff --git a/feed.xml b/feed.xml
index 04feb74..b13ac67 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,247 @@
-<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="3.8.4">Jekyll</generator><link
href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml"
/><link href="https://arrow.apache.org/" rel="alternate" type="text/html"
/><updated>2020-04-21T08:23:33-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title
type="html">Apache Arrow</title><subtitle>Apache Arrow is a cross-language
developm [...]
+<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="3.8.4">Jekyll</generator><link
href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml"
/><link href="https://arrow.apache.org/" rel="alternate" type="text/html"
/><updated>2020-04-22T19:55:16-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title
type="html">Apache Arrow</title><subtitle>Apache Arrow is a cross-language
developm [...]
+
+-->
+
+<p>The Apache Arrow team is pleased to announce the 0.17.0 release. This
covers
+over 2 months of development work and includes <a
href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%200.17.0"><strong>569
resolved issues</strong></a>
+from <a
href="https://arrow.apache.org/release/0.17.0.html#contributors"><strong>79
distinct contributors</strong></a>. See the Install Page to learn
how to
+get the libraries for your platform.</p>
+
+<p>The release notes below are not exhaustive and only expose selected
highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the <a
href="https://arrow.apache.org/release/0.17.0.html">complete
changelog</a>.</p>
+
+<h2 id="community">Community</h2>
+
+<p>Since the 0.16.0 release, two committers have joined the Project
Management
+Committee (PMC):</p>
+
+<ul>
+ <li><a href="https://github.com/nealrichardson">Neal
Richardson</a></li>
+ <li><a
href="https://github.com/fsaintjacques">François
Saint-Jacques</a></li>
+</ul>
+
+<p>Thank you for all your contributions!</p>
+
+<h2 id="columnar-format-notes">Columnar Format Notes</h2>
+
+<p>A <a
href="https://arrow.apache.org/docs/format/CDataInterface.html">C-level
Data Interface</a> was designed to ease data sharing inside a single
+process. It allows different runtimes or libraries to share Arrow data using a
+well-known binary layout and metadata representation, without any copies. Third-party
+libraries can use the C interface to import and export the Arrow columnar
+format in-process without requiring any new code dependencies.</p>
+
+<p>The C++ library now includes an implementation of the C Data
Interface, and
+Python and R have bindings to that implementation.</p>
+
+<h2 id="arrow-flight-rpc-notes">Arrow Flight RPC
notes</h2>
+
+<ul>
+ <li>Adopted new DoExchange bi-directional data RPC</li>
+ <li>ListFlights supports being passed a Criteria argument in
+Java/C++/Python. This allows applications to search for flights satisfying a
+given query.</li>
+ <li>Custom metadata can be attached to errors that the server sends to
the
+client, which can be used to encode richer application-specific
information.</li>
+ <li>A number of minor bugs were fixed, including proper handling of
empty null
+arrays in Java and round-tripping of certain Arrow status codes in
+C++/Python.</li>
+</ul>
+
+<h2 id="c-notes">C++ notes</h2>
+
+<h3 id="feather-v2">Feather V2</h3>
+
+<p>The “Feather V2” format based on the Arrow IPC file format was
developed.
+Feather V2 features full support for all Arrow data types, and resolves the 2GB
+per-column limitation for large amounts of string data that the <a
href="https://github.com/wesm/feather">original
+Feather implementation</a> had. Feather V2 also introduces experimental
IPC
+message compression using LZ4 frame format or ZSTD. This will be formalized
+later in the Arrow format.</p>
+
+<h3 id="c-datasets">C++ Datasets</h3>
+
+<ul>
+  <li>Improve speed on high-latency file systems by relaxing discovery
+validation</li>
+ <li>Better performance with Arrow IPC files using column
projection</li>
+ <li>Add the ability to list files in FileSystemDataset</li>
+ <li>Add support for Parquet file reader options</li>
+ <li>Support dictionary columns in partition expression</li>
+ <li>Fix various crashes and other issues</li>
+</ul>
+
+<h3 id="c-parquet-notes">C++ Parquet notes</h3>
+
+<ul>
+  <li>Support for writing nested types to Parquet format was
+completed. The legacy code path can still be enabled through a Parquet write
+option in C++ and an environment variable in Python. Read support will come in
+a future release.</li>
+  <li>The BYTE_STREAM_SPLIT encoding was implemented for floating-point types.
+It can improve compression efficiency for high-entropy data.</li>
+ <li>Expose Parquet schema field_id as Arrow field metadata</li>
+ <li>Support for DataPageV2 data page format</li>
+</ul>
+
+<h3 id="c-build-notes">C++ build notes</h3>
+
+<ul>
+ <li>We continued to make the core C++ library build simpler and
faster. Among the
+improvements are the removal of the dependency on Thrift IDL compiler at
+build time; while Parquet still requires the Thrift runtime C++ library, its
+dependencies are much lighter. We also further reduced the number of build
+configurations that require Boost, and when Boost needs to be built, we
+only download the components we need, reducing the size of the Boost bundle
+by 90%.</li>
+ <li>Improved support for building on ARM platforms</li>
+ <li>Upgraded LLVM version from 7 to 8</li>
+ <li>Simplified SIMD build configuration with the ARROW_SIMD_LEVEL option,
+allowing selection of no SIMD, SSE4.2, AVX2, or AVX512.</li>
+ <li>Fixed a number of bugs affecting compilation on aarch64
platforms</li>
+</ul>
+
+<h3 id="other-c-notes">Other C++ notes</h3>
+
+<ul>
+ <li>Many crashes on invalid input detected by <a
href="https://google.github.io/oss-fuzz/">OSS-Fuzz</a> in
the IPC reader and
+in Parquet-Arrow reading were fixed. See our recent <a
href="https://arrow.apache.org/blog/2020/03/31/fuzzing-arrow-ipc/">blog
post</a> for more
+details.</li>
+ <li>A “Device” abstraction was added to simplify buffer management and
movement
+across heterogeneous hardware configurations, e.g. CPUs and GPUs.</li>
+ <li>A streaming CSV reader was implemented, yielding individual
RecordBatches and
+helping limit overall memory usage.</li>
+ <li>Array casting from Decimal128 to integer types and to Decimal128
with
+different scale/precision was added.</li>
+ <li>Sparse CSF tensors are now supported.</li>
+ <li>When creating an Array, the null bitmap is not kept if the null
count is known to be zero</li>
+ <li>Compressor support for the LZ4 frame format (LZ4_FRAME) was
added</li>
+ <li>An event-driven interface for reading IPC streams was
added.</li>
+ <li>Further core APIs that required passing an explicit out-parameter
were
+migrated to <code
class="highlighter-rouge">Result&lt;T&gt;</code>.</li>
+ <li>New analytics kernels for match, sort indices / argsort,
top-k</li>
+</ul>
+
+<h2 id="java-notes">Java notes</h2>
+
+<ul>
+ <li>Netty dependencies were removed for BufferAllocator and
ReferenceManager
+classes. In the future, we plan to move Netty-related classes to a separate
+module.</li>
+ <li>New features were added to support efficiently appending vector and
+VectorSchemaRoot values in batch.</li>
+ <li>Comparing a range of values in dense union vectors is now
supported.</li>
+ <li>The quick sort algorithm was improved to avoid degenerating to the
worst case.</li>
+</ul>
+
+<h2 id="python-notes">Python notes</h2>
+
+<h3 id="datasets">Datasets</h3>
+
+<ul>
+ <li>Updated <code
class="highlighter-rouge">pyarrow.dataset</code> module
following the changes in the C++ Datasets
+project. This release also adds <a
href="https://arrow.apache.org/docs/python/dataset.html">richer
documentation</a> on the datasets
+module.</li>
+ <li>Support for the improved dataset functionality in
+<code
class="highlighter-rouge">pyarrow.parquet.read_table/ParquetDataset</code>.
To enable, pass
+<code
class="highlighter-rouge">use_legacy_dataset=False</code>.
Among other things, this allows specifying filters
+for all columns and not only the partition keys (using row group statistics)
+and enables different partitioning schemes. See the “note” in the
+<a
href="https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets"><code
class="highlighter-rouge">ParquetDataset</code>
documentation</a>.</li>
+</ul>
+
+<h3 id="packaging">Packaging</h3>
+
+<ul>
+ <li>Wheels for Python 3.8 are now available</li>
+ <li>Support for Python 2.7 has been dropped as Python 2.x reached
end-of-life in
+January 2020.</li>
+ <li>Nightly wheels and conda packages are now available for testing or
other
+development purposes. See the <a
href="https://arrow.apache.org/docs/python/install.html#installing-nightly-packages">installation
guide</a></li>
+</ul>
+
+<h3 id="other-improvements">Other improvements</h3>
+
+<ul>
+ <li>Conversion to numpy/pandas for FixedSizeList, LargeString,
LargeBinary</li>
+ <li>Support for sparse CSC matrices and sparse CSF tensors was added.
(ARROW-7419,
+ARROW-7427)</li>
+</ul>
+
+<h2 id="r-notes">R notes</h2>
+
+<p>Highlights include support for the Feather V2 format and the C Data
Interface,
+both described above. Along with low-level bindings for the C interface, this
+release adds tooling to work with Arrow data in Python using <code
class="highlighter-rouge">reticulate</code>. See
+<a
href="https://arrow.apache.org/docs/r/articles/python.html"><code
class="highlighter-rouge">vignette("python", package =
"arrow")</code></a> for a guide to getting
started.</p>
+
+<p>Installation on Linux now builds the C++ library from source by
default. For a
+faster, richer build, set the environment variable <code
class="highlighter-rouge">NOT_CRAN=true</code>. See
+<a
href="https://arrow.apache.org/docs/r/articles/install.html"><code
class="highlighter-rouge">vignette("install", package =
"arrow")</code></a> for details and more
options.</p>
+
+<p>For more on what’s in the 0.17 R package, see the <a
href="https://arrow.apache.org/docs/r/news/">R
changelog</a>.</p>
+
+<h2 id="ruby-and-c-glib-notes">Ruby and C GLib notes</h2>
+
+<h3 id="ruby">Ruby</h3>
+
+<ul>
+ <li>Support Ruby 2.3 again</li>
+</ul>
+
+<h3 id="c-glib">C GLib</h3>
+
+<ul>
+ <li>Add GArrowRecordBatchIterator</li>
+ <li>Add support for GArrowFilterOptions</li>
+ <li>Add support for Peek() to GIOInputStream</li>
+ <li>Add some metadata bindings to GArrowSchema</li>
+ <li>Add LocalFileSystem support</li>
+ <li>Add support for writer properties of Parquet</li>
+ <li>Add support for MapArray</li>
+ <li>Add support for BooleanNode</li>
+</ul>
+
+<h2 id="rust-notes">Rust notes</h2>
+
+<ul>
+ <li>DictionaryArray support.</li>
+ <li>Various improvements to code safety.</li>
+ <li>Filter kernel now supports temporal types.</li>
+</ul>
+
+<h3 id="rust-parquet-notes">Rust Parquet notes</h3>
+
+<ul>
+ <li>Array reader now supports temporal types.</li>
+ <li>Parquet writer now supports custom meta-data key/value
pairs.</li>
+</ul>
+
+<h3 id="rust-datafusion-notes">Rust DataFusion notes</h3>
+
+<ul>
+ <li>Logical plans can now reference columns by name (as well as by
index) using
+the new <code
class="highlighter-rouge">UnresolvedColumn</code>
expression. There is a new optimizer rule to
+resolve these into column indices.</li>
+ <li>Scalar UDFs can now be registered with the execution context and
used from
+logical query plans as well as from SQL. A number of math scalar functions
+have been implemented using this feature (sqrt, cos, sin, tan, asin, acos,
+atan, floor, ceil, round, trunc, abs, signum, exp, log, log2,
log10).</li>
+ <li>Various SQL improvements, including support for <code
class="highlighter-rouge">SELECT *</code> and <code
class="highlighter-rouge">SELECT
+COUNT(*)</code>, and improvements to parsing of aggregate
queries.</li>
+ <li>Flight examples are provided, with a client that sends a SQL
statement to a
+Flight server and receives the results.</li>
+ <li>The interactive SQL command-line tool now has improved
documentation and
+better formatting of query results.</li>
+</ul>
+
+<h2 id="project-operations">Project Operations</h2>
+
+<p>We’ve continued our migration of general automation toward GitHub
Actions. The
+majority of our commit-by-commit continuous integration (CI) is now running on
+GitHub Actions. We are working on different solutions for using dedicated
+hardware as part of our CI. The <a
href="https://buildkite.com/">Buildkite</a> self-hosted
CI/CD platform is
+now supported on Apache repositories and GitHub Actions also supports
+self-hosted
runners.</p></content><author><name>pmc</name></author><summary
type="html">The Apache Arrow team is pleased to announce the 0.17.0 release.
This covers over 2 months of development work and includes 569 resolved issues
from 79 distinct contributors. See the Install Page to learn how to get the
libraries for your platform. The release notes below are not exhaustive and
only expose selected highlights of the release. Many other bugfixes and
improvements have been made: w [...]
-->
@@ -1780,210 +2023,4 @@ for C++</li>
data messaging use cases</li>
<li><strong>Arrow Columnar Format evolution</strong>: we
are discussing a new “duration” or
“time interval” type and some other additions to the Arrow columnar
format.</li>
-</ul></content><author><name>wesm</name></author><summary
type="html">The Apache Arrow team is pleased to announce the 0.13.0 release.
This covers more than 2 months of development work and includes 550 resolved
issues from 81 distinct contributors. See the Install Page to learn how to get
the libraries for your platform. The complete changelog is also available.
While it’s a large release, this post will give some brief highlights in the
project since the 0.12.0 release from Janua [...]
-
--->
-
-<p>Python users who upgrade to recently released <code
class="highlighter-rouge">pyarrow</code> 0.12 may find that
-their applications use significantly less memory when converting Arrow string
-data to pandas format. This includes using <code
class="highlighter-rouge">pyarrow.parquet.read_table</code>
and
-<code
class="highlighter-rouge">pandas.read_parquet</code>. This
article details some of what is going on under the
-hood, and why Python applications dealing with large amounts of strings are
-prone to memory use problems.</p>
-
-<h2 id="why-python-strings-can-use-a-lot-of-memory">Why Python
strings can use a lot of memory</h2>
-
-<p>Let’s start with some possibly surprising facts. I’m going to create
an empty
-<code class="highlighter-rouge">bytes</code> object and
an empty <code class="highlighter-rouge">str</code>
(unicode) object in Python 3.7:</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [1]: val = b''
-
-In [2]: unicode_val = u''
-</code></pre></div></div>
-
-<p>The <code
class="highlighter-rouge">sys.getsizeof</code> function
accurately reports the number of bytes used by
-built-in Python objects. You might be surprised to find that:</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [4]: import sys
-In [5]: sys.getsizeof(val)
-Out[5]: 33
-
-In [6]: sys.getsizeof(unicode_val)
-Out[6]: 49
-</code></pre></div></div>
-
-<p>Since strings in Python are nul-terminated, we can infer that a bytes
object
-has 32 bytes of overhead while unicode has 48 bytes. One must also account for
-<code class="highlighter-rouge">PyObject*</code> pointer
references to the objects, so the actual overhead is 40 and
-56 bytes, respectively. With large strings and text, this overhead may not
-matter much, but when you have a lot of small strings, such as those arising
-from reading a CSV or Apache Parquet file, they can take up an unexpected
-amount of memory. pandas represents strings in NumPy arrays of <code
class="highlighter-rouge">PyObject*</code>
-pointers, so the total memory used by a unique unicode string is</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>8 (PyObject*) + 48 (Python C struct)
+ string_length + 1
-</code></pre></div></div>
-
-<p>Suppose that we read a CSV file with</p>
-
-<ul>
- <li>1 column</li>
- <li>1 million rows</li>
- <li>Each value in the column is a string with 10 characters</li>
-</ul>
-
-<p>On disk this file would take approximately 10MB. Read into memory,
however, it
-could take up over 60MB, as a 10 character string object takes up 67 bytes in a
-<code
class="highlighter-rouge">pandas.Series</code>.</p>
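That arithmetic can be checked directly with `sys.getsizeof` (a sketch; the exact sizes are CPython-specific and can vary slightly between versions):

```python
import sys

# Per-row footprint of a unique 10-character string in an object column:
# the str struct reported by getsizeof (including the nul terminator)
# plus the 8-byte PyObject* pointer stored in the array.
per_value = 8 + sys.getsizeof("x" * 10)  # ~67 bytes on CPython 3.x

n_rows = 1_000_000
on_disk_mb = n_rows * 11 / 2**20       # 10 chars + newline, ~10.5 MB
in_memory_mb = n_rows * per_value / 2**20
```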
-
-<h2 id="how-apache-arrow-represents-strings">How Apache Arrow
represents strings</h2>
-
-<p>While a Python unicode string can have 57 bytes of overhead, a string
in the
-Arrow columnar format has only 4 (32 bits) or 4.125 (33 bits) bytes of
-overhead. 32-bit integer offsets encode the position and size of a string
-value in a contiguous chunk of memory:</p>
-
-<div align="center">
-<img src="/img/20190205-arrow-string.png" alt="Apache Arrow
string memory layout" width="80%"
class="img-responsive" />
-</div>
-
-<p>When you call <code
class="highlighter-rouge">table.to_pandas()</code> or
<code class="highlighter-rouge">array.to_pandas()</code>
with <code class="highlighter-rouge">pyarrow</code>, we
-have to convert this compact string representation back to pandas’s
-Python-based strings. This can use a huge amount of memory when we have a large
-number of small strings. It is a quite common occurrence when working with web
-analytics data, which compresses to a compact size when stored in the Parquet
-columnar file format.</p>
-
-<p>Note that the Arrow string memory format has other benefits beyond
memory
-use. It is also much more efficient for analytics due to the guarantee of data
-locality; all strings are next to each other in memory. In the case of pandas
-and Python strings, the string data can be located anywhere in the process
-heap. Arrow PMC member Uwe Korn did some work to <a
href="https://www.slideshare.net/xhochy/extending-pandas-using-apache-arrow-and-numba">extend
pandas with Arrow
-string arrays</a> for improved performance and memory use.</p>
-
-<h2
id="reducing-pandas-memory-use-when-converting-from-arrow">Reducing
pandas memory use when converting from Arrow</h2>
-
-<p>For many years, the <code
class="highlighter-rouge">pandas.read_csv</code> function
has relied on a trick to limit
-the amount of string memory allocated. Because pandas uses arrays of
-<code class="highlighter-rouge">PyObject*</code>
pointers to refer to objects in the Python heap, we can avoid
-creating multiple strings with the same value, instead reusing existing objects
-and incrementing their reference counts.</p>
-
-<p>Schematically, we have the following:</p>
-
-<div align="center">
-<img src="/img/20190205-numpy-string.png" alt="pandas string
memory optimization" width="80%"
class="img-responsive" />
-</div>
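The idea can be sketched in pure Python with a small intern table (illustrative only; pyarrow implements the equivalent logic in C++ during the conversion):

```python
def deduplicate(values):
    # Keep one Python object per distinct value and reuse it for every
    # duplicate row, bumping its refcount instead of allocating a new str.
    interned = {}
    return [interned.setdefault(v, v) for v in values]

# 1000 rows but only 10 distinct values; each row below is a fresh object.
rows = ["value_%d" % (i % 10) for i in range(1000)]
deduped = deduplicate(rows)
```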
-
-<p>In <code
class="highlighter-rouge">pyarrow</code> 0.12, we have
implemented this when calling <code
class="highlighter-rouge">to_pandas</code>. It
-requires using a hash table to deduplicate the Arrow string data as it’s being
-converted to pandas. Hashing data is not free, but counterintuitively it can be
-faster in addition to being vastly more memory efficient in the common case in
-analytics where we have table columns with many instances of the same string
-values.</p>
-
-<h2 id="memory-and-performance-benchmarks">Memory and
Performance Benchmarks</h2>
-
-<p>We can use the <a
href="https://pypi.org/project/memory-profiler/"><code
class="highlighter-rouge">memory_profiler</code></a>
Python package to easily get process
-memory usage within a running Python application.</p>
-
-<div class="language-python highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code><span
class="kn">import</span> <span
class="nn">memory_profiler</span>
-<span class="k">def</span> <span
class="nf">mem</span><span
class="p">():</span>
- <span class="k">return</span> <span
class="n">memory_profiler</span><span
class="p">.</span><span
class="n">memory_usage</span><span
class="p">()[</span><span
class="mi">0</span><span
class="p">]</span>
-</code></pre></div></div>
-
-<p>In a new application I have:</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [7]: mem()
-Out[7]: 86.21875
-</code></pre></div></div>
-
-<p>I will generate approximately 1 gigabyte of string data represented as
Python
-strings with length 10. The <code
class="highlighter-rouge">pandas.util.testing</code> module
has a handy <code class="highlighter-rouge">rands</code>
-function for generating random strings. Here is the data generation
function:</p>
-
-<div class="language-python highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code><span
class="kn">from</span> <span
class="nn">pandas.util.testing</span> <span
class="kn">import</span> <span
class="n">rands</span>
-<span class="k">def</span> <span
class="nf">generate_strings</span><span
class="p">(</span><span
class="n">length</span><span
class="p">,</span> <span
class="n">nunique</span><span
class="p">,</span> <span
class="n">string_length</span><span
class="o">=</span><span class="mi">1 [...]
- <span class="n">unique_values</span> <span
class="o">=</span> <span
class="p">[</span><span
class="n">rands</span><span
class="p">(</span><span
class="n">string_length</span><span
class="p">)</span> <span
class="k">for</span> <span
class="n">i</span> <span class="ow">in<
[...]
- <span class="n">values</span> <span
class="o">=</span> <span
class="n">unique_values</span> <span
class="o">*</span> <span
class="p">(</span><span
class="n">length</span> <span
class="o">//</span> <span
class="n">nunique</span><span
class="p">)</span>
- <span class="k">return</span> <span
class="n">values</span>
-</code></pre></div></div>
-
-<p>This generates a certain number of unique strings, then duplicates
them to
-yield the desired number of total strings. So I’m going to create 100 million
-strings with only 10000 unique values:</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [8]: values =
generate_strings(100000000, 10000)
-
-In [9]: mem()
-Out[9]: 852.140625
-</code></pre></div></div>
-
-<p>100 million <code
class="highlighter-rouge">PyObject*</code> values is about
763 MB, so this increase of a little
-under 770 MB is consistent with what we know so far. Now I’m going to convert
-this to Arrow format:</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [11]: arr = pa.array(values)
-
-In [12]: mem()
-Out[12]: 2276.9609375
-</code></pre></div></div>
-
-<p>Since <code
class="highlighter-rouge">pyarrow</code> exactly accounts
for all of its memory allocations, we also
-check that</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [13]: pa.total_allocated_bytes()
-Out[13]: 1416777280
-</code></pre></div></div>
-
-<p>Since each string takes about 14 bytes (10 bytes plus 4 bytes of
overhead),
-this is what we expect.</p>
-
-<p>Now, converting <code
class="highlighter-rouge">arr</code> back to pandas is where
things get tricky. The <em>minimum</em>
-amount of memory that pandas can use is a little under 800 MB, as above, since we
-need 100 million <code
class="highlighter-rouge">PyObject*</code> values, which are
8 bytes each.</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [14]: arr_as_pandas =
arr.to_pandas()
-
-In [15]: mem()
-Out[15]: 3041.78125
-</code></pre></div></div>
-
-<p>Doing the math, we used 765 MB which seems right. We can disable the
string
-deduplication logic by passing <code
class="highlighter-rouge">deduplicate_objects=False</code>
to <code
class="highlighter-rouge">to_pandas</code>:</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [16]: arr_as_pandas_no_dedup =
arr.to_pandas(deduplicate_objects=False)
-
-In [17]: mem()
-Out[17]: 10006.95703125
-</code></pre></div></div>
-
-<p>Without object deduplication, we use 6965 megabytes, or an average of
73 bytes
-per value. This is a little bit higher than the theoretical size of 67 bytes
-computed above.</p>
-
-<p>One of the more surprising results is that the new behavior is about
twice as fast:</p>
-
-<div class="highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code>In [18]: %time arr_as_pandas_time =
arr.to_pandas()
-CPU times: user 2.94 s, sys: 213 ms, total: 3.15 s
-Wall time: 3.14 s
-
-In [19]: %time arr_as_pandas_no_dedup_time =
arr.to_pandas(deduplicate_objects=False)
-CPU times: user 4.19 s, sys: 2.04 s, total: 6.23 s
-Wall time: 6.21 s
-</code></pre></div></div>
-
-<p>The reason for this is that creating so many Python objects is more
expensive
-than hashing the 10 byte values and looking them up in a hash table.</p>
-
-<p>Note that when you convert Arrow data with mostly unique values back
to pandas,
-the memory use benefits here won’t have as much of an impact.</p>
-
-<h2 id="takeaways">Takeaways</h2>
-
-<p>In Apache Arrow, our goal is to develop computational tools to
operate natively
-on the cache- and SIMD-friendly efficient Arrow columnar format. In the
-meantime, though, we recognize that users have legacy applications using the
-native memory layout of pandas or other analytics tools. We will do our best to
-provide fast and memory-efficient interoperability with pandas and other
-popular
libraries.</p></content><author><name>wesm</name></author><summary
type="html">Python users who upgrade to recently released pyarrow 0.12 may find
that their applications use significantly less memory when converting Arrow
string data to pandas format. This includes using pyarrow.parquet.read_table
and pandas.read_parquet. This article details some of what is going on under
the hood, and why Python applications dealing with large amounts of strings are
prone to memory use p [...]
\ No newline at end of file
+</ul></content><author><name>wesm</name></author><summary
type="html">The Apache Arrow team is pleased to announce the 0.13.0 release.
This covers more than 2 months of development work and includes 550 resolved
issues from 81 distinct contributors. See the Install Page to learn how to get
the libraries for your platform. The complete changelog is also available.
While it’s a large release, this post will give some brief highlights in the
project since the 0.12.0 release from Janua [...]
\ No newline at end of file