This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new dacfa5d Updating built site (build
92577cea17b5ecc40cbc0ee1bdbb9c26dc15d38f)
dacfa5d is described below
commit dacfa5dca3bc91da9baad6aee6d4be3e7fe63918
Author: Antoine Pitrou <[email protected]>
AuthorDate: Tue Apr 7 17:37:38 2020 +0000
Updating built site (build 92577cea17b5ecc40cbc0ee1bdbb9c26dc15d38f)
---
...manifest-0a234a25acb2cb99ac3a783e67850385.json} | 2 +-
blog/2020/03/31/fuzzing-arrow-ipc/index.html | 314 +++++++++++++++++++++
blog/index.html | 92 ++++++
feed.xml | 162 +++++------
4 files changed, 472 insertions(+), 98 deletions(-)
diff --git a/assets/.sprockets-manifest-7cda7bb14183b60d9367745093bd155f.json
b/assets/.sprockets-manifest-0a234a25acb2cb99ac3a783e67850385.json
similarity index 79%
rename from assets/.sprockets-manifest-7cda7bb14183b60d9367745093bd155f.json
rename to assets/.sprockets-manifest-0a234a25acb2cb99ac3a783e67850385.json
index 738a3f4..1f85ee8 100644
--- a/assets/.sprockets-manifest-7cda7bb14183b60d9367745093bd155f.json
+++ b/assets/.sprockets-manifest-0a234a25acb2cb99ac3a783e67850385.json
@@ -1 +1 @@
-{"files":{"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js":{"logical_path":"main.js","mtime":"2020-04-02T17:30:15-04:00","size":124531,"digest":"18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33","integrity":"sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM="}},"assets":{"main.js":"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"}}
\ No newline at end of file
+{"files":{"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js":{"logical_path":"main.js","mtime":"2020-04-07T13:37:29-04:00","size":124531,"digest":"18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33","integrity":"sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM="}},"assets":{"main.js":"main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"}}
\ No newline at end of file
diff --git a/blog/2020/03/31/fuzzing-arrow-ipc/index.html
b/blog/2020/03/31/fuzzing-arrow-ipc/index.html
new file mode 100644
index 0000000..12a1a97
--- /dev/null
+++ b/blog/2020/03/31/fuzzing-arrow-ipc/index.html
@@ -0,0 +1,314 @@
+<!DOCTYPE html>
+<html lang="en-US">
+ <head>
+ <meta charset="UTF-8">
+ <meta http-equiv="X-UA-Compatible" content="IE=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <!-- The above meta tags *must* come first in the head; any other head
content must come *after* these tags -->
+
+ <title>Fuzzing the Arrow C++ IPC implementation | Apache Arrow</title>
+
+
+ <!-- Begin Jekyll SEO tag v2.6.1 -->
+<meta name="generator" content="Jekyll v3.8.4" />
+<meta property="og:title" content="Fuzzing the Arrow C++ IPC implementation" />
+<meta name="author" content="apitrou" />
+<meta property="og:locale" content="en_US" />
+<meta name="description" content="We have set up continuous fuzzing for the
Arrow C++ IPC reader. This helped us find and correct several issues where
missing input validation would lead to crashes or undefined behaviour." />
+<meta property="og:description" content="We have set up continuous fuzzing for
the Arrow C++ IPC reader. This helped us find and correct several issues where
missing input validation would lead to crashes or undefined behaviour." />
+<link rel="canonical"
href="https://arrow.apache.org/blog/2020/03/31/fuzzing-arrow-ipc/" />
+<meta property="og:url"
content="https://arrow.apache.org/blog/2020/03/31/fuzzing-arrow-ipc/" />
+<meta property="og:site_name" content="Apache Arrow" />
+<meta property="og:image" content="https://arrow.apache.org/img/arrow.png" />
+<meta property="og:type" content="article" />
+<meta property="article:published_time" content="2020-03-31T19:00:00-04:00" />
+<meta name="twitter:card" content="summary_large_image" />
+<meta property="twitter:image"
content="https://arrow.apache.org/img/arrow.png" />
+<meta property="twitter:title" content="Fuzzing the Arrow C++ IPC
implementation" />
+<meta name="twitter:site" content="@ApacheArrow" />
+<meta name="twitter:creator" content="@apitrou" />
+<script type="application/ld+json">
+{"@type":"BlogPosting","headline":"Fuzzing the Arrow C++ IPC
implementation","dateModified":"2020-03-31T19:00:00-04:00","datePublished":"2020-03-31T19:00:00-04:00","mainEntityOfPage":{"@type":"WebPage","@id":"https://arrow.apache.org/blog/2020/03/31/fuzzing-arrow-ipc/"},"publisher":{"@type":"Organization","logo":{"@type":"ImageObject","url":"https://arrow.apache.org/img/logo.png"},"name":"apitrou"},"author":{"@type":"Person","name":"apitrou"},"image":"https://arrow.apache.org/img/arrow.p
[...]
+<!-- End Jekyll SEO tag -->
+
+
+ <!-- favicons -->
+ <link rel="icon" type="image/png" sizes="16x16"
href="/img/favicon-16x16.png" id="light1">
+ <link rel="icon" type="image/png" sizes="32x32"
href="/img/favicon-32x32.png" id="light2">
+ <link rel="apple-touch-icon" type="image/png" sizes="180x180"
href="/img/apple-touch-icon.png" id="light3">
+ <link rel="apple-touch-icon" type="image/png" sizes="120x120"
href="/img/apple-touch-icon-120x120.png" id="light4">
+ <link rel="apple-touch-icon" type="image/png" sizes="76x76"
href="/img/apple-touch-icon-76x76.png" id="light5">
+ <link rel="apple-touch-icon" type="image/png" sizes="60x60"
href="/img/apple-touch-icon-60x60.png" id="light6">
+ <!-- dark mode favicons -->
+ <link rel="icon" type="image/png" sizes="16x16"
href="/img/favicon-16x16-dark.png" id="dark1">
+ <link rel="icon" type="image/png" sizes="32x32"
href="/img/favicon-32x32-dark.png" id="dark2">
+ <link rel="apple-touch-icon" type="image/png" sizes="180x180"
href="/img/apple-touch-icon-dark.png" id="dark3">
+ <link rel="apple-touch-icon" type="image/png" sizes="120x120"
href="/img/apple-touch-icon-120x120-dark.png" id="dark4">
+ <link rel="apple-touch-icon" type="image/png" sizes="76x76"
href="/img/apple-touch-icon-76x76-dark.png" id="dark5">
+ <link rel="apple-touch-icon" type="image/png" sizes="60x60"
href="/img/apple-touch-icon-60x60-dark.png" id="dark6">
+
+ <script>
+ // Switch to the dark-mode favicons if prefers-color-scheme: dark
+ function onUpdate() {
+ light1 = document.querySelector('link#light1');
+ light2 = document.querySelector('link#light2');
+ light3 = document.querySelector('link#light3');
+ light4 = document.querySelector('link#light4');
+ light5 = document.querySelector('link#light5');
+ light6 = document.querySelector('link#light6');
+
+ dark1 = document.querySelector('link#dark1');
+ dark2 = document.querySelector('link#dark2');
+ dark3 = document.querySelector('link#dark3');
+ dark4 = document.querySelector('link#dark4');
+ dark5 = document.querySelector('link#dark5');
+ dark6 = document.querySelector('link#dark6');
+
+ if (matcher.matches) {
+ light1.remove();
+ light2.remove();
+ light3.remove();
+ light4.remove();
+ light5.remove();
+ light6.remove();
+ document.head.append(dark1);
+ document.head.append(dark2);
+ document.head.append(dark3);
+ document.head.append(dark4);
+ document.head.append(dark5);
+ document.head.append(dark6);
+ } else {
+ dark1.remove();
+ dark2.remove();
+ dark3.remove();
+ dark4.remove();
+ dark5.remove();
+ dark6.remove();
+ document.head.append(light1);
+ document.head.append(light2);
+ document.head.append(light3);
+ document.head.append(light4);
+ document.head.append(light5);
+ document.head.append(light6);
+ }
+ }
+ matcher = window.matchMedia('(prefers-color-scheme: dark)');
+ matcher.addListener(onUpdate);
+ onUpdate();
+ </script>
+
+ <link rel="stylesheet"
href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+ <link href="/css/main.css" rel="stylesheet">
+ <link href="/css/syntax.css" rel="stylesheet">
+ <script src="https://code.jquery.com/jquery-3.3.1.slim.min.js"
integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo"
crossorigin="anonymous"></script>
+ <script
src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.3/umd/popper.min.js"
integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49"
crossorigin="anonymous"></script>
+
+ <!-- Global Site Tag (gtag.js) - Google Analytics -->
+<script async
src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script>
+<script>
+ window.dataLayer = window.dataLayer || [];
+ function gtag(){dataLayer.push(arguments)};
+ gtag('js', new Date());
+
+ gtag('config', 'UA-107500873-1');
+</script>
+
+
+ </head>
+
+
+<body class="wrap">
+ <header>
+ <nav class="navbar navbar-expand-md navbar-dark bg-dark">
+ <a class="navbar-brand" href="/"><img src="/img/arrow-inverse-300px.png"
height="60px"/></a>
+ <button class="navbar-toggler" type="button" data-toggle="collapse"
data-target="#arrow-navbar" aria-controls="arrow-navbar" aria-expanded="false"
aria-label="Toggle navigation">
+ <span class="navbar-toggler-icon"></span>
+ </button>
+
+ <!-- Collect the nav links, forms, and other content for toggling -->
+ <div class="collapse navbar-collapse" id="arrow-navbar">
+ <ul class="nav navbar-nav">
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownProjectLinks" role="button"
data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ Project Links
+ </a>
+ <div class="dropdown-menu"
aria-labelledby="navbarDropdownProjectLinks">
+ <a class="dropdown-item" href="/install/">Installation</a>
+ <a class="dropdown-item" href="/release/">Releases</a>
+ <a class="dropdown-item" href="/faq/">FAQ</a>
+ <a class="dropdown-item" href="/blog/">Blog</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow">Source Code</a>
+ <a class="dropdown-item"
href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a>
+ </div>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownCommunity" role="button" data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ Community
+ </a>
+ <div class="dropdown-menu" aria-labelledby="navbarDropdownCommunity">
+ <a class="dropdown-item"
href="http://mail-archives.apache.org/mod_mbox/arrow-user/">User Mailing
List</a>
+ <a class="dropdown-item"
href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Dev Mailing List</a>
+ <a class="dropdown-item"
href="https://cwiki.apache.org/confluence/display/ARROW">Developer Wiki</a>
+ <a class="dropdown-item" href="/committers/">Committers</a>
+ <a class="dropdown-item" href="/powered_by/">Powered By</a>
+ </div>
+ </li>
+ <li class="nav-item">
+ <a class="nav-link" href="/docs/format/Columnar.html"
+ role="button" aria-haspopup="true" aria-expanded="false">
+ Specification
+ </a>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownDocumentation" role="button"
data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ Documentation
+ </a>
+ <div class="dropdown-menu"
aria-labelledby="navbarDropdownDocumentation">
+ <a class="dropdown-item" href="/docs">Project Docs</a>
+ <a class="dropdown-item" href="/docs/python">Python</a>
+ <a class="dropdown-item" href="/docs/cpp">C++</a>
+ <a class="dropdown-item" href="/docs/java">Java</a>
+ <a class="dropdown-item" href="/docs/c_glib">C GLib</a>
+ <a class="dropdown-item" href="/docs/js">JavaScript</a>
+ <a class="dropdown-item" href="/docs/r">R</a>
+ </div>
+ </li>
+ <!-- <li><a href="/blog">Blog</a></li> -->
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
+ id="navbarDropdownASF" role="button" data-toggle="dropdown"
+ aria-haspopup="true" aria-expanded="false">
+ ASF Links
+ </a>
+ <div class="dropdown-menu" aria-labelledby="navbarDropdownASF">
+ <a class="dropdown-item" href="http://www.apache.org/">ASF
Website</a>
+ <a class="dropdown-item"
href="http://www.apache.org/licenses/">License</a>
+ <a class="dropdown-item"
href="http://www.apache.org/foundation/sponsorship.html">Donate</a>
+ <a class="dropdown-item"
href="http://www.apache.org/foundation/thanks.html">Thanks</a>
+ <a class="dropdown-item"
href="http://www.apache.org/security/">Security</a>
+ </div>
+ </li>
+ </ul>
+ <div class="flex-row justify-content-end ml-md-auto">
+ <a class="d-sm-none d-md-inline pr-2"
href="https://www.apache.org/events/current-event.html">
+ <img src="https://www.apache.org/events/current-event-234x60.png"/>
+ </a>
+ <a href="http://www.apache.org/">
+ <img src="/img/asf_logo.svg" width="120px"/>
+ </a>
+ </div>
+ </div><!-- /.navbar-collapse -->
+ </div>
+ </nav>
+
+ </header>
+
+ <div class="container p-lg-4">
+ <main role="main">
+
+
+
+<h1>
+ Fuzzing the Arrow C++ IPC implementation
+ <a href="/blog/2020/03/31/fuzzing-arrow-ipc/" class="permalink"
title="Permalink">∞</a>
+</h1>
+
+
+
+<p>
+ <span class="badge badge-secondary">Published</span>
+ <span class="published">
+ 31 Mar 2020
+ </span>
+ <br />
+ <span class="badge badge-secondary">By</span>
+
+ <a href="http://github.com/pitrou">Antoine Pitrou (apitrou) </a>
+
+</p>
+
+
+ <!--
+
+-->
+
+<p>Apache Arrow aims to allow fast and seamless data interchange between
+heterogeneous runtimes and environments. Whether using the columnar
+<a href="https://arrow.apache.org/docs/format/Columnar.html">IPC stream
protocol</a>,
+the <a href="https://arrow.apache.org/docs/format/Flight.html">Flight</a> RPC
layer,
+the Feather file format, the
+<a href="https://arrow.apache.org/docs/python/plasma.html">Plasma</a> shared
object
+store, or any application-specific data distribution mechanism, Arrow IPC
+implementations may try to decode data from untrusted input. While it is ok
+to report an error in that case, Arrow shouldn’t crash or engage in risky
+behaviour while reading such data.</p>
+
+<p>To validate the robustness of the Arrow C++ IPC reader (which also underlies
+the Python, C/GLib, R and Ruby bindings), we
+<a href="https://github.com/google/oss-fuzz/pull/3233">successfully
submitted</a>
+the Arrow project to OSS-Fuzz, a continuous fuzzing initiative for critical
+open source projects, provided by Google.</p>
+
+<h2 id="what-is-being-fuzzed">What is being fuzzed</h2>
+
+<p>As of this writing, the <code
class="highlighter-rouge">RecordBatchStreamReader</code> and <code
class="highlighter-rouge">RecordBatchFileReader</code>
+C++ classes are being fuzzed by feeding them data generated by the fuzzer.</p>
+
+<p>When a record batch is successfully read by one of those classes, the
+fuzzing setup then validates it using <code
class="highlighter-rouge">RecordBatch::ValidateFull</code>. This
+method can either succeed or fail, but it shouldn’t crash.</p>
+
+<p>By ensuring that reading a record batch from IPC, then validating it, always
+shows deterministic behaviour, we hope to make it relatively safe to ingest
+Arrow IPC data coming from untrusted sources.</p>
+
+<p>(Of course, it is still recommended that security-critical applications
+ use cryptographic means of authentication and integrity control – for
+ example, by enabling TLS with the Flight RPC protocol.)</p>
+
+<h2 id="how-we-help-the-fuzzer-find-problems">How we help the fuzzer find
problems</h2>
+
+<p>Fuzzing is a brute-force process that tries to devise invalid data to
+exercise an implementation’s response. By default, the fuzzer does not know
+anything about the data representation expected by the program under test.
+Fuzzing can therefore be extremely inefficient, testing tons of uninteresting
+variations while missing critical ones.</p>
+
+<p>To help guide the fuzzing process, we added a seed corpus of valid Arrow IPC
+files with various data types. By starting from this data and mutating it to
+find invalid variations, OSS-Fuzz was able to find tens of issues with data
+validation. All of them have been fixed. As of this writing, no new issue
+has been found in the IPC layer since March 4th, 2020.</p>
+
+<h2 id="what-comes-next">What comes next</h2>
+
+<p>Of course, we still monitor OSS-Fuzz for any new problem that could be found
+in the C++ IPC implementation. Such problems might for example appear when
adding
+features to the Arrow <a
href="https://arrow.apache.org/docs/format/Columnar.html">IPC format</a>.</p>
+
+<p>We have started fuzzing the Parquet C++ implementation. Several issues have
+been found and fixed, but more are still coming. We hope to stabilize the
+situation in the next month or two.</p>
+
+<p>The tensor and sparse tensor IPC read paths are not being exercised yet.
+They will be once a motivated core developer wants to own the topic.</p>
+
+ </main>
+
+ <hr/>
+<footer class="footer">
+ <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache
Arrow project logo are either registered trademarks or trademarks of The Apache
Software Foundation in the United States and other countries.</p>
+ <p>© 2016-2019 The Apache Software Foundation</p>
+ <script type="text/javascript"
src="/assets/main-18cd3029557f73c1ee82e41113127b04f6fcd84c56d9db0cb9c40ebe26ef6e33.js"
integrity="sha256-GM0wKVV/c8HuguQRExJ7BPb82ExW2dsMucQOvibvbjM="
crossorigin="anonymous"></script>
+</footer>
+
+ </div>
+</body>
+</html>
diff --git a/blog/index.html b/blog/index.html
index ae8c122..4458554 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -219,6 +219,98 @@
<div class="blog-post" style="margin-bottom: 4rem">
<h1>
+ Fuzzing the Arrow C++ IPC implementation
+ <a href="/blog/2020/03/31/fuzzing-arrow-ipc/" class="permalink"
title="Permalink">∞</a>
+</h1>
+
+
+
+<p>
+ <span class="badge badge-secondary">Published</span>
+ <span class="published">
+ 31 Mar 2020
+ </span>
+ <br />
+ <span class="badge badge-secondary">By</span>
+
+ <a href="http://github.com/pitrou">Antoine Pitrou (apitrou) </a>
+
+</p>
+
+ <!--
+
+-->
+
+<p>Apache Arrow aims to allow fast and seamless data interchange between
+heterogeneous runtimes and environments. Whether using the columnar
+<a href="https://arrow.apache.org/docs/format/Columnar.html">IPC stream
protocol</a>,
+the <a href="https://arrow.apache.org/docs/format/Flight.html">Flight</a> RPC
layer,
+the Feather file format, the
+<a href="https://arrow.apache.org/docs/python/plasma.html">Plasma</a> shared
object
+store, or any application-specific data distribution mechanism, Arrow IPC
+implementations may try to decode data from untrusted input. While it is ok
+to report an error in that case, Arrow shouldn’t crash or engage in risky
+behaviour while reading such data.</p>
+
+<p>To validate the robustness of the Arrow C++ IPC reader (which also underlies
+the Python, C/GLib, R and Ruby bindings), we
+<a href="https://github.com/google/oss-fuzz/pull/3233">successfully
submitted</a>
+the Arrow project to OSS-Fuzz, a continuous fuzzing initiative for critical
+open source projects, provided by Google.</p>
+
+<h2 id="what-is-being-fuzzed">What is being fuzzed</h2>
+
+<p>As of this writing, the <code
class="highlighter-rouge">RecordBatchStreamReader</code> and <code
class="highlighter-rouge">RecordBatchFileReader</code>
+C++ classes are being fuzzed by feeding them data generated by the fuzzer.</p>
+
+<p>When a record batch is successfully read by one of those classes, the
+fuzzing setup then validates it using <code
class="highlighter-rouge">RecordBatch::ValidateFull</code>. This
+method can either succeed or fail, but it shouldn’t crash.</p>
+
+<p>By ensuring that reading a record batch from IPC, then validating it, always
+shows deterministic behaviour, we hope to make it relatively safe to ingest
+Arrow IPC data coming from untrusted sources.</p>
+
+<p>(Of course, it is still recommended that security-critical applications
+ use cryptographic means of authentication and integrity control – for
+ example, by enabling TLS with the Flight RPC protocol.)</p>
+
+<h2 id="how-we-help-the-fuzzer-find-problems">How we help the fuzzer find
problems</h2>
+
+<p>Fuzzing is a brute-force process that tries to devise invalid data to
+exercise an implementation’s response. By default, the fuzzer does not know
+anything about the data representation expected by the program under test.
+Fuzzing can therefore be extremely inefficient, testing tons of uninteresting
+variations while missing critical ones.</p>
+
+<p>To help guide the fuzzing process, we added a seed corpus of valid Arrow IPC
+files with various data types. By starting from this data and mutating it to
+find invalid variations, OSS-Fuzz was able to find tens of issues with data
+validation. All of them have been fixed. As of this writing, no new issue
+has been found in the IPC layer since March 4th, 2020.</p>
+
+<h2 id="what-comes-next">What comes next</h2>
+
+<p>Of course, we still monitor OSS-Fuzz for any new problem that could be found
+in the C++ IPC implementation. Such problems might for example appear when
adding
+features to the Arrow <a
href="https://arrow.apache.org/docs/format/Columnar.html">IPC format</a>.</p>
+
+<p>We have started fuzzing the Parquet C++ implementation. Several issues have
+been found and fixed, but more are still coming. We hope to stabilize the
+situation in the next month or two.</p>
+
+<p>The tensor and sparse tensor IPC read paths are not being exercised yet.
+They will be once a motivated core developer wants to own the topic.</p>
+
+ </div>
+
+
+
+
+
+ <div class="blog-post" style="margin-bottom: 4rem">
+
+<h1>
Apache Arrow 0.16.0 Release
<a href="/blog/2020/02/12/0.16.0-release/" class="permalink"
title="Permalink">∞</a>
</h1>
diff --git a/feed.xml b/feed.xml
index 19eaf27..feff006 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,67 @@
-<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="3.8.4">Jekyll</generator><link
href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml"
/><link href="https://arrow.apache.org/" rel="alternate" type="text/html"
/><updated>2020-04-02T17:30:10-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title
type="html">Apache Arrow</title><subtitle>Apache Arrow is a cross-language
developm [...]
+<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="3.8.4">Jekyll</generator><link
href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml"
/><link href="https://arrow.apache.org/" rel="alternate" type="text/html"
/><updated>2020-04-07T13:37:24-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title
type="html">Apache Arrow</title><subtitle>Apache Arrow is a cross-language
developm [...]
+
+-->
+
+<p>Apache Arrow aims to allow fast and seamless data interchange between
+heterogeneous runtimes and environments. Whether using the columnar
+<a
href="https://arrow.apache.org/docs/format/Columnar.html">IPC
stream protocol</a>,
+the <a
href="https://arrow.apache.org/docs/format/Flight.html">Flight</a>
RPC layer,
+the Feather file format, the
+<a
href="https://arrow.apache.org/docs/python/plasma.html">Plasma</a>
shared object
+store, or any application-specific data distribution mechanism, Arrow IPC
+implementations may try to decode data from untrusted input. While it is ok
+to report an error in that case, Arrow shouldn’t crash or engage in risky
+behaviour while reading such data.</p>
+
+<p>To validate the robustness of the Arrow C++ IPC reader (which also
underlies
+the Python, C/GLib, R and Ruby bindings), we
+<a
href="https://github.com/google/oss-fuzz/pull/3233">successfully
submitted</a>
+the Arrow project to OSS-Fuzz, a continuous fuzzing initiative for critical
+open source projects, provided by Google.</p>
+
+<h2 id="what-is-being-fuzzed">What is being fuzzed</h2>
+
+<p>As of this writing, the <code
class="highlighter-rouge">RecordBatchStreamReader</code> and
<code
class="highlighter-rouge">RecordBatchFileReader</code>
+C++ classes are being fuzzed by feeding them data generated by the
fuzzer.</p>
+
+<p>When a record batch is successfully read by one of those classes, the
+fuzzing setup then validates it using <code
class="highlighter-rouge">RecordBatch::ValidateFull</code>.
This
+method can either succeed or fail, but it shouldn’t crash.</p>
+
+<p>By ensuring that reading a record batch from IPC, then validating it,
always
+shows deterministic behaviour, we hope to make it relatively safe to ingest
+Arrow IPC data coming from untrusted sources.</p>
+
+<p>(Of course, it is still recommended that security-critical applications
+ use cryptographic means of authentication and integrity control – for
+ example, by enabling TLS with the Flight RPC protocol.)</p>
+
+<h2 id="how-we-help-the-fuzzer-find-problems">How we help the
fuzzer find problems</h2>
+
+<p>Fuzzing is a brute-force process that tries to devise invalid data to
+exercise an implementation’s response. By default, the fuzzer does not know
+anything about the data representation expected by the program under test.
+Fuzzing can therefore be extremely inefficient, testing tons of uninteresting
+variations while missing critical ones.</p>
+
+<p>To help guide the fuzzing process, we added a seed corpus of valid
Arrow IPC
+files with various data types. By starting from this data and mutating it to
+find invalid variations, OSS-Fuzz was able to find tens of issues with data
+validation. All of them have been fixed. As of this writing, no new issue
+has been found in the IPC layer since March 4th, 2020.</p>
+
+<h2 id="what-comes-next">What comes next</h2>
+
+<p>Of course, we still monitor OSS-Fuzz for any new problem that could
be found
+in the C++ IPC implementation. Such problems might for example appear when
adding
+features to the Arrow <a
href="https://arrow.apache.org/docs/format/Columnar.html">IPC
format</a>.</p>
+
+<p>We have started fuzzing the Parquet C++ implementation. Several
issues have
+been found and fixed, but more are still coming. We hope to stabilize the
+situation in the next month or two.</p>
+
+<p>The tensor and sparse tensor IPC read paths are not being exercised
yet.
+They will be once a motivated core developer wants to own the
topic.</p></content><author><name>apitrou</name></author><media:thumbnail
xmlns:media="http://search.yahoo.com/mrss/"
url="https://arrow.apache.org/img/arrow.png" /><media:content medium="image"
url="https://arrow.apache.org/img/arrow.png"
xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title
type="html">Apache Arrow 0.16.0 Release</title><link
href="https://arrow.apache.org/blog/2020/02/12/0.16.0-release/" [...]
-->
@@ -1927,99 +1990,4 @@ on the cache- and SIMD-friendly efficient Arrow columnar
format. In the
meantime, though, we recognize that users have legacy applications using the
native memory layout of pandas or other analytics tools. We will do our best to
provide fast and memory-efficient interoperability with pandas and other
-popular
libraries.</p></content><author><name>wesm</name></author><media:thumbnail
xmlns:media="http://search.yahoo.com/mrss/"
url="https://arrow.apache.org/img/arrow.png" /><media:content medium="image"
url="https://arrow.apache.org/img/arrow.png"
xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title
type="html">DataFusion: A Rust-native Query Engine for Apache
Arrow</title><link
href="https://arrow.apache.org/blog/2019/02/04/datafusion-donation/"
rel="alternate" typ [...]
-
--->
-
-<p>We are excited to announce that <a
href="https://github.com/apache/arrow/tree/master/rust/datafusion">DataFusion</a>
has been donated to the Apache Arrow project. DataFusion is an in-memory query
engine for the Rust implementation of Apache Arrow.</p>
-
-<p>Although DataFusion was started two years ago, it was recently
re-implemented to be Arrow-native and currently has limited capabilities but
does support SQL queries against iterators of RecordBatch and has support for
CSV files. There are plans to <a
href="https://issues.apache.org/jira/browse/ARROW-4466">add
support for Parquet files</a>.</p>
-
-<p>SQL support is limited to projection (<code
class="highlighter-rouge">SELECT</code>), selection
(<code class="highlighter-rouge">WHERE</code>), and
simple aggregates (<code
class="highlighter-rouge">MIN</code>, <code
class="highlighter-rouge">MAX</code>, <code
class="highlighter-rouge">SUM</code>) with an optional
<code class="highlighter-rouge">GROUP BY& [...]
-
-<p>Supported expressions are identifiers, literals, simple math
operations (<code class="highlighter-rouge">+</code>,
<code class="highlighter-rouge">-</code>, <code
class="highlighter-rouge">*</code>, <code
class="highlighter-rouge">/</code>), binary expressions
(<code class="highlighter-rouge">AND</code>, <code
class="highlighter-rouge">OR</code>), e [...]
-
-<h2 id="example">Example</h2>
-
-<p>The following example demonstrates running a simple aggregate SQL
query against a CSV file.</p>
-
-<div class="language-rust highlighter-rouge"><div
class="highlight"><pre
class="highlight"><code><span class="c">//
create execution context</span>
-<span class="k">let</span> <span
class="k">mut</span> <span
class="n">ctx</span> <span
class="o">=</span> <span
class="nn">ExecutionContext</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">();</span>
-
-<span class="c">// define schema for data source (csv
file)</span>
-<span class="k">let</span> <span
class="n">schema</span> <span
class="o">=</span> <span
class="nn">Arc</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="nn">Schema</span><span
class="p">::</span><span
class="nf">new</span><s [...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c1"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">Utf8</span><span class="p">,</s
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c2"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">UInt32</span><span class="p">,<
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c3"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">Int8</span><span class="p">,</s
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c4"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">Int16</span><span class="p">,</
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c5"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">Int32</span><span class="p">,</
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c6"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">Int64</span><span class="p">,</
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c7"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">UInt8</span><span class="p">,</
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c8"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">UInt16</span><span class="p">,<
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c9"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">UInt32</span><span class="p">,<
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c10"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">UInt64</span><span class="p">,<
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c11"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">Float32</span><span class="p">,&l
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c12"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">Float64</span><span class="p">,&l
[...]
- <span class="nn">Field</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"c13"</span><span
class="p">,</span> <span
class="nn">DataType</span><span
class="p">::</span><span
class="n">Utf8</span><span class="p">,</
[...]
-<span class="p">]));</span>
-
-<span class="c">// register csv file with the execution
context</span>
-<span class="k">let</span> <span
class="n">csv_datasource</span> <span
class="o">=</span>
- <span class="nn">CsvDataSource</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span
class="s">"test/data/aggregate_test_100.csv"</span><span
class="p">,</span> <span
class="n">schema</span><span
class="nf">.clone</span><span
class="p">(),</span> [...]
-<span class="n">ctx</span><span
class="nf">.register_datasource</span><span
class="p">(</span><span
class="s">"aggregate_test_100"</span><span
class="p">,</span> <span
class="nn">Rc</span><span
class="p">::</span><span
class="nf">new</span><span
class="p">(</span><span class=" [...]
-
-<span class="k">let</span> <span
class="n">sql</span> <span
class="o">=</span> <span
class="s">"SELECT c1, MIN(c12), MAX(c12) FROM
aggregate_test_100 WHERE c11 &gt; 0.1 AND c11 &lt; 0.9 GROUP BY
c1"</span><span class="p">;</span>
-
-<span class="c">// execute the query</span>
-<span class="k">let</span> <span
class="n">relation</span> <span
class="o">=</span> <span
class="n">ctx</span><span
class="nf">.sql</span><span
class="p">(</span><span
class="o">&amp;</span><span
class="n">sql</span><span
class="p">)</span><span
class="nf">.unwrap</span& [...]
-<span class="k">let</span> <span
class="k">mut</span> <span
class="n">results</span> <span
class="o">=</span> <span
class="n">relation</span><span
class="nf">.borrow_mut</span><span
class="p">();</span>
-
-<span class="c">// iterate over the results</span>
-<span class="k">while</span> <span
class="k">let</span> <span
class="nf">Some</span><span
class="p">(</span><span
class="n">batch</span><span
class="p">)</span> <span
class="o">=</span> <span
class="n">results</span><span
class="nf">.next</span><span
class="p">()</span>&l [...]
- <span class="nd">println!</span><span
class="p">(</span>
- <span class="s">"RecordBatch has {} rows and {}
columns"</span><span class="p">,</span>
- <span class="n">batch</span><span
class="nf">.num_rows</span><span
class="p">(),</span>
- <span class="n">batch</span><span
class="nf">.num_columns</span><span
class="p">()</span>
- <span class="p">);</span>
-
- <span class="k">let</span> <span
class="n">c1</span> <span
class="o">=</span> <span
class="n">batch</span>
- <span class="nf">.column</span><span
class="p">(</span><span
class="mi">0</span><span
class="p">)</span>
- <span class="nf">.as_any</span><span
class="p">()</span>
- <span class="py">.downcast_ref</span><span
class="p">::</span><span
class="o">&lt;</span><span
class="n">BinaryArray</span><span
class="o">&gt;</span><span
class="p">()</span>
- <span class="nf">.unwrap</span><span
class="p">();</span>
-
- <span class="k">let</span> <span
class="n">min</span> <span
class="o">=</span> <span
class="n">batch</span>
- <span class="nf">.column</span><span
class="p">(</span><span
class="mi">1</span><span
class="p">)</span>
- <span class="nf">.as_any</span><span
class="p">()</span>
- <span class="py">.downcast_ref</span><span
class="p">::</span><span
class="o">&lt;</span><span
class="n">Float64Array</span><span
class="o">&gt;</span><span
class="p">()</span>
- <span class="nf">.unwrap</span><span
class="p">();</span>
-
- <span class="k">let</span> <span
class="n">max</span> <span
class="o">=</span> <span
class="n">batch</span>
- <span class="nf">.column</span><span
class="p">(</span><span
class="mi">2</span><span
class="p">)</span>
- <span class="nf">.as_any</span><span
class="p">()</span>
- <span class="py">.downcast_ref</span><span
class="p">::</span><span
class="o">&lt;</span><span
class="n">Float64Array</span><span
class="o">&gt;</span><span
class="p">()</span>
- <span class="nf">.unwrap</span><span
class="p">();</span>
-
- <span class="k">for</span> <span
class="n">i</span> <span
class="n">in</span> <span
class="mi">0</span><span
class="o">..</span><span
class="n">batch</span><span
class="nf">.num_rows</span><span
class="p">()</span> <span
class="p">{</span>
- <span class="k">let</span> <span
class="n">c1_value</span><span
class="p">:</span> <span
class="nb">String</span> <span
class="o">=</span> <span
class="nn">String</span><span
class="p">::</span><span
class="nf">from_utf8</span><span
class="p">(</span><span class="n">c1& [...]
- <span class="nd">println!</span><span
class="p">(</span><span class="s">"{},
Min: {}, Max: {}"</span><span
class="p">,</span> <span
class="n">c1_value</span><span
class="p">,</span> <span
class="n">min</span><span
class="nf">.value</span><span
class="p">(</span><span class [...]
- <span class="p">}</span>
-<span class="p">}</span>
-</code></pre></div></div>
-
-<h2 id="roadmap">Roadmap</h2>
-
-<p>The roadmap for DataFusion will depend on interest from the Rust
community, but here are some of the short-term items that are planned:</p>
-
-<ul>
- <li>Extending test coverage of the existing functionality</li>
- <li>Adding support for Parquet data sources</li>
- <li>Implementing more SQL features such as <code
class="highlighter-rouge">JOIN</code>, <code
class="highlighter-rouge">ORDER BY</code> and <code
class="highlighter-rouge">LIMIT</code></li>
- <li>Implementing a DataFrame API as an alternative to SQL</li>
- <li>Adding support for partitioning and parallel query execution using
Rust’s async and await functionality</li>
- <li>Creating a Docker image to make it easy to use DataFusion as a
standalone query tool for interactive and batch queries</li>
-</ul>
-
-<h2 id="contributors-welcome">Contributors Welcome!</h2>
-
-<p>If you are excited about being able to use Rust for data science and
would like to contribute to this work, there are many ways to get involved.
The simplest way to get started is to try out DataFusion against your own data
sources and file bug reports for any issues that you find. You could also check
out the current <a
href="https://cwiki.apache.org/confluence/display/ARROW/Rust+JIRA+Dashboard">list
of issues</a> and have a go at fixing one. You can a [...]
\ No newline at end of file
+popular
libraries.</p></content><author><name>wesm</name></author><media:thumbnail
xmlns:media="http://search.yahoo.com/mrss/"
url="https://arrow.apache.org/img/arrow.png" /><media:content medium="image"
url="https://arrow.apache.org/img/arrow.png"
xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>
\ No newline at end of file