This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

commit 277255ada3d80ccb23053bee80229d7cfabf555a
Author: Antoine Pitrou <[email protected]>
AuthorDate: Tue Apr 7 13:12:37 2020 -0500

    ARROW-7847: [Website] Add blog post about fuzzing the IPC layer
---
 _data/contributors.yml                 |  3 ++
 _posts/2020-04-01-fuzzing-arrow-ipc.md | 89 ++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/_data/contributors.yml b/_data/contributors.yml
index e70d9af..dcddb10 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -49,4 +49,7 @@
 - name: Neal Richardson
   apacheId: npr # Not a real apacheId
   githubId: nealrichardson
+- name: Antoine Pitrou
+  apacheId: apitrou
+  githubId: pitrou
 # End contributors.yml
diff --git a/_posts/2020-04-01-fuzzing-arrow-ipc.md b/_posts/2020-04-01-fuzzing-arrow-ipc.md
new file mode 100644
index 0000000..b094e1a
--- /dev/null
+++ b/_posts/2020-04-01-fuzzing-arrow-ipc.md
@@ -0,0 +1,89 @@
+---
+layout: post
+title: "Fuzzing the Arrow C++ IPC implementation"
+description: "We have set up continuous fuzzing for the Arrow C++ IPC reader.
+This helped us find and correct several issues where missing input validation
+would lead to crashes or undefined behaviour."
+date: "2020-04-01 00:00:00 +0100"
+author: apitrou
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Apache Arrow aims to allow fast and seamless data interchange between
+heterogenous runtimes and environments. Whether using the columnar
+[IPC stream protocol](https://arrow.apache.org/docs/format/Columnar.html),
+the [Flight](https://arrow.apache.org/docs/format/Flight.html) RPC layer,
+the Feather file format, the
+[Plasma](https://arrow.apache.org/docs/python/plasma.html) shared object
+store, or any application-specific data distribution mechanism, Arrow IPC
+implementations may try to decode data from untrusted input. While it is ok
+to report an error in that case, Arrow shouldn't crash or engage in risky
+behaviour while reading such data.
+
+To validate the robustness of the Arrow C++ IPC reader (which also underlies
+the Python, C/GLib, R and Ruby bindings), we
+[successfully submitted](https://github.com/google/oss-fuzz/pull/3233)
+the Arrow project to OSS-Fuzz, a continuous fuzzing initiative for critical
+open source projects, provided by Google.
+
+## What is being fuzzed
+
+As of this writing, the `RecordBatchStreamReader` and `RecordBatchFileReader`
+C++ classes are being fuzzed by feeding them data generated by the fuzzer.
+
+When a record batch is successfully read by one of those classes, the
+fuzzing setup then validates it using `RecordBatch::ValidateFull`. This
+method can either succeed or fail, but it shouldn't crash.
+
+By ensuring that reading a record batch from IPC, then validating it, always
+shows deterministic behaviour, we hope to make it relatively safe to ingest
+Arrow IPC data coming from untrusted sources.
+
+(of course, it is still recommended for security-critical applications
+ to use cryptographic means of authentication and integrity control -- for
+ example, to enable TLS with the Flight RPC protocol)
+
+## How we help the fuzzer find problems
+
+Fuzzing is a brute force process that tries to devise invalid data to
+exercise an implementation's response. By default, the fuzzer does not know
+anything about the data representation expected by the program under test.
+Fuzzing can therefore be extremely inefficient, testing tons of uninteresting
+variations while missing critical ones.
+
+To help guide the fuzzing process, we added a seed corpus of valid Arrow IPC
+files with various data types. By starting from this data and mutating it to
+find invalid variations, OSS-Fuzz was able to find tens of issues with data
+validation. All of them have been fixed. As of this writing, no new issue
+in the IPC layer was found since March 4th 2020.
+
+## What comes next
+
+Of course, we still monitor OSS-Fuzz for any new problem that could be found
+in the C++ IPC implementation. Such problems might for example appear when adding
+features to the Arrow [IPC format](https://arrow.apache.org/docs/format/Columnar.html).
+
+We have started fuzzing the Parquet C++ implementation. Several issues have
+been found and fixed, but more are still coming. We hope to stabilize the
+situation in the next month or two.
+
+The tensor and sparse tensor IPC read paths are not being exercised yet.
+They will be once a motivated core developer wants to own the topic.
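
Editorial note, not part of the commit: the seed-corpus approach described in the post can be illustrated with a small, self-contained Python sketch. It is a toy under stated assumptions, not the Arrow fuzz targets. The `TOY1` format, `read_batch`, `validate_full`, `mutate` and `fuzz` are all hypothetical names invented for this illustration, and the mutation loop is purely random, whereas the fuzzing engines run by OSS-Fuzz are coverage-guided. What it does mirror is the contract the post states: on malformed input the reader may report an error, and validation may succeed or fail, but nothing may crash.

```python
import random
import struct

MAGIC = b"TOY1"  # hypothetical toy format, NOT the Arrow IPC format


def make_seed(values):
    """Encode a valid message: MAGIC + uint32 count + little-endian int32 values."""
    payload = b"".join(struct.pack("<i", v) for v in values)
    return MAGIC + struct.pack("<I", len(values)) + payload


def read_batch(data):
    """Decode a message; malformed input must raise ValueError and nothing else."""
    if len(data) < 8 or data[:4] != MAGIC:
        raise ValueError("bad header")
    (count,) = struct.unpack_from("<I", data, 4)
    # Check the declared length against the real payload size before reading.
    if count * 4 != len(data) - 8:
        raise ValueError("declared length does not match payload size")
    return [struct.unpack_from("<i", data, 8 + 4 * i)[0] for i in range(count)]


def validate_full(values):
    """Toy semantic check (assumed invariant: no negative values)."""
    return all(v >= 0 for v in values)


def mutate(seed, rng):
    """Apply one random mutation: flip one bit, truncate, or append junk bytes."""
    buf = bytearray(seed)
    choice = rng.randrange(3)
    if choice == 0 and buf:
        buf[rng.randrange(len(buf))] ^= 1 << rng.randrange(8)
    elif choice == 1 and buf:
        del buf[rng.randrange(len(buf)):]
    else:
        buf += bytes(rng.randrange(256) for _ in range(rng.randrange(1, 5)))
    return bytes(buf)


def fuzz(corpus, iterations=2000, rng_seed=0):
    """Return the mutated inputs that made the reader misbehave."""
    rng = random.Random(rng_seed)
    failures = []
    for _ in range(iterations):
        data = mutate(rng.choice(corpus), rng)
        try:
            validate_full(read_batch(data))  # may succeed or fail
        except ValueError:
            pass  # a clean rejection is the expected outcome for bad input
        except Exception:
            failures.append(data)  # anything else counts as a "crash"
    return failures


# Seed corpus of valid messages, standing in for the valid IPC files in the post.
seed_corpus = [make_seed([1, 2, 3]), make_seed([]), make_seed([-7, 2**31 - 1])]
failures = fuzz(seed_corpus)
```

Starting from valid seeds means most mutations keep the header intact and reach the length check and the value decoding, which is where a missing validation would show up. Purely random bytes would almost always be rejected at the magic-number check.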
