Internal Jenkins has submitted this change and it was merged.

Change subject: IMPALA-3764,3914: fuzz test HDFS scanners and fix parquet bugs found
......................................................................

IMPALA-3764,3914: fuzz test HDFS scanners and fix parquet bugs found

This adds a test that performs some simple fuzz testing of HDFS
scanners. It creates a copy of a given HDFS table, with each
file in the table corrupted in a random way: either a single
byte is set to a random value, or the file is truncated to a
random length. It then runs a query that scans the whole table
with several different batch_size settings. I made some effort
to make the failures reproducible by explicitly seeding the
random number generator, and providing a mechanism to override
the seed.

The fuzzer has found crashes resulting from corrupted or truncated
input files for RCFile, SequenceFile, Parquet, and Text LZO so far.
Avro only had a small buffer read overrun detected by ASAN.

Includes fixes for Parquet crashes found by the fuzzer, a small
buffer overrun in Avro, and a DCHECK in MemPool.

Initially it is only enabled for Avro, Parquet, and uncompressed
text. As follow-up work we should fix the bugs in the other scanners
and enable the test for them.

We also don't implement abort_on_error=0 correctly in Parquet:
for some file formats, corrupt headers result in the query being
aborted, so an exception will xfail the test.

Testing:
Ran the test with exploration_strategy=exhaustive in a loop locally
with both DEBUG and ASAN builds for a couple of days over a weekend.
Also ran exhaustive private build.

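The corruption step described above can be sketched roughly as follows.
This is an illustrative sketch only, not the code in
tests/query_test/test_scanners_fuzz.py: the names corrupt_bytes and
make_rng are hypothetical, the 50/50 split between the two corruption
modes is an assumption, and it operates on an in-memory byte string
rather than on files in HDFS.

```python
import random


def make_rng(seed=None):
    """Return (seed, rng). Hypothetical helper: picks a random seed unless
    one is supplied, so a failing run can be repeated with the same seed."""
    if seed is None:
        seed = random.SystemRandom().randrange(2 ** 32)
    return seed, random.Random(seed)


def corrupt_bytes(data, rng):
    """Return a corrupted copy of `data` (bytes).

    Either one byte is overwritten with a random value, or the data is
    truncated to a random shorter length. The equal probability of the
    two modes is an assumption made for this sketch.
    """
    if not data:
        return data
    if rng.random() < 0.5:
        # Overwrite a single byte at a random offset. The new value may
        # happen to equal the old one, in which case nothing changes.
        offset = rng.randrange(len(data))
        corrupted = bytearray(data)
        corrupted[offset] = rng.randrange(256)
        return bytes(corrupted)
    # Truncate to a random length in [0, len(data) - 1].
    return data[:rng.randrange(len(data))]


if __name__ == "__main__":
    seed, rng = make_rng()
    # Log the seed so that a failure can be reproduced later.
    print("Random seed: %d" % seed)
    print(repr(corrupt_bytes(b"example file contents", rng)))
```

Passing the logged seed back to make_rng replays the same sequence of
corruptions, which is the property that makes a fuzz failure debuggable.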
Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Reviewed-on: http://gerrit.cloudera.org:8080/3833
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Internal Jenkins
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/runtime/disk-io-mgr.cc
A be/src/runtime/scoped-buffer.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/compress.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/workloads/functional-query/queries/QueryTest/parquet.test
M tests/common/impala_test_suite.py
M tests/query_test/test_scanners.py
A tests/query_test/test_scanners_fuzz.py
19 files changed, 427 insertions(+), 56 deletions(-)

Approvals:
  Internal Jenkins: Verified
  Tim Armstrong: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/3833
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Tim Armstrong <[email protected]>
