Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3764: fuzz test HDFS scanners
......................................................................

Patch Set 5:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 196: return Status(Substitute("Hit end of stream after reading $0 bytes of $1-byte "
> Does returning a status here and below honor the abort_on_error flag? I'm f
I confirmed it gets bubbled up to BaseSequenceScanner::ProcessSplit(), which logs the error. We actually test this with the fuzz testing, since we run with abort_on_error=0 and expect no exceptions (aside from Parquet, which has a couple of bugs of this nature).

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 79: DCHECK_GE(footer_start, 0);
> Ok to leave in this patch, but this should not be DCHECK right?
I believe if the above calculation is correct, it should never be < 0 because of the min().

Line 528: parse_status_.MergeStatus(Substitute("Corrupted parquet file: column had $0 "
> Parquet (capital)
Done

Line 714: ScopedBuffer metadata_buffer(scan_node_->mem_tracker());
> move closer to where it's used
I can't move it into the if statement since it needs to live longer than that (it's referenced via metadata_ptr).

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 97: return Status(Substitute("Corrupt parquet file: $0 bytes of encoded levels but only "
> Parquet
Done

Line 929: if (UNLIKELY(def_level_ < 0 || def_level_ > max_def_level())) {
Updated the message for this one.

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 64: return Status(Substitute("File $0: metadata is corrupt. Dictionary page "
> Parquet file '0'
Done

Line 84: return Status(Substitute("File '$0': metadata is corrupt. Column $1 has invalid "
> Parquet file '0':
Done

Line 341: // Sanity-check the schema to avoid allocating any absurdly large buffers below.
> remove 'any'
Done

Line 347: "File is likely corrupt", node->element->num_children));
> remove "File is likely corrupt" (the err msg already mentioned that)
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 41: static Status ValidateOffsetInFile(const std::string& filename, int column,
> column -> col_idx
Done

Line 149: /// An arbitrary limit on the number of children per schema node we support used to
> start a new sentence after "support"
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/runtime/scoped-buffer.h
File be/src/runtime/scoped-buffer.h:

Line 30: Release();
> single line
Done

Line 57: inline uint8_t* buffer() { return buffer_; }
> const method
Done

Line 62: /// The current size of the allocated buffer, if not NULL;
> trailing semicolon
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/bit-stream-utils.h
File be/src/util/bit-stream-utils.h:

Line 78: static const int MAX_SUPPORTED_BITWIDTH = 32;
> MAX_BITWIDTH is shorter and equally clear (and below)
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

Line 171: if (UNLIKELY(buffer_len < 1)) return Status("Dictionary cannot be 0 bytes");
> buffer_len == 0 is more obvious
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 600: class TestScannersFuzzing(ImpalaTestSuite):
> This is big enough to go into it's own .py file.
Done

Line 602: BATCH_SIZES = [0, 1, 16, 10000]
> include our default batch size 1024
0 means the default batch size.

Line 649: random_seed = os.environ.get("SCANNER_FUZZ_SEED") or time.time()
> If we are going to check in this test and run it in our evergreen jobs I th
We can discuss this a bit in person, but my feeling is that it's better to use a random seed to get the extra coverage, and that if we see breakages with any frequency it's a cultural problem with inadequate pre-commit testing of scanner changes. This is easy to run locally and tests a well-defined part of the codebase (the scanners), so the onus should be on patch authors to run this as part of pre-commit testing enough times to be confident that they haven't introduced any bugs.

I see the stress testing and query generator as slightly different since they're harder to run pre-commit and it's not always obvious which changes may introduce relevant bugs. If something slips through, I'd prefer the extra coverage we get from running this as part of all our tests with a random seed. Otherwise the coverage is pretty limited to a small subset of potential errors. My experience running this has been that you'd catch maybe 50% of bugs in the first run, 90% from running it for a few iterations, and the remainder running it for a few hours.

We also want to run this under all the different configs (S3, release, ASAN, etc.), so there's a lot of overhead to set up new jobs for all of those. It might be useful to have a job that just runs this in a loop, but I think that's a separate thing.
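For reference, a minimal sketch of the seed handling being discussed. The first line is from the quoted patch; the int conversion and the print call are illustrative assumptions about how the seed gets logged, not necessarily what the patch does:

  import os
  import random
  import time

  # Prefer an explicit seed from the environment so a failing run can be
  # reproduced; otherwise derive a fresh seed from the clock for extra
  # coverage on every run.
  random_seed = os.environ.get("SCANNER_FUZZ_SEED") or time.time()
  random_seed = int(random_seed)
  # Log the seed (assumed here via print) so failures are reproducible.
  print("Using random seed {0}".format(random_seed))
  random.seed(random_seed)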

Line 686: # Execute a query that try to read all the columns and rows in the file.
> tries to read
Done

Line 688: query = 'select count(*) from (select distinct * from {0}.{1}) q'.format(
> can we also just run a count(*) query to see what happens when we materiali
Done
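A sketch of the resulting pair of queries: the first is the quoted line from the patch, the second is the reviewer's suggested plain count(*). The database and table names are hypothetical placeholders:

  db_name, table_name = 'functional', 'alltypes'  # hypothetical example table
  # 'distinct *' forces every column to be read and materialized.
  queries = ['select count(*) from (select distinct * from {0}.{1}) q'.format(
      db_name, table_name)]
  # A plain count(*) also covers the case where no columns are materialized,
  # per the review suggestion above.
  queries.append('select count(*) from {0}.{1}'.format(db_name, table_name))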
--
To view, visit http://gerrit.cloudera.org:8080/3448
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes