Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3764: fuzz test HDFS scanners
......................................................................

Patch Set 5:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 196: return Status(Substitute("Hit end of stream after reading $0 bytes of $1-byte "
> Does returning a status here and below honor the abort_on_error flag? I'm f
I confirmed it gets bubbled up to BaseSequenceScanner::ProcessSplit(), which logs the error. We actually test this with the fuzz testing, since we run with abort_on_error=0 and expect no exceptions (aside from Parquet, which has a couple of bugs of this nature).

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 79: DCHECK_GE(footer_start, 0);
> Ok to leave in this patch, but this should not be DCHECK right?
I believe if the above calculation is correct, it should never be < 0 because of the min().

Line 528: parse_status_.MergeStatus(Substitute("Corrupted parquet file: column had $0 "
> Parquet (capital)
Done

Line 714: ScopedBuffer metadata_buffer(scan_node_->mem_tracker());
> move closer to where it's used
I can't move it into the if statement since it needs to live longer than that (it's referenced via metadata_ptr).

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 97: return Status(Substitute("Corrupt parquet file: $0 bytes of encoded levels but only "
> Parquet
Done

Line 929: if (UNLIKELY(def_level_ < 0 || def_level_ > max_def_level())) {
Updated the message for this one.

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 64: return Status(Substitute("File $0: metadata is corrupt. Dictionary page "
> Parquet file '0'
Done

Line 84: return Status(Substitute("File '$0': metadata is corrupt. Column $1 has invalid "
> Parquet file '0':
Done

Line 341: // Sanity-check the schema to avoid allocating any absurdly large buffers below.
> remove 'any'
Done

Line 347: "File is likely corrupt", node->element->num_children));
> remove "File is likely corrupt" (the err msg already mentioned that)
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 41: static Status ValidateOffsetInFile(const std::string& filename, int column,
> column -> col_idx
Done

Line 149: /// An arbitrary limit on the number of children per schema node we support used to
> start a new sentence after "support"
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/runtime/scoped-buffer.h
File be/src/runtime/scoped-buffer.h:

Line 30: Release();
> single line
Done

Line 57: inline uint8_t* buffer() { return buffer_; }
> const method
Done

Line 62: /// The current size of the allocated buffer, if not NULL;
> trailing semicolon
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/bit-stream-utils.h
File be/src/util/bit-stream-utils.h:

Line 78: static const int MAX_SUPPORTED_BITWIDTH = 32;
> MAX_BITWIDTH is shorter and equally clear (and below)
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

Line 171: if (UNLIKELY(buffer_len < 1)) return Status("Dictionary cannot be 0 bytes");
> buffer_len == 0 is more obvious
Done

http://gerrit.cloudera.org:8080/#/c/3448/5/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 600: class TestScannersFuzzing(ImpalaTestSuite):
> This is big enough to go into it's own .py file.
Done

Line 602: BATCH_SIZES = [0, 1, 16, 10000]
> include our default batch size 1024
0 means the default batch size.

Line 649: random_seed = os.environ.get("SCANNER_FUZZ_SEED") or time.time()
> If we are going to check in this test and run it in our evergreen jobs I th
We can discuss this a bit in person, but my feeling is that it's better to use a random seed to get the extra coverage, and that if we see breakages with any frequency it's a cultural problem with inadequate pre-commit testing of scanner changes. This is easy to run locally and tests a well-defined part of the codebase (the scanners), so the onus should be on patch authors to run this as part of pre-commit testing enough times to be confident that they haven't introduced any bugs.

I see the stress testing and query generator as slightly different since they're harder to run pre-commit and it's not always obvious which changes may introduce relevant bugs. If something slips through, I'd prefer the extra coverage we get from running this as part of all our tests with a random seed. Otherwise the coverage is pretty limited to a small subset of potential errors. My experience running this has been that you'd catch maybe 50% of bugs in the first run, 90% from running it for a few iterations, and the remainder running it for a few hours.

We also want to run this under all the different configs (S3, release, ASAN, etc.), so there's a lot of overhead to set up new jobs for all of those. It might be useful to have a job that just runs this in a loop, but I think that's a separate thing.
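For reference, a minimal sketch of the seed handling being discussed. The first line is from the quoted patch; the int conversion and the print call are illustrative assumptions about how the seed gets logged, not necessarily what the patch does:

  import os
  import random
  import time

  # Prefer an explicit seed from the environment so a failing run can be
  # reproduced; otherwise derive a fresh seed from the clock for extra
  # coverage on every run.
  random_seed = os.environ.get("SCANNER_FUZZ_SEED") or time.time()
  random_seed = int(random_seed)
  # Log the seed (assumed here via print) so failures are reproducible.
  print("Using random seed {0}".format(random_seed))
  random.seed(random_seed)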

Line 686: # Execute a query that try to read all the columns and rows in the file.
> tries to read
Done

Line 688: query = 'select count(*) from (select distinct * from {0}.{1}) q'.format(
> can we also just run a count(*) query to see what happens when we materiali
Done
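A sketch of the resulting pair of queries: the first is the quoted line from the patch, the second is the reviewer's suggested plain count(*). The database and table names are hypothetical placeholders:

  db_name, table_name = 'functional', 'alltypes'  # hypothetical example table
  # 'distinct *' forces every column to be read and materialized.
  queries = ['select count(*) from (select distinct * from {0}.{1}) q'.format(
      db_name, table_name)]
  # A plain count(*) also covers the case where no columns are materialized,
  # per the review suggestion above.
  queries.append('select count(*) from {0}.{1}'.format(db_name, table_name))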
--
To view, visit http://gerrit.cloudera.org:8080/3448
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes