Alex Behm has posted comments on this change. Change subject: IMPALA-3764: fuzz test HDFS scanners ......................................................................
Patch Set 5: (21 comments) http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/base-sequence-scanner.cc File be/src/exec/base-sequence-scanner.cc: Line 196: return Status(Substitute("Hit end of stream after reading $0 bytes of $1-byte " Does returning a status here and below honor the abort_on_error flag? I'm fine with leaving a TODO if you don't want to fix it in this patch. http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 79: DCHECK_GE(footer_start, 0); Ok to leave in this patch, but this should not be DCHECK right? Line 528: parse_status_.MergeStatus(Substitute("Corrupted parquet file: column had $0 " Parquet (capital) Also print the column path's as well as the filename (and elsewhere) Line 714: ScopedBuffer metadata_buffer(scan_node_->mem_tracker()); move closer to where it's used http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 97: return Status(Substitute("Corrupt parquet file: $0 bytes of encoded levels but only " Parquet http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: Line 64: return Status(Substitute("File $0: metadata is corrupt. Dictionary page " Parquet file '0' Line 84: return Status(Substitute("File '$0': metadata is corrupt. Column $1 has invalid " Parquet file '0': Line 341: // Sanity-check the schema to avoid allocating any absurdly large buffers below. remove 'any' Line 347: "File is likely corrupt", node->element->num_children)); remove "File is likely corrupt" (the err msg already mentioned that) http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.h File be/src/exec/parquet-metadata-utils.h: Line 41: static Status ValidateOffsetInFile(const std::string& filename, int column, column -> col_idx Line 149: /// An arbitrary limit on the number of children per schema node we support used to start a new sentence after "support" http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/runtime/scoped-buffer.h File be/src/runtime/scoped-buffer.h: Line 30: Release(); single line Line 57: inline uint8_t* buffer() { return buffer_; } const method Line 62: /// The current size of the allocated buffer, if not NULL; trailing semicolon http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/bit-stream-utils.h File be/src/util/bit-stream-utils.h: Line 78: static const int MAX_SUPPORTED_BITWIDTH = 32; MAX_BITWIDTH is shorter and equally clear (and below) http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: Line 171: if (UNLIKELY(buffer_len < 1)) return Status("Dictionary cannot be 0 bytes"); buffer_len == 0 is more obvious http://gerrit.cloudera.org:8080/#/c/3448/5/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: Line 600: class TestScannersFuzzing(ImpalaTestSuite): This is big enough to go into it's own .py file. Line 602: BATCH_SIZES = [0, 1, 16, 10000] include our default batch size 1024 Line 649: random_seed = os.environ.get("SCANNER_FUZZ_SEED") or time.time() If we are going to check in this test and run it in our evergreen jobs I think we should use a fixed seed. Otherwise, all kinds of jobs (Gvms, etc.) could fail because we gut unlucky in this test. Maybe we want to run/integrate this with the random query generator or stress test procedure/cadence? Line 686: # Execute a query that try to read all the columns and rows in the file. tries to read Line 688: query = 'select count(*) from (select distinct * from {0}.{1}) q'.format( can we also just run a count(*) query to see what happens when we materialize no columns? for some formats (like Parquet) we use very different code paths -- To view, visit http://gerrit.cloudera.org:8080/3448 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Tim Armstrong <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: David Knupp <[email protected]> Gerrit-Reviewer: Taras Bobrovytsky <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
