Alex Behm has posted comments on this change.

Change subject: IMPALA-3764: fuzz test HDFS scanners
......................................................................


Patch Set 5:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 196:     return Status(Substitute("Hit end of stream after reading $0 
bytes of $1-byte "
Does returning a status here and below honor the abort_on_error flag? I'm fine 
with leaving a TODO if you don't want to fix it in this patch.


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 79:     DCHECK_GE(footer_start, 0);
Ok to leave in this patch, but this should not be DCHECK right?


Line 528:         parse_status_.MergeStatus(Substitute("Corrupted parquet file: 
column had $0 "
Parquet (capital)

Also print the column path's as well as the filename (and elsewhere)


Line 714:   ScopedBuffer metadata_buffer(scan_node_->mem_tracker());
move closer to where it's used


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 97:     return Status(Substitute("Corrupt parquet file: $0 bytes of 
encoded levels but only "
Parquet


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 64:         return Status(Substitute("File $0: metadata is corrupt. 
Dictionary page "
Parquet file '0'


Line 84:     return Status(Substitute("File '$0': metadata is corrupt. Column 
$1 has invalid "
Parquet file '0':


Line 341:     // Sanity-check the schema to avoid allocating any absurdly large 
buffers below.
remove 'any'


Line 347:         "File is likely corrupt", node->element->num_children));
remove "File is likely corrupt" (the err msg already mentioned that)


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 41:   static Status ValidateOffsetInFile(const std::string& filename, int 
column,
column -> col_idx


Line 149:   /// An arbitrary limit on the number of children per schema node we 
support used to
start a new sentence after "support"


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/runtime/scoped-buffer.h
File be/src/runtime/scoped-buffer.h:

Line 30:     Release();
single line


Line 57:   inline uint8_t* buffer() { return buffer_; }
const method


Line 62:   /// The current size of the allocated buffer, if not NULL;
trailing semicolon


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/bit-stream-utils.h
File be/src/util/bit-stream-utils.h:

Line 78:   static const int MAX_SUPPORTED_BITWIDTH = 32;
MAX_BITWIDTH is shorter and equally clear (and below)


http://gerrit.cloudera.org:8080/#/c/3448/5/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

Line 171:     if (UNLIKELY(buffer_len < 1)) return Status("Dictionary cannot be 
0 bytes");
buffer_len == 0 is more obvious


http://gerrit.cloudera.org:8080/#/c/3448/5/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 600: class TestScannersFuzzing(ImpalaTestSuite):
This is big enough to go into it's own .py file.


Line 602:   BATCH_SIZES = [0, 1, 16, 10000]
include our default batch size 1024


Line 649:     random_seed = os.environ.get("SCANNER_FUZZ_SEED") or time.time()
If we are going to check in this test and run it in our evergreen jobs I think 
we should use a fixed seed. Otherwise, all kinds of jobs (Gvms, etc.) could 
fail because we gut unlucky in this test. Maybe we want to run/integrate this 
with the random query generator or stress test procedure/cadence?


Line 686:     # Execute a query that try to read all the columns and rows in 
the file.
tries to read


Line 688:     query = 'select count(*) from (select distinct * from {0}.{1}) 
q'.format(
can we also just run a count(*) query to see what happens when we materialize 
no columns? for some formats (like Parquet) we use very different code paths


-- 
To view, visit http://gerrit.cloudera.org:8080/3448
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: David Knupp <[email protected]>
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to