Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8936 )

Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile 
scanners
......................................................................


Patch Set 13:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8936/13//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8936/13//COMMIT_MSG@14
PS13, Line 14:       with 100 iteration without failure.
We should run it for more than 100 iterations before committing,  I think 
there's still some chance of a rare failure slipping through otherwise.


http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@86
PS13, Line 86:   const int max_ncols = 8000000;
Let's make this consistent with other constant - should be all caps and defined 
up the top of the file.


http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@90
PS13, Line 90:     ss << stream_->filename() << " Invalid file-header-metadata, 
number of columns "
I think this error message is confusing. The metadata isn't invalid, the 
problem is that we've imposed a limit on the maximum number of columns that 
Impala can handle. I think the error message needs to be rewritten to explain 
the limit. It should also include the value of the limit.


http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@363
PS13, Line 363:   ss << stream_->filename() << " Bad rowcolumn index: " << 
col_idx;
Is there meant to be a space between row and column? It seems like we should 
also say something relating to corruption so that it's less cryptic. E.g. 
"Corrupt column at index: "


http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-sequence-scanner.cc
File be/src/exec/hdfs-sequence-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-sequence-scanner.cc@56
PS13, Line 56:   if (codegen == nullptr) {
This DCHECK should be valid - why did this need to be changed? If we're hitting 
the DCHECK it indicates a bug with how we set up codegen and we need to fix the 
bug rather than remove the DCHECK.


http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/read-write-util.h@198
PS11, Line 198:   // Buffer access out of bounds.
> Shouldn't this be >=? Since buf[size] is one past the end of the buffer?
Looks like you missed this comment.


http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc@452
PS11, Line 452:       if (input_len < 0) {
> Shouldn't we be checking if it's <= sizeof(uint32_t), since that's the amou
Looks like you missed this comment.


http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc@454
PS11, Line 454:         ss << " Corruption snappy decomp input_len " << 
input_len;
> Can we make this consistent with the other error handling in this function
Looks like you missed this comment.


http://gerrit.cloudera.org:8080/#/c/8936/13/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/8936/13/tests/query_test/test_scanners_fuzz.py@198
PS13, Line 198:         elif 'corrupt' in str(e).lower():
In theory we should just be skipping over corrupt files with abort_on_error=1. 
This doesn't always happen for a lot of file formats. We should just add rc and 
sequence file to the list of formats below on l207 that don't implement this 
skipping behaviour, rather than trying to guess based on the error message.



--
To view, visit http://gerrit.cloudera.org:8080/8936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a
Gerrit-Change-Number: 8936
Gerrit-PatchSet: 13
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: anujphadke <[email protected]>
Gerrit-Comment-Date: Wed, 11 Apr 2018 23:27:36 +0000
Gerrit-HasComments: Yes

Reply via email to