Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 )
Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners ...................................................................... Patch Set 13: (9 comments) http://gerrit.cloudera.org:8080/#/c/8936/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8936/13//COMMIT_MSG@14 PS13, Line 14: with 100 iteration without failure. We should run it for more than 100 iterations before committing, I think there's still some chance of a rare failure slipping through otherwise. http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@86 PS13, Line 86: const int max_ncols = 8000000; Let's make this consistent with other constant - should be all caps and defined up the top of the file. http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@90 PS13, Line 90: ss << stream_->filename() << " Invalid file-header-metadata, number of columns " I think this error message is confusing. The metadata isn't invalid, the problem is that we've imposed a limit on the maximum number of columns that Impala can handle. I think the error message needs to be rewritten to explain the limit. It should also include the value of the limit. http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-rcfile-scanner.cc@363 PS13, Line 363: ss << stream_->filename() << " Bad rowcolumn index: " << col_idx; Is there meant to be a space between row and column? It seems like we should also say something relating to corruption so that it's less cryptic. E.g. "Corrupt column at index: " http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/13/be/src/exec/hdfs-sequence-scanner.cc@56 PS13, Line 56: if (codegen == nullptr) { This DCHECK should be valid - why did this need to be changed? If we're hitting the DCHECK it indicates a bug with how we set up codegen and we need to fix the bug rather than remove the DCHECK. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/read-write-util.h File be/src/exec/read-write-util.h: http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/exec/read-write-util.h@198 PS11, Line 198: // Buffer access out of bounds. > Shouldn't this be >=? Since buf[size] is one past the end of the buffer? Looks like you missed this comment. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc File be/src/util/decompress.cc: http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc@452 PS11, Line 452: if (input_len < 0) { > Shouldn't we be checking if it's <= sizeof(uint32_t), since that's the amou Looks like you missed this comment. http://gerrit.cloudera.org:8080/#/c/8936/11/be/src/util/decompress.cc@454 PS11, Line 454: ss << " Corruption snappy decomp input_len " << input_len; > Can we make this consistent with the other error handling in this function Looks like you missed this comment. http://gerrit.cloudera.org:8080/#/c/8936/13/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/8936/13/tests/query_test/test_scanners_fuzz.py@198 PS13, Line 198: elif 'corrupt' in str(e).lower(): In theory we should just be skipping over corrupt files with abort_on_error=1. This doesn't always happen for a lot of file formats. We should just add rc and sequence file to the list of formats below on l207 that don't implement this skipping behaviour, rather than trying to guess based on the error message. -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 13 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: anujphadke <[email protected]> Gerrit-Comment-Date: Wed, 11 Apr 2018 23:27:36 +0000 Gerrit-HasComments: Yes
