Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 )
Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners ...................................................................... Patch Set 3: (11 comments) http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@329 PS3, Line 329: << " key_length_: " << key_length_; Maybe also include the file name in the messages? So that if this happens in the wild we can figure out which file is corrupt. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@334 PS3, Line 334: int col_idx = 0; It doesn't look like moving the col_idx declaration out of the loop is needed, right? http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@337 PS3, Line 337: key_length_ Passing key_length_ seems wrong, since it measures the number of bytes in key_buffer, not the bytes remaining after key_buf_ptr. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@464 PS3, Line 464: if (column.uncompressed_buffer_len + column.start_offset > row_group_length_) { Can you add a brief comment explaining what this is checking for? E.g. "Ensure there is enough space in 'row_group_buffer_' for the uncompressed data". http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.h File be/src/exec/hdfs-sequence-scanner.h: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.h@257 PS3, Line 257: next_record_in_compressed_block_len Needs an underscore in the end of the variable name. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.cc@235 PS3, Line 235: ss << "SeqFile corrupted, record_locations_.size() " << record_locations_.size() Would be good to include file names so it's easier to root cause in the field. http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util-test.cc File be/src/exec/read-write-util-test.cc: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util-test.cc@115 PS3, Line 115: } How about we add a few unit tests to make sure that passing in a too-short buffer results in an error? http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h File be/src/exec/read-write-util.h: http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@58 PS3, Line 58: /// If the size byte is corrupted then return -1; Can you document the size argument? http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@72 PS3, Line 72: static int GetVLong(uint8_t* buf, int64_t offset, int64_t* vlong, int32_t size); Can you document the size argument? http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@198 PS3, Line 198: if (len > MAX_VINT_LEN) return -1; I think we can check len vs size here and save doing it in the loop below. Would be clearer to have the two length checks in the same place. http://gerrit.cloudera.org:8080/#/c/8936/3/tests/query_test/test_scanners_fuzz.py File tests/query_test/test_scanners_fuzz.py: http://gerrit.cloudera.org:8080/#/c/8936/3/tests/query_test/test_scanners_fuzz.py@203 PS3, Line 203: elif 'corrupt' in str(e).lower(): The previous two aren't needed, right, since 'corrupt' is a substring of those? -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 3 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: anujphadke <[email protected]> Gerrit-Comment-Date: Thu, 18 Jan 2018 19:52:49 +0000 Gerrit-HasComments: Yes
