[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners

Tim Armstrong (Code Review) Thu, 18 Jan 2018 11:53:01 -0800

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8936 )


Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile 
scanners
......................................................................


Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@329
PS3, Line 329:       << " key_length_: " << key_length_;
Maybe also include the file name in the messages? So that if this happens in 
the wild we can figure out which file is corrupt.


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@334
PS3, Line 334:   int col_idx = 0;
It doesn't look like moving the col_idx declaration out of the loop is needed, 
right?


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@337
PS3, Line 337: key_length_
Passing key_length_ seems wrong, since it measures the number of bytes in 
key_buffer, not the bytes remaining after key_buf_ptr.


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-rcfile-scanner.cc@464
PS3, Line 464:     if (column.uncompressed_buffer_len + column.start_offset > 
row_group_length_) {
Can you add a brief comment explaining what this is checking for? E.g. "Ensure 
there is enough space in 'row_group_buffer_' for the uncompressed data".


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.h
File be/src/exec/hdfs-sequence-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.h@257
PS3, Line 257: next_record_in_compressed_block_len
Needs an underscore in the end of the variable name.


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.cc
File be/src/exec/hdfs-sequence-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/hdfs-sequence-scanner.cc@235
PS3, Line 235:       ss << "SeqFile corrupted, record_locations_.size() " << 
record_locations_.size()
Would be good to include file names so it's easier to root cause in the field.


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util-test.cc
File be/src/exec/read-write-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util-test.cc@115
PS3, Line 115: }
How about we add a few unit tests to make sure that passing in a too-short 
buffer results in an error?


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@58
PS3, Line 58:   /// If the size byte is corrupted then return -1;
Can you document the size argument?


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@72
PS3, Line 72:   static int GetVLong(uint8_t* buf, int64_t offset, int64_t* 
vlong, int32_t size);
Can you document the size argument?


http://gerrit.cloudera.org:8080/#/c/8936/3/be/src/exec/read-write-util.h@198
PS3, Line 198:   if (len > MAX_VINT_LEN) return -1;
I think we can check len vs size here and save doing it in the loop below. 
Would be clearer to have the two length checks in the same place.


http://gerrit.cloudera.org:8080/#/c/8936/3/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/8936/3/tests/query_test/test_scanners_fuzz.py@203
PS3, Line 203:         elif 'corrupt' in str(e).lower():
The previous two aren't needed, right, since 'corrupt' is a substring of those?



--
To view, visit http://gerrit.cloudera.org:8080/8936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a
Gerrit-Change-Number: 8936
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: anujphadke <[email protected]>
Gerrit-Comment-Date: Thu, 18 Jan 2018 19:52:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners

Reply via email to