Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8936 )
Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile scanners ...................................................................... Patch Set 12: (2 comments) http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-rcfile-scanner.cc@362 PS12, Line 362: if (remain_len <= 0) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error" : " GetCurrentKeyBuffer col_idx: " << col_idx << " remaining length: " << remain_len : << " col_info.buffer_len: " << col_info.buffer_len; : return Status(ss.str()); : } : : DCHECK_GT(remain_len, 0); : int bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr, &col_info.buffer_len, remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error" : " GetCurrentKeyBuffer col_idx: " << col_idx << " remaining length: " << remain_len : << " col_info.buffer_len: " << col_info.buffer_len; : return Status(ss.str()); : } : *key_buf_ptr += bytes_read; : remain_len -= bytes_read; : DCHECK_GT(remain_len, 0); : : bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr, &col_info.uncompressed_buffer_len, : remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error key buffer " : "col_idx: " << col_idx << " remaining length: "<< remain_len; : return Status(ss.str()); : } : *key_buf_ptr += bytes_read; : remain_len -= bytes_read; : DCHECK_GT(remain_len, 0); : : int col_key_buf_len; : bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr , &col_key_buf_len, remain_len); : if (bytes_read == -1) { : stringstream ss; : ss << stream_->filename() << " RCFile corrupt, out of bounds error key " : "buffer col_idx: " << col_idx << " remaining length: "<< remain_len; : return Status(ss.str()); : } these error messages seem very tailored to Impala's internal bookkeeping. Should we rephrase them to be more in terms of the RC file format, so that they make sense without looking at Impala code? i.e. so they could be interpreted by whoever is supporting the tool that's producing these files? Also, should we be dedup'ing these message (i.e. assigning an error code) or do we not expect that to be worthwhile? (Also, using Substitute() may make this more readable, but you can decide which you prefer.) http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-sequence-scanner.cc File be/src/exec/hdfs-sequence-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-sequence-scanner.cc@233 PS12, Line 233: if (i > record_locations_.size()) { : stringstream ss; : ss << stream_->filename() << " SeqFile corrupted, record_locations_.size() " : << record_locations_.size() << " is less than curr_idx: " << i; : return Status(ss.str()); : } : int bytes_read = ReadWriteUtil::GetVLong( : next_record_in_compressed_block_, &record_locations_[i].len, : next_record_in_compressed_block_len_); : if (UNLIKELY(bytes_read == -1)) { : stringstream ss; : ss << stream_->filename() << " SeqFile corrupted, Invalid record sizes in " : "compressed block "<< " rec_len: " << record_locations_[i].len : << " next_record_in_compressed_block_len_ " : << next_record_in_compressed_block_len_; : return Status(ss.str()); : } : next_record_in_compressed_block_ += bytes_read; : next_record_in_compressed_block_len_ -= bytes_read; : if (next_record_in_compressed_block_len_ <= 0) { : stringstream ss; : ss << stream_->filename() << " SeqFile corrupted," : " next_record_in_compressed_block_len_ le 0" << " comp_blk_len: " : << next_record_in_compressed_block_len_; : return Status(ss.str()); : } : record_locations_[i].record = next_record_in_compressed_block_; : next_record_in_compressed_block_ += record_locations_[i].len; : next_record_in_compressed_block_len_ -= record_locations_[i].len; : if (next_record_in_compressed_block_len_ < 0) { : stringstream ss; : ss << stream_->filename() << " SeqFile corrupted, " : << "next_record_in_compressed_block_len_ lt 0" : << " comp_blk_len: " << next_record_in_compressed_block_len_; : return Status(ss.str()); : } these are also written in terms of Impala internals. Contrast that with the error messages already in this file that are worded in terms of the file format (e.g. "Expecting sync indicator (-1) at file offset ...") -- To view, visit http://gerrit.cloudera.org:8080/8936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a Gerrit-Change-Number: 8936 Gerrit-PatchSet: 12 Gerrit-Owner: Pranay Singh Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Pranay Singh Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: anujphadke <[email protected]> Gerrit-Comment-Date: Wed, 14 Mar 2018 00:15:46 +0000 Gerrit-HasComments: Yes
