Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8936 )

Change subject: IMPALA-3833: Fix invalid data handling in Sequence and RCFile 
scanners
......................................................................


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-rcfile-scanner.cc@362
PS12, Line 362:   if (remain_len <= 0) {
              :     stringstream ss;
              :     ss << stream_->filename() << " RCFile corrupt, out of 
bounds error"
              :       " GetCurrentKeyBuffer col_idx: " << col_idx << " 
remaining length: " << remain_len
              :       << " col_info.buffer_len: " << col_info.buffer_len;
              :     return Status(ss.str());
              :   }
              :
              :   DCHECK_GT(remain_len, 0);
              :   int bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr, 
&col_info.buffer_len, remain_len);
              :   if (bytes_read == -1) {
              :     stringstream ss;
              :     ss << stream_->filename() << " RCFile corrupt, out of 
bounds error"
              :       " GetCurrentKeyBuffer col_idx: " << col_idx << " 
remaining length: " << remain_len
              :       << " col_info.buffer_len: " << col_info.buffer_len;
              :     return Status(ss.str());
              :   }
              :   *key_buf_ptr += bytes_read;
              :   remain_len -= bytes_read;
              :   DCHECK_GT(remain_len, 0);
              :
              :   bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr, 
&col_info.uncompressed_buffer_len,
              :       remain_len);
              :   if (bytes_read == -1) {
              :     stringstream ss;
              :     ss << stream_->filename() << " RCFile corrupt, out of 
bounds error key buffer "
              :       "col_idx: " << col_idx << " remaining length: "<< 
remain_len;
              :     return Status(ss.str());
              :   }
              :   *key_buf_ptr += bytes_read;
              :   remain_len -= bytes_read;
              :   DCHECK_GT(remain_len, 0);
              :
              :   int col_key_buf_len;
              :   bytes_read = ReadWriteUtil::GetVInt(*key_buf_ptr , 
&col_key_buf_len, remain_len);
              :   if (bytes_read == -1) {
              :     stringstream ss;
              :     ss << stream_->filename() << " RCFile corrupt, out of 
bounds error key "
              :       "buffer col_idx: " << col_idx << " remaining length: "<< 
remain_len;
              :     return Status(ss.str());
              :   }
these error messages seem very tailored to Impala's internal bookkeeping. 
Should we rephrase them to be more in terms of the RC file format, so that they 
make sense without looking at Impala code? i.e. so they could be interpreted by 
whoever is supporting the tool that's producing these files?  Also, should we 
be dedup'ing these message (i.e. assigning an error code) or do we not expect 
that to be worthwhile?

(Also, using Substitute() may make this more readable, but you can decide which 
you prefer.)


http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-sequence-scanner.cc
File be/src/exec/hdfs-sequence-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8936/12/be/src/exec/hdfs-sequence-scanner.cc@233
PS12, Line 233:     if (i > record_locations_.size()) {
              :       stringstream ss;
              :       ss << stream_->filename() << " SeqFile corrupted, 
record_locations_.size() "
              :       << record_locations_.size() << " is less than curr_idx: " 
<< i;
              :       return Status(ss.str());
              :     }
              :     int bytes_read = ReadWriteUtil::GetVLong(
              :         next_record_in_compressed_block_, 
&record_locations_[i].len,
              :         next_record_in_compressed_block_len_);
              :     if (UNLIKELY(bytes_read == -1)) {
              :       stringstream ss;
              :       ss << stream_->filename() << " SeqFile corrupted, Invalid 
record sizes in "
              :         "compressed block "<< " rec_len: " << 
record_locations_[i].len
              :         << " next_record_in_compressed_block_len_ "
              :         << next_record_in_compressed_block_len_;
              :       return Status(ss.str());
              :     }
              :     next_record_in_compressed_block_ += bytes_read;
              :     next_record_in_compressed_block_len_ -= bytes_read;
              :     if (next_record_in_compressed_block_len_ <= 0) {
              :       stringstream ss;
              :       ss << stream_->filename() << " SeqFile corrupted,"
              :         " next_record_in_compressed_block_len_ le 0" << " 
comp_blk_len: "
              :         << next_record_in_compressed_block_len_;
              :       return Status(ss.str());
              :     }
              :     record_locations_[i].record = 
next_record_in_compressed_block_;
              :     next_record_in_compressed_block_ += 
record_locations_[i].len;
              :     next_record_in_compressed_block_len_ -= 
record_locations_[i].len;
              :     if (next_record_in_compressed_block_len_ < 0) {
              :       stringstream ss;
              :       ss << stream_->filename() << " SeqFile corrupted, "
              :         << "next_record_in_compressed_block_len_ lt 0"
              :         << " comp_blk_len: " << 
next_record_in_compressed_block_len_;
              :       return Status(ss.str());
              :     }
these are also written in terms of Impala internals.  Contrast that with the 
error messages already in this file that are worded in terms of the file format 
(e.g. "Expecting sync indicator (-1) at file offset ...")



--
To view, visit http://gerrit.cloudera.org:8080/8936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic9cfc38af3f30c65ada9734eb471dbfa6ecdd74a
Gerrit-Change-Number: 8936
Gerrit-PatchSet: 12
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: anujphadke <apha...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Mar 2018 00:15:46 +0000
Gerrit-HasComments: Yes

Reply via email to