Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3441: check for malformed Avro data
......................................................................


Patch Set 5:

(26 comments)

It's looking pretty good, I think mainly we could add (even more) test 
coverage, then I'm concerned about getting back some or all of the perf 
regression. I had some ideas that I mentioned inline.

http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

Line 46:  uint8_t** data
It may be worth adding a __restrict__ to some of these pointer args. Otherwise 
it won't be able to cache *data_len in a register, since the write to *data may 
overwrite *data_len.


Line 60:   if (union_position == 2) union_position = 1;
Could implement this as a bitshift 1 bit to the right. It's possible the 
compiler already does this.


Line 69:   if (*data_len == 0) {
Should probably add UNLIKELY() annotations to these error conditions. Not sure 
how much of a difference it will make.


Line 133: 4
sizeof(float)? or a named constant


Line 146:   }
sizeof(float)?


Line 154:   if (*data_len < 8) {
sizeof(double)? or a named constant


Line 161:   }
sizeof(double)?


Line 170:   if (!ReadWriteUtil::ReadZLong(data, data_len, &len)) {
Can we factor out the string checks into a separate function? 
ReadAvroVarChar(), ReadAvroChar() and ReadAvroString() all share the same logic.


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-test.cc
File be/src/exec/hdfs-avro-scanner-test.cc:

Line 24: // TODO: CHAR, VARCHAR
Maybe create a JIRA and elaborate on what coverage is missing if it's not going 
to be included in the patch.


Line 131: TEST_F(HdfsAvroScannerTest, UnionTest) {
This only exercises the special case when the union has two members, one of 
which is null, right? Maybe add a short comment to clarify.


Line 187:   TestReadAvroInt32(data, 0, -1, -1, 
TErrorCode::AVRO_TRUNCATED_BLOCK);
Add a test for the upper bounds maybe, e.g. max int and min it?


Line 190:   // TODO: we don't handle invalid values (e.g. overflow)
Could we add a test to exercise the code path at least? Even if we don't have a 
true "expected" value.


Line 208:   TestReadAvroInt64(data, 10, 64, 2);
Add a test for the upper bounds maybe, e.g. max int and min it?


Line 215: TEST_F(HdfsAvroScannerTest, FloatTest) {
There's the funky case where in avro it's a float, but we're expecting a 
double: can you add coverage for that?


Line 218:   *reinterpret_cast<float*>(data) = f;
memcpy(data, &f, sizeof(float)) seems a little clearer to me, but this is ok.


Line 230:   *reinterpret_cast<double*>(data) = d;
memcpy?


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner.cc
File be/src/exec/hdfs-avro-scanner.cc:

Line 73:     codegend_decode_avro_data_(NULL) {
DCHECK that it's a test?


Line 536:     MemPool* pool, uint8_t** data, int64_t* data_len, Tuple* tuple) {
I think it might be slightly better to provide a data_end pointer instead of a 
data_len, since it means one fewer mutable argument and I think will work out 
to slightly fewer instructions. I don't feel strongly about it though, I'm not 
sure it's worth going through the effort of changing it everywhere.


Line 603:         // TODO: handle return value
Are you planning to do this in this patchset?


Line 666: //   br i1 %success, label %end_field, label %bail_out
It may be worth adding branch-weight metadata to these branches (the equivalent 
of UNLIKELY annotations). That may help get back some of the perf regression. 
It looks like there's support for it: 
http://llvm.org/docs/BranchWeightMetadata.html . In LowerExpectIntrinsic.cpp 
there's some usage of it that looks relatively straightforward:


  MDBuilder MDB(CI->getContext());
  MDNode *Node;

  // If expect value is equal to 1 it means that we are more likely to take
  // branch 0, in other case more likely is branch 1.
  if (ExpectedValue->isOne())
    Node = MDB.createBranchWeights(LikelyBranchWeight, UnlikelyBranchWeight);
  else
    Node = MDB.createBranchWeights(UnlikelyBranchWeight, LikelyBranchWeight);

  BI.setMetadata(LLVMContext::MD_prof, Node);


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/read-write-util.cc
File be/src/exec/read-write-util.cc:

Line 24: int64_t* buf_len, int64_t* result
Should add a __restrict__ for these args so they can stay in registers. Or 
alternatively explicitly cache them in locals.

We could also consider adding a fast path where we check whether we have enough 
buffer remaining to always read the full integer. I'm not sure if there's much 
benefit to doing this per-field. Maybe if we could do the check once-per-tuple, 
but I guess that doesn't work if you have var-len data like strings.


Line 29:     if (*buf_len <= 0) return false;
UNLIKELY()


Line 33:     more = (**buf & 0x80) != 0;
What is the expected behaviour if the encoded int doesn't fit in a long?


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/zigzag-test.cc
File be/src/exec/zigzag-test.cc:

Line 61: // TODO: test error cases
Let's add some basic tests at least for overflow, etc.


http://gerrit.cloudera.org:8080/#/c/3072/5/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

Line 86: test_hdfs_rcfile_scan_node_errors
Missed updating the test name


Line 88:     self.run_test_case('DataErrorsTest/avro-errors', vector)
I don't see this file


-- 
To view, visit http://gerrit.cloudera.org:8080/3072
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to