Tim Armstrong has posted comments on this change. Change subject: IMPALA-3441: check for malformed Avro data ......................................................................
Patch Set 5: (26 comments) It's looking pretty good, I think mainly we could add (even more) test coverage, then I'm concerned about getting back some or all of the perf regression. I had some ideas that I mentioned inline. http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-ir.cc File be/src/exec/hdfs-avro-scanner-ir.cc: Line 46: uint8_t** data It may be worth adding a __restrict__ to some of these pointer args. Otherwise it won't be able to cache *data_len in a register, since the write to *data may overwrite *data_len. Line 60: if (union_position == 2) union_position = 1; Could implement this as a bitshift 1 bit to the right. It's possible the compiler already does this. Line 69: if (*data_len == 0) { Should probably add UNLIKELY() annotations to these error conditions. Not sure how much of a difference it will make. Line 133: 4 sizeof(float)? or a named constant Line 146: } sizeof(float)? Line 154: if (*data_len < 8) { sizeof(double)? or a named constant Line 161: } sizeof(double)? Line 170: if (!ReadWriteUtil::ReadZLong(data, data_len, &len)) { Can we factor out the string checks into a separate function? ReadAvroVarChar(), ReadAvroChar() and ReadAvroString() all share the same logic. http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-test.cc File be/src/exec/hdfs-avro-scanner-test.cc: Line 24: // TODO: CHAR, VARCHAR Maybe create a JIRA and elaborate on what coverage is missing if it's not going to be included in the patch. Line 131: TEST_F(HdfsAvroScannerTest, UnionTest) { This only exercises the special case when the union has two members, one of which is null, right? Maybe add a short comment to clarify. Line 187: TestReadAvroInt32(data, 0, -1, -1, TErrorCode::AVRO_TRUNCATED_BLOCK); Add a test for the upper bounds maybe, e.g. max int and min it? Line 190: // TODO: we don't handle invalid values (e.g. overflow) Could we add a test to exercise the code path at least? Even if we don't have a true "expected" value. Line 208: TestReadAvroInt64(data, 10, 64, 2); Add a test for the upper bounds maybe, e.g. max int and min it? Line 215: TEST_F(HdfsAvroScannerTest, FloatTest) { There's the funky case where in avro it's a float, but we're expecting a double: can you add coverage for that? Line 218: *reinterpret_cast<float*>(data) = f; memcpy(data, &f, sizeof(float)) seems a little clearer to me, but this is ok. Line 230: *reinterpret_cast<double*>(data) = d; memcpy? http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner.cc File be/src/exec/hdfs-avro-scanner.cc: Line 73: codegend_decode_avro_data_(NULL) { DCHECK that it's a test? Line 536: MemPool* pool, uint8_t** data, int64_t* data_len, Tuple* tuple) { I think it might be slightly better to provide a data_end pointer instead of a data_len, since it means one fewer mutable argument and I think will work out to slightly fewer instructions. I don't feel strongly about it though, I'm not sure it's worth going through the effort of changing it everywhere. Line 603: // TODO: handle return value Are you planning to do this in this patchset? Line 666: // br i1 %success, label %end_field, label %bail_out It may be worth adding branch-weight metadata to these branches (the equivalent of UNLIKELY annotations). That may help get back some of the perf regression. It looks like there's support for it: http://llvm.org/docs/BranchWeightMetadata.html . In LowerExpectIntrinsic.cpp there's some usage of it that looks relatively straightforward: MDBuilder MDB(CI->getContext()); MDNode *Node; // If expect value is equal to 1 it means that we are more likely to take // branch 0, in other case more likely is branch 1. if (ExpectedValue->isOne()) Node = MDB.createBranchWeights(LikelyBranchWeight, UnlikelyBranchWeight); else Node = MDB.createBranchWeights(UnlikelyBranchWeight, LikelyBranchWeight); BI.setMetadata(LLVMContext::MD_prof, Node); http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/read-write-util.cc File be/src/exec/read-write-util.cc: Line 24: int64_t* buf_len, int64_t* result Should add a __restrict__ for these args so they can stay in registers. Or alternatively explicitly cache them in locals. We could also consider adding a fast path where we check whether we have enough buffer remaining to always read the full integer. I'm not sure if there's much benefit to doing this per-field. Maybe if we could do the check once-per-tuple, but I guess that doesn't work if you have var-len data like strings. Line 29: if (*buf_len <= 0) return false; UNLIKELY() Line 33: more = (**buf & 0x80) != 0; What is the expected behaviour if the encoded int doesn't fit in a long? http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/zigzag-test.cc File be/src/exec/zigzag-test.cc: Line 61: // TODO: test error cases Let's add some basic tests at least for overflow, etc. http://gerrit.cloudera.org:8080/#/c/3072/5/tests/data_errors/test_data_errors.py File tests/data_errors/test_data_errors.py: Line 86: test_hdfs_rcfile_scan_node_errors Missed updating the test name Line 88: self.run_test_case('DataErrorsTest/avro-errors', vector) I don't see this file -- To view, visit http://gerrit.cloudera.org:8080/3072 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132 Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
