Skye Wanderman-Milne has posted comments on this change. Change subject: IMPALA-3441: check for malformed Avro data ......................................................................
Patch Set 5: (24 comments) Dan, we could probably use more codegen testing, but by inspection of the code we are testing the relevant paths. I.e. we don't test failure of every type with codegen'd code, but most of the codegen'd paths for different types are the same and we test all of them. Tim, I've addressed most of your comments, but haven't tried two of your suggestions in hdfs-avro-scanner.cc (passing data_end instead of data_len and branch weight metadata to the codegen'd code). I'll work on those next, but I'll post what I have so you can review the other changes in parallel. Similarly, I'm not sure where we spend the extra time yet, but I'll investigate. I'll have to do it by removing code since all the changes are in codegen'd code, so I can't profile normally. ReadZLong would be my guess too. http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-ir.cc File be/src/exec/hdfs-avro-scanner-ir.cc: Line 46: uint8_t** data > It may be worth adding a __restrict__ to some of these pointer args. Otherw It doesn't seem to have an effect, probably because everything's inlined with codegen (or I did something wrong, I just added __restrict__ to every data and data_len pointer in this file, lemme know if I should do something else). Line 60: if (union_position == 2) union_position = 1; > Could implement this as a bitshift 1 bit to the right. It's possible the co Also didn't have an effect on my benchmark, hopefully the compiler is doing this. Line 69: if (*data_len == 0) { > Should probably add UNLIKELY() annotations to these error conditions. Not s Done Line 133: 4 > sizeof(float)? or a named constant Done Line 146: } > sizeof(float)? Done Line 154: if (*data_len < 8) { > sizeof(double)? or a named constant Done Line 161: } > sizeof(double)? Done Line 170: if (!ReadWriteUtil::ReadZLong(data, data_len, &len)) { > Can we factor out the string checks into a separate function? ReadAvroVarCh Done http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-test.cc File be/src/exec/hdfs-avro-scanner-test.cc: Line 24: // TODO: CHAR, VARCHAR > Maybe create a JIRA and elaborate on what coverage is missing if it's not g Done Line 131: TEST_F(HdfsAvroScannerTest, UnionTest) { > This only exercises the special case when the union has two members, one of Done, also renamed to NullUnionTest Line 187: TestReadAvroInt32(data, 0, -1, -1, TErrorCode::AVRO_TRUNCATED_BLOCK); > Add a test for the upper bounds maybe, e.g. max int and min it? Done Line 190: // TODO: we don't handle invalid values (e.g. overflow) > Could we add a test to exercise the code path at least? Even if we don't ha It DCHECKs right now, so it would take some changes to even run this case. Line 208: TestReadAvroInt64(data, 10, 64, 2); > Add a test for the upper bounds maybe, e.g. max int and min it? Done Line 215: TEST_F(HdfsAvroScannerTest, FloatTest) { > There's the funky case where in avro it's a float, but we're expecting a do Done (it's done in TestReadAvroFloat()). I also added similar coverage for type promotion of ints and bigints. Line 218: *reinterpret_cast<float*>(data) = f; > memcpy(data, &f, sizeof(float)) seems a little clearer to me, but this is o Done Line 230: *reinterpret_cast<double*>(data) = d; > memcpy? Done http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner.cc File be/src/exec/hdfs-avro-scanner.cc: Line 73: codegend_decode_avro_data_(NULL) { > DCHECK that it's a test? Good idea, done (here and in parent ctors) Line 603: // TODO: handle return value > Are you planning to do this in this patchset? Oops, this is actually super easy to handle, done. Unfortunately there is no testing for Avro records... I think I meant to do this as follow on work for nested types, but then that didn't happen. I filed IMPALA-3660 for now, and I'm gonna try to resolve this ASAP after I finish this patch. http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/read-write-util.cc File be/src/exec/read-write-util.cc: Line 24: int64_t* buf_len, int64_t* result > Should add a __restrict__ for these args so they can stay in registers. Or Adding __restrict__ doesn't seem to have an impact on performance, but I added it anyway. There probably is some scheme where we can use a fast path and only break out of it after strings, but I think that will be enough plumbing that we should do it in a later patch (if we think it's worth it). Line 29: if (*buf_len <= 0) return false; > UNLIKELY() Done Line 33: more = (**buf & 0x80) != 0; > What is the expected behaviour if the encoded int doesn't fit in a long? There is none. We don't handle invalid data right now (there's a TODO about this in the test, I'll also file a JIRA). http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/zigzag-test.cc File be/src/exec/zigzag-test.cc: Line 61: // TODO: test error cases > Let's add some basic tests at least for overflow, etc. It DCHECKs currently :( I filed https://issues.cloudera.org/browse/IMPALA-3659 http://gerrit.cloudera.org:8080/#/c/3072/5/tests/data_errors/test_data_errors.py File tests/data_errors/test_data_errors.py: Line 86: test_hdfs_rcfile_scan_node_errors > Missed updating the test name Done Line 88: self.run_test_case('DataErrorsTest/avro-errors', vector) > I don't see this file Oops forgot to commit it -- To view, visit http://gerrit.cloudera.org:8080/3072 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132 Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
