Skye Wanderman-Milne has posted comments on this change.

Change subject: IMPALA-3441: check for malformed Avro data
......................................................................


Patch Set 5:

(24 comments)

Dan, we could probably use more codegen testing, but by inspection of the code 
we are testing the relevant paths. I.e. we don't test failure of every type 
with codegen'd code, but most of the codegen'd paths for different types are 
the same and we test all of them. 

Tim, I've addressed most of your comments, but haven't tried two of your 
suggestions in hdfs-avro-scanner.cc (passing data_end instead of data_len and 
branch weight metadata to the codegen'd code). I'll work on those next, but 
I'll post what I have so you can review the other changes in parallel. 

Similarly, I'm not sure where we spend the extra time yet, but I'll 
investigate. I'll have to do it by removing code since all the changes are in 
codegen'd code, so I can't profile normally. ReadZLong would be my guess too.

http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

Line 46:  uint8_t** data
> It may be worth adding a __restrict__ to some of these pointer args. Otherw
It doesn't seem to have an effect, probably because everything's inlined with 
codegen (or I did something wrong, I just added __restrict__ to every data and 
data_len pointer in this file, lemme know if I should do something else).


Line 60:   if (union_position == 2) union_position = 1;
> Could implement this as a bitshift 1 bit to the right. It's possible the co
Also didn't have an effect on my benchmark, hopefully the compiler is doing 
this.


Line 69:   if (*data_len == 0) {
> Should probably add UNLIKELY() annotations to these error conditions. Not s
Done


Line 133: 4
> sizeof(float)? or a named constant
Done


Line 146:   }
> sizeof(float)?
Done


Line 154:   if (*data_len < 8) {
> sizeof(double)? or a named constant
Done


Line 161:   }
> sizeof(double)?
Done


Line 170:   if (!ReadWriteUtil::ReadZLong(data, data_len, &len)) {
> Can we factor out the string checks into a separate function? ReadAvroVarCh
Done


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner-test.cc
File be/src/exec/hdfs-avro-scanner-test.cc:

Line 24: // TODO: CHAR, VARCHAR
> Maybe create a JIRA and elaborate on what coverage is missing if it's not g
Done


Line 131: TEST_F(HdfsAvroScannerTest, UnionTest) {
> This only exercises the special case when the union has two members, one of
Done, also renamed to NullUnionTest


Line 187:   TestReadAvroInt32(data, 0, -1, -1, 
TErrorCode::AVRO_TRUNCATED_BLOCK);
> Add a test for the upper bounds maybe, e.g. max int and min it?
Done


Line 190:   // TODO: we don't handle invalid values (e.g. overflow)
> Could we add a test to exercise the code path at least? Even if we don't ha
It DCHECKs right now, so it would take some changes to even run this case.


Line 208:   TestReadAvroInt64(data, 10, 64, 2);
> Add a test for the upper bounds maybe, e.g. max int and min it?
Done


Line 215: TEST_F(HdfsAvroScannerTest, FloatTest) {
> There's the funky case where in avro it's a float, but we're expecting a do
Done (it's done in TestReadAvroFloat()). I also added similar coverage for type 
promotion of ints and bigints.


Line 218:   *reinterpret_cast<float*>(data) = f;
> memcpy(data, &f, sizeof(float)) seems a little clearer to me, but this is o
Done


Line 230:   *reinterpret_cast<double*>(data) = d;
> memcpy?
Done


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/hdfs-avro-scanner.cc
File be/src/exec/hdfs-avro-scanner.cc:

Line 73:     codegend_decode_avro_data_(NULL) {
> DCHECK that it's a test?
Good idea, done (here and in parent ctors)


Line 603:         // TODO: handle return value
> Are you planning to do this in this patchset?
Oops, this is actually super easy to handle, done.

Unfortunately there is no testing for Avro records... I think I meant to do 
this as follow on work for nested types, but then that didn't happen. I filed 
IMPALA-3660 for now, and I'm gonna try to resolve this ASAP after I finish this 
patch.


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/read-write-util.cc
File be/src/exec/read-write-util.cc:

Line 24: int64_t* buf_len, int64_t* result
> Should add a __restrict__ for these args so they can stay in registers. Or 
Adding __restrict__ doesn't seem to have an impact on performance, but I added 
it anyway.

There probably is some scheme where we can use a fast path and only break out 
of it after strings, but I think that will be enough plumbing that we should do 
it in a later patch (if we think it's worth it).


Line 29:     if (*buf_len <= 0) return false;
> UNLIKELY()
Done


Line 33:     more = (**buf & 0x80) != 0;
> What is the expected behaviour if the encoded int doesn't fit in a long?
There is none. We don't handle invalid data right now (there's a TODO about 
this in the test, I'll also file a JIRA).


http://gerrit.cloudera.org:8080/#/c/3072/5/be/src/exec/zigzag-test.cc
File be/src/exec/zigzag-test.cc:

Line 61: // TODO: test error cases
> Let's add some basic tests at least for overflow, etc.
It DCHECKs currently :( I filed https://issues.cloudera.org/browse/IMPALA-3659


http://gerrit.cloudera.org:8080/#/c/3072/5/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

Line 86: test_hdfs_rcfile_scan_node_errors
> Missed updating the test name
Done


Line 88:     self.run_test_case('DataErrorsTest/avro-errors', vector)
> I don't see this file
Oops forgot to commit it


-- 
To view, visit http://gerrit.cloudera.org:8080/3072
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to