Dan Hecht has posted comments on this change. Change subject: IMPALA-3441, IMPALA-3659: check for malformed Avro data ......................................................................
Patch Set 10: (19 comments) http://gerrit.cloudera.org:8080/#/c/3072/10/be/src/exec/hdfs-avro-scanner-ir.cc File be/src/exec/hdfs-avro-scanner-ir.cc: Line 117: if (!parse_status_.ok()) return false; why is this line needed? PS10, Line 134: ( nit: remove extraneous parens Line 142: } else if (type == TYPE_DOUBLE) { could save this cmp/branch by rewriting: } else { DCHECK_EQ(type, TYPE_DOUBLE); ... } but okay to ignore in this patch PS10, Line 154: ( nit: remove parens PS10, Line 178: ( remove parens http://gerrit.cloudera.org:8080/#/c/3072/10/be/src/exec/hdfs-avro-scanner.cc File be/src/exec/hdfs-avro-scanner.cc: Line 75: } let's move this adjacent to the other constructor to make it easier to notice if one needs updating the other should to. http://gerrit.cloudera.org:8080/#/c/3072/10/be/src/exec/hdfs-avro-scanner.h File be/src/exec/hdfs-avro-scanner.h: PS10, Line 214: : add data_end_val Line 240: /// is exceeded when allocating the new char buffer. It returns true otherwise. document return value / parse_status_ side effects PS10, Line 260: . mention it updates parse_status_ on error PS10, Line 278: outside of xcompiled functions why is this important? is it to just limit the amount of IR or is there another reason it has to be done in native code? (I'm fine with this factoring, just want to clarify the comment). http://gerrit.cloudera.org:8080/#/c/3072/10/be/src/exec/read-write-util.cc File be/src/exec/read-write-util.cc: Line 28: } this could be unrolled, but if we're getting good perf as is, okay to leave alone for now. Line 31: if ((y & 0x8000) == 0) return 10; i.e. like you've unrolled this already. PS10, Line 61: MAX_ZLONG_LEN put a brief comment explaining why this is MAX_ZLONG_LEN rather than max_len (it's to protect FindZLongLength() which we use even if it's a ZInt). also remove extra parens. http://gerrit.cloudera.org:8080/#/c/3072/10/be/src/exec/read-write-util.h File be/src/exec/read-write-util.h: Line 99: }; maybe wrap this in "namespace internal" to help ensure the typedefs are used instead. PS10, Line 106: 'buf_len' is decremented update comment PS10, Line 109: max_len = MAX_ZLONG_LEN does any other callsites beside ReadZInt() want to set this parameter to something different? If not, we could save passing this parameter at every callsite (and also propagate the constant) by defining: template<int MAX_LEN, typename ZResult> ZResult ReadZInteger(buf, buf_end) which is then wrapped by ReadZLong, ReadZInt. (or get rid of the second template arg by having a single result type ZResult which is <bool, int64_t> see below also). Line 239: if (r.error) return ZIntResult(); rather than adding this cmp/br, we could do the template described above, or have a constructor for ZIntResult that takes a ZLongResult. Or we could just have a single result type: ZResult which is <bool, int64_t> which can work okay for both functions. http://gerrit.cloudera.org:8080/#/c/3072/10/be/src/exec/scanner-context.inline.h File be/src/exec/scanner-context.inline.h: Line 116: inline bool ScannerContext::Stream::ReadVLong(int64_t* value, Status* status) { are these "V" routines used by avro? do they need any fixing? Line 156: if (r.error) return false; > UNLIKELY? don't we need to set *status on the r.error path? would be good to add a test case for this path. -- To view, visit http://gerrit.cloudera.org:8080/3072 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132 Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
