Skye Wanderman-Milne has uploaded a new patch set (#2).

Change subject: PREVIEW IMPALA-3441: check for malformed Avro data
......................................................................

PREVIEW IMPALA-3441: check for malformed Avro data

This patch adds error checking to the Avro scanner (both the codegen'd
and interpreted paths), including out-of-bounds checks and data validity
checks.

I ran a local benchmark using the following query:

  set num_scanner_threads=1;
  select max(i) from default.avro_ints_big;

where avro_ints_big is an Avro table with a single int column containing
~90MM values. With this patch, the total query time goes from 1.6s to
X.Xs (XX% increase), with the MaterializeTupleTime going from 975ms to
XXXXms (XX% increase).

TODO:
- I plan to write unit tests for most of these cases, and one or two
  malformed files for end-to-end tests. It's too hard to exercise all
  these cases with end-to-end tests.
- Perf numbers / improvements

Tests ran:
./run-tests.py query_test/test_scanners.py --table_formats avro/snap

Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
---
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/read-write-util.cc
M be/src/exec/read-write-util.h
M common/thrift/generate_error_codes.py
6 files changed, 271 insertions(+), 99 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/72/3072/2
--
To view, visit http://gerrit.cloudera.org:8080/3072
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
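
For illustration only, here is a minimal sketch of the kind of out-of-bounds
and data validity checks the commit message describes, applied to decoding an
Avro zigzag varint. It is not code from this patch: DecodeResult and
ReadZigZagLong are hypothetical names invented for this example, and the
actual changes in hdfs-avro-scanner.cc and read-write-util.h may be
structured quite differently.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch; not taken from the Impala code base.
struct DecodeResult {
  bool ok;          // false if the input was truncated or malformed
  int64_t value;    // decoded value, meaningful only when ok is true
  size_t consumed;  // number of bytes read, meaningful only when ok is true
};

// Decodes one Avro zigzag-encoded varint from [buf, buf + len).
// Hypothetical helper: it never reads past buf + len and rejects encodings
// longer than a 64-bit value can need.
inline DecodeResult ReadZigZagLong(const uint8_t* buf, size_t len) {
  const size_t kMaxBytes = 10;  // a 64-bit varint occupies at most 10 bytes
  uint64_t raw = 0;
  for (size_t i = 0; i < kMaxBytes; ++i) {
    // Out-of-bounds check: the buffer ended in the middle of a value.
    if (i >= len) return {false, 0, 0};
    const uint8_t byte = buf[i];
    raw |= static_cast<uint64_t>(byte & 0x7F) << (7 * i);
    if ((byte & 0x80) == 0) {
      // Last byte of the varint: undo the zigzag mapping.
      const int64_t value =
          static_cast<int64_t>(raw >> 1) ^ -static_cast<int64_t>(raw & 1);
      return {true, value, i + 1};
    }
  }
  // Validity check: the continuation bit was still set after 10 bytes.
  return {false, 0, 0};
}
```

A caller would test ok before using value, and report a scanner error
instead of materializing the tuple when it is false. A per-value branch of
this kind is the sort of extra work whose cost the benchmark in the commit
message is meant to measure.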
