Tianyi Wang has uploaded a new patch set (#4).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter to the scan node, makes
rows_read_counter count top-level rows only, and updates both counters
less frequently.

When scanning nested columns, the current code counts both top-level
rows and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items is read. As a result, the scanner
spends around 10% of its time updating the counter in the following
simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection item counting into
collection_items_read_counter. Both counters are now updated once for
every row batch read. For the query above, scanning time decreases by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 30 insertions(+), 12 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/4
--
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tianyi Wang <[email protected]>
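
For readers who have not seen the diff, the idea in the commit message (tally rows and collection items locally, then update each shared counter once per row batch) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the patch: the Counter struct, its "updates" field, and ProcessBatch() are hypothetical names invented here, and the "batch" is simplified to a list of per-row collection item counts.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a shared profile counter (not Impala's class).
struct Counter {
  std::atomic<int64_t> value{0};
  // Number of times the shared counter was touched; kept only to make the
  // update frequency visible in this sketch.
  std::atomic<int64_t> updates{0};

  void Add(int64_t delta) {
    value.fetch_add(delta, std::memory_order_relaxed);
    updates.fetch_add(1, std::memory_order_relaxed);
  }
};

// Hypothetical batch processing step. Each element of items_per_row is the
// number of nested collection items found in one top-level row.
void ProcessBatch(const std::vector<int>& items_per_row,
                  Counter* rows_read_counter,
                  Counter* collection_items_read_counter) {
  // Tally in plain local variables while walking the batch.
  int64_t rows = 0;
  int64_t items = 0;
  for (int n : items_per_row) {
    ++rows;      // top-level rows only
    items += n;  // nested items are counted separately
  }
  // One shared-counter update per counter per batch, instead of one per
  // collection read.
  rows_read_counter->Add(rows);
  collection_items_read_counter->Add(items);
}

// Small self-check showing the intended behaviour of the sketch.
void CheckExample() {
  Counter rows_read;
  Counter items_read;
  ProcessBatch({3, 0, 2}, &rows_read, &items_read);
  assert(rows_read.value.load() == 3);
  assert(items_read.value.load() == 5);
  assert(rows_read.updates.load() == 1);
  assert(items_read.updates.load() == 1);
}
```

In this sketch the cost of touching the shared counters scales with the number of row batches rather than with the number of collections read, which is the effect the commit message attributes the speedup to.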
