Mostafa Mokhtar has posted comments on this change.

Change subject: IMPALA-5036: Parquet count star optimization
......................................................................

Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6812/1/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 455: DCHECK_LE(row_group_rows_read_, file_metadata_.num_rows);
What if file_metadata_.num_rows or file_metadata_.row_groups[row_group_idx_].num_rows has a negative value?
We have seen cases where a single file had so many rows that the count overflowed and the stats held a negative value.

Line 1455: // Column readers are not needed because we are not reading from any columns if this
> DCHECK that there is exactly one materialized slot
Can we then optimize something like
  select count(l_comment) from lineitem
to
  select count(*) from lineitem
The latter is 7x faster.

http://gerrit.cloudera.org:8080/#/c/6812/1/testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-stats-agg.test:

Line 34: | output: sum_zero_if_empty(functional_parquet.alltypes.parquet-stats: num_rows)
> i don't know what this means.
Why do we need to print this information in the plan? Won't this be enabled for all Parquet files moving forward?

--
To view, visit http://gerrit.cloudera.org:8080/6812
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-HasComments: Yes
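
On the Line 455 comment about negative num_rows values: one possible way to guard against this is to validate the footer row counts before relying on them. The following is only a rough sketch. FileMeta, RowGroupMeta and RowCountsAreSane are hypothetical stand-ins, not Impala's actual types or functions, and the real scanner may prefer to return an error status rather than a bool.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-ins for the Parquet footer fields mentioned in the review.
// These are NOT Impala's real structs; they only carry the row counts.
struct RowGroupMeta {
  int64_t num_rows;
};

struct FileMeta {
  int64_t num_rows;
  std::vector<RowGroupMeta> row_groups;
};

// Returns true only if the file-level count and every row-group count are
// non-negative and the row-group counts can be summed without overflowing
// int64_t. A caller could skip the stats-based count(*) path when this fails.
bool RowCountsAreSane(const FileMeta& file) {
  if (file.num_rows < 0) return false;
  int64_t total = 0;
  for (const RowGroupMeta& rg : file.row_groups) {
    if (rg.num_rows < 0) return false;
    // Check for overflow before adding, since signed overflow is undefined.
    if (rg.num_rows > std::numeric_limits<int64_t>::max() - total) return false;
    total += rg.num_rows;
  }
  return true;
}
```

A check like this would run once per file footer, so its cost should be negligible next to the scan itself.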
