Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16503 )

Change subject: IMPALA-9952: Fix page index filtering for empty pages
......................................................................

IMPALA-9952: Fix page index filtering for empty pages

As IMPALA-4371 and IMPALA-10186 point out, Impala might write
empty data pages. It usually does that when it has to write a page
bigger than the current page size. Whether we really need to write
empty data pages is a different question, but we need to handle them
correctly as there are already such files out there.

The Parquet offset index entries corresponding to empty data pages
are invalid PageLocation objects with 'compressed_page_size' = 0.
Before this commit Impala didn't ignore the empty page locations, but
generated a warning. Since an invalid page index doesn't fail a scan by
default, Impala continued scanning the file with semi-initialized
page filtering. This resulted in a 'Top level rows aren't in sync'
error, or a crash in DEBUG builds.

With this commit Impala ignores empty data pages and is still able to
filter the rest of the pages. Also, if the page index is corrupt
for some other reason, Impala correctly resets the page filtering
logic and falls back to regular scanning.
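
The handling described above can be sketched roughly as follows. This is
an illustration only, not the code in this change: the PageLocation struct
is a simplified stand-in for the Parquet Thrift type, and the helper name
CollectValidPages, its parameters, and the exact consistency checks are
assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for the Parquet Thrift PageLocation struct; only the
// three fields of an offset index entry are modeled.
struct PageLocation {
  int64_t offset;
  int32_t compressed_page_size;
  int64_t first_row_index;
};

// Hypothetical helper (not Impala's actual function). Copies the usable
// entries of 'locations' into 'valid_pages', skipping empty pages
// (compressed_page_size == 0). Returns false if a non-empty entry is
// inconsistent; 'valid_pages' is then cleared so the caller can reset page
// filtering and fall back to a regular scan.
bool CollectValidPages(const std::vector<PageLocation>& locations,
    int64_t file_size, int64_t num_rows,
    std::vector<PageLocation>* valid_pages) {
  assert(valid_pages != nullptr);
  valid_pages->clear();
  int64_t prev_end = 0;         // End offset of the last accepted page.
  int64_t prev_first_row = -1;  // First row index of the last accepted page.
  for (const PageLocation& loc : locations) {
    // Empty page: ignore it instead of treating the whole index as unusable.
    if (loc.compressed_page_size == 0) continue;
    // Assumed sanity checks: positive size, no overlap with the previous
    // page, page fits in the file, row indexes strictly increase and stay
    // within the row group.
    bool ok = loc.compressed_page_size > 0
        && loc.offset >= prev_end
        && loc.offset <= file_size - loc.compressed_page_size
        && loc.first_row_index > prev_first_row
        && loc.first_row_index < num_rows;
    if (!ok) {
      valid_pages->clear();
      return false;
    }
    prev_end = loc.offset + loc.compressed_page_size;
    prev_first_row = loc.first_row_index;
    valid_pages->push_back(loc);
  }
  return true;
}
```

In this sketch a false return value is the signal to drop any partially
built filtering state, which mirrors the "reset and fall back to regular
scanning" behavior described in the commit message.
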

Testing:
* Added unit test for empty data pages
* Added e2e test for empty data pages
* Added e2e test for invalid page index

Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-common-test.cc
M be/src/exec/parquet/parquet-common.cc
M testdata/data/README
A testdata/data/alltypes_empty_pages.parquet
A testdata/data/alltypes_invalid_pages.parquet
M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M tests/query_test/test_parquet_stats.py
9 files changed, 174 insertions(+), 23 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16503/1
--
To view, visit http://gerrit.cloudera.org:8080/16503
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Gerrit-Change-Number: 16503
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
