Hello Gabor Kaszab, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19898

to look at the new patch set (#3).

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................

IMPALA-10186: Fix writing empty parquet page

Fixes a bug that wrote an empty Parquet page when a page fills (or
reaches parquet_page_row_count_limit) at the same time that its
dictionary fills.

When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect that the page
was full and create a new page. It would then detect that the dictionary
was full and create another page, leaving the page it had just created
empty.

Parquet readers such as Hive raise an error when they encounter an empty
page. This patch aims to make generating an empty page impossible by
reworking AppendRow and adding DCHECKs for empty pages. Dictionary size
is now checked in FinalizeCurrentPage, so whenever a page is written the
dictionary is also flushed if it is full.
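
The ordering problem and the fix described above can be sketched with a
toy model. This is an illustrative Python simulation, not Impala's C++
writer: the functions write_buggy and write_fixed, their parameters, and
the use of a set as the dictionary are all hypothetical, and flushing the
dictionary is modeled simply as clearing that set. Only the name
finalize_current_page echoes FinalizeCurrentPage from the patch.

```python
def write_buggy(values, page_row_limit, dict_limit):
    """Pre-patch ordering as described: two independent checks per row."""
    pages, rows, dictionary = [], 0, set()
    for v in values:
        dictionary.add(v)
        rows += 1
        if rows >= page_row_limit:         # page is full: start a new page
            pages.append(rows)
            rows = 0
        if len(dictionary) >= dict_limit:  # dictionary is full: start another page
            pages.append(rows)             # rows may be 0 here, i.e. an empty page
            rows = 0
            dictionary.clear()
    if rows:
        pages.append(rows)
    return pages


def write_fixed(values, page_row_limit, dict_limit):
    """Post-patch idea: one finalize path that also checks the dictionary."""
    pages, rows, dictionary = [], 0, set()

    def finalize_current_page():
        nonlocal rows
        # Stands in for the DCHECKs that guard against empty pages.
        assert rows > 0, "attempted to write an empty page"
        pages.append(rows)
        rows = 0
        if len(dictionary) >= dict_limit:  # flush the dictionary if it is full
            dictionary.clear()

    for v in values:
        dictionary.add(v)
        rows += 1
        if rows >= page_row_limit or len(dictionary) >= dict_limit:
            finalize_current_page()
    if rows:
        finalize_current_page()
    return pages


# Six distinct values; the page and the dictionary both fill every 3 rows.
values = list(range(6))
print(write_buggy(values, 3, 3))  # [3, 0, 3, 0]
print(write_fixed(values, 3, 3))  # [3, 3]
```

In the toy model the zero-row entries appear only when both limits are
hit on the same row, which is the coincidence the commit message
describes.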

Addresses clang-tidy warnings by adding override in source files.

Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by large value.
  This seems useful as a theoretical corner-case; it currently writes
  the too-large value to the page anyway, but if we ever start checking
  whether the first value will fit the page this could become an issue.
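
The third corner case can be illustrated with another toy model. This is
a hedged Python sketch, not the Impala writer: write_pages,
page_size_bytes and page_row_limit are hypothetical names, and the rule
that an empty page accepts its first value regardless of size is an
assumption taken from the note above that the too-large value is
currently written anyway.

```python
def write_pages(values, page_size_bytes, page_row_limit):
    """Group byte-string values into pages under a size and a row-count limit."""
    pages, current, used = [], [], 0
    for v in values:
        size = len(v)
        # Close the page for size reasons only if it already holds a value;
        # an empty page always accepts its first value, even an oversized one.
        if current and used + size > page_size_bytes:
            pages.append(current)
            current, used = [], 0
        current.append(v)
        used += size
        if len(current) >= page_row_limit:  # row-count limit reached
            pages.append(current)
            current, used = [], 0
    if current:
        pages.append(current)
    return pages


# Two small values hit the row limit, then one value larger than a page.
pages = write_pages([b"a", b"b", b"x" * 100], page_size_bytes=10, page_row_limit=2)
print([len(p) for p in pages])  # [2, 1]
```

If the `current and` guard were dropped, the oversized value would close
the still-empty page that follows the row-count limit, which is the
hypothetical future regression the test is meant to catch.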

Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/util/dict-encoding.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/empty_parquet_page/data.csv
M tests/query_test/test_parquet_page_index.py
6 files changed, 102,262 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/19898/3
--
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Smith
Gerrit-Reviewer: Gabor Kaszab
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Smith
Gerrit-Reviewer: Zoltan Borok-Nagy
