[
https://issues.apache.org/jira/browse/IMPALA-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724344#comment-17724344
]
ASF subversion and git services commented on IMPALA-10186:
----------------------------------------------------------
Commit 2fc4f747966552e0f8a0fe1bbe5d50501bb70c3a in impala's branch
refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2fc4f7479 ]
IMPALA-10186: Fix writing empty parquet page
Fixes writing an empty parquet page when a page fills (or reaches
parquet_page_row_count_limit) at the same time that its dictionary
fills.
When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect the page was
full and create a new page. It would then detect the dictionary is full
and create another page, resulting in an empty page.
Parquet readers like Hive error if they encounter an empty page. This
patch attempts to make it impossible to generate an empty page by
reworking AppendRow and adding DCHECKs for empty pages. Dictionary size
is checked on FinalizeCurrentPage so whenever a page is written, we also
flush the dictionary if full.
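The double page break described above can be reproduced with a toy model. The following is a minimal sketch, not Impala's code: `ToyColumnWriter`, `AppendRowOld`, `AppendRowNew`, `WriteAll`, and the limits of 2 are all hypothetical, and "flushing" the dictionary is modeled as simply clearing it. It assumes both limits are at least 1.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy model of a dictionary-encoded column writer. The names and
// limits are illustrative only and are not Impala's actual classes.
struct ToyColumnWriter {
  std::size_t page_row_limit;  // stand-in for parquet_page_row_count_limit (>= 1)
  std::size_t dict_limit;      // distinct values before the dictionary is full (>= 1)
  std::set<int> dict;
  std::size_t rows_in_page = 0;
  std::vector<std::size_t> page_row_counts;  // one entry per page written

  ToyColumnWriter(std::size_t page_limit, std::size_t dictionary_limit)
      : page_row_limit(page_limit), dict_limit(dictionary_limit) {}

  bool PageFull() const { return rows_in_page >= page_row_limit; }
  bool DictFull() const { return dict.size() >= dict_limit; }

  // Records the current page's row count and starts a fresh page.
  void WritePage() {
    page_row_counts.push_back(rows_in_page);
    rows_in_page = 0;
  }

  void AddValue(int value) {
    dict.insert(value);
    ++rows_in_page;
  }

  // Old flow: two independent checks, each of which starts a new page.
  void AppendRowOld(int value) {
    if (PageFull()) WritePage();
    if (DictFull()) {
      dict.clear();  // assumption: "flush" modeled as clearing the dictionary
      WritePage();   // writes an empty page if the check above just fired
    }
    AddValue(value);
  }

  // Reworked flow: a single finalize step that also handles the dictionary.
  void FinalizeCurrentPage() {
    assert(rows_in_page > 0);  // plays the role of the DCHECK for empty pages
    WritePage();
    if (DictFull()) dict.clear();
  }

  void AppendRowNew(int value) {
    if (PageFull() || DictFull()) FinalizeCurrentPage();
    AddValue(value);
  }
};

// Feeds all values through one of the two flows and returns the row count of
// every page written, including the final partially filled page.
std::vector<std::size_t> WriteAll(const std::vector<int>& values, bool reworked) {
  ToyColumnWriter writer(/*page_limit=*/2, /*dictionary_limit=*/2);
  for (int value : values) {
    if (reworked) {
      writer.AppendRowNew(value);
    } else {
      writer.AppendRowOld(value);
    }
  }
  if (writer.rows_in_page > 0) writer.WritePage();
  return writer.page_row_counts;
}
```

With three distinct values, the page and the dictionary both fill on the second row. In this model the old flow then writes pages of 2, 0, and 1 rows, while the reworked flow writes pages of 2 and 1 rows.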
Addresses clang-tidy by adding override in source files.
Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by large value.
This seems useful as a theoretical corner-case; it currently writes
the too-large value to the page anyway, but if we ever start checking
whether the first value will fit the page this could become an issue.
Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Reviewed-on: http://gerrit.cloudera.org:8080/19898
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Write invalid parquet PageLocations which table sort by some columns
> --------------------------------------------------------------------
>
> Key: IMPALA-10186
> URL: https://issues.apache.org/jira/browse/IMPALA-10186
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.2.0
> Reporter: guojingfeng
> Assignee: Michael Smith
> Priority: Major
> Labels: parquet
> Fix For: Impala 4.3.0
>
>
> The current Parquet writer writes -1 to PageLocation.offset and
> PageLocation.first_row_index when it encounters an empty page.
> hdfs-parquet-file-writer.cc Line: 808 ~ 819
> {code:java}
>   // Write data pages
>   for (const DataPage& page : pages_) {
>     if (page.header.data_page_header.num_values == 0) {
>       // Skip empty pages
>       location.offset = -1;
>       location.compressed_page_size = 0;
>       location.first_row_index = -1;
>       AddLocationToOffsetIndex(location);
>       continue;
>     }
> {code}
> However, these -1 values may cause the ComputeCandidatePages function to
> run into an unexpected state.
> {code:java}
> bool ComputeCandidatePages(
>     const vector<parquet::PageLocation>& page_locations,
>     const vector<RowRange>& candidate_ranges,
>     const int64_t num_rows, vector<int>* candidate_pages) {
>   if (!ValidatePageLocations(page_locations, num_rows)) return false;
> {code}
> which in turn causes IMPALA-9952.
>
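To illustrate why a -1 entry is a problem for a reader, here is a minimal sketch of a page-location check. It is an assumption about what such validation might look like, not Impala's actual ValidatePageLocations: `ToyPageLocation` and `ToyValidatePageLocations` are hypothetical names, and only the three field names mirror the Parquet PageLocation struct quoted above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical mirror of the three PageLocation fields mentioned in the issue.
struct ToyPageLocation {
  std::int64_t offset;
  std::int32_t compressed_page_size;
  std::int64_t first_row_index;
};

// Assumed rules (not taken from Impala's source): offsets must be non-negative,
// and first_row_index must be within [0, num_rows) and strictly increasing.
bool ToyValidatePageLocations(const std::vector<ToyPageLocation>& locations,
                              std::int64_t num_rows) {
  std::int64_t previous_first_row = -1;
  for (const ToyPageLocation& location : locations) {
    if (location.offset < 0) return false;
    if (location.first_row_index < 0) return false;
    if (location.first_row_index >= num_rows) return false;
    if (location.first_row_index <= previous_first_row) return false;
    previous_first_row = location.first_row_index;
  }
  return true;
}
```

Under these assumed rules, an offset index containing the placeholder entry written for an empty page (offset -1, first_row_index -1) fails the check, so page filtering cannot proceed for that column chunk.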