[ https://issues.apache.org/jira/browse/IMPALA-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825041#comment-17825041 ]
ASF subversion and git services commented on IMPALA-4371:
---------------------------------------------------------
Commit 82103101826309138d22864d04137da2df15f0c3 in impala's branch
refs/heads/branch-3.4.2 from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=821031018 ]
IMPALA-9952: Fix page index filtering for empty pages
As IMPALA-4371 and IMPALA-10186 point out, Impala might write
empty data pages. It usually does so when it has to write a page
bigger than the current page size. Whether we really need to write
empty data pages is a different question, but we need to handle them
correctly because such files already exist.
The Parquet offset index entries that correspond to empty data pages
are invalid PageLocation objects with 'compressed_page_size' = 0.
Before this commit, Impala didn't ignore the empty page locations but
generated a warning. Since an invalid page index doesn't fail a scan
by default, Impala continued scanning the file with semi-initialized
page filtering. This resulted in a 'Top level rows aren't in sync'
error, or a crash in DEBUG builds.
With this commit, Impala ignores empty data pages and is still able to
filter the rest of the pages. Also, if the page index is corrupt
for some other reason, Impala correctly resets the page filtering
logic and falls back to regular scanning.
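The behavior described above can be illustrated with a minimal sketch.
This is not the code from the commit: the PageLocation struct below is a
simplified stand-in for the Parquet Thrift type, and CollectValidPages,
its parameters, and the specific corruption checks are hypothetical
names and rules chosen for illustration. Entries with
'compressed_page_size' = 0 are skipped; any other inconsistency clears
the partial result so the caller never works with semi-initialized
state and can fall back to a regular scan.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified, hypothetical stand-in for the Parquet PageLocation struct.
struct PageLocation {
  int64_t offset;
  int32_t compressed_page_size;
  int64_t first_row_index;
};

// Collects the indexes of usable page locations into 'valid_pages'.
// Returns false if the index looks corrupt; in that case 'valid_pages'
// is left empty so the caller can fall back to scanning without the
// page index. The checks below are illustrative assumptions.
bool CollectValidPages(const std::vector<PageLocation>& locations,
                       int64_t num_rows,
                       std::vector<std::size_t>* valid_pages) {
  assert(valid_pages != nullptr);
  valid_pages->clear();
  int64_t prev_first_row = -1;
  for (std::size_t i = 0; i < locations.size(); ++i) {
    const PageLocation& loc = locations[i];
    // Empty data page: ignore it instead of treating the index as invalid.
    if (loc.compressed_page_size == 0) continue;
    const bool corrupt = loc.compressed_page_size < 0 || loc.offset < 0 ||
                         loc.first_row_index <= prev_first_row ||
                         loc.first_row_index >= num_rows;
    if (corrupt) {
      // Reset any partially built state before reporting failure.
      valid_pages->clear();
      return false;
    }
    prev_first_row = loc.first_row_index;
    valid_pages->push_back(i);
  }
  return true;
}
```

A caller would use the returned indexes for page filtering only when the
function returns true, and scan the whole column chunk otherwise.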
Testing:
* Added unit test for empty data pages
* Added e2e test for empty data pages
* Added e2e test for invalid page index
Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Reviewed-on: http://gerrit.cloudera.org:8080/16503
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Incorrect DCHECK-s in hdfs-parquet-table-writer
> -----------------------------------------------
>
> Key: IMPALA-4371
> URL: https://issues.apache.org/jira/browse/IMPALA-4371
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 2.2.4
> Reporter: Zoltan Ivanfi
> Assignee: Zoltan Ivanfi
> Priority: Major
> Fix For: Impala 3.0
>
>
> The following two DCHECK-s in hdfs-parquet-table-writer.cc seem to be invalid:
> {code:java}
> // Last page might be empty
> if (page.header.data_page_header.num_values == 0) {
>   DCHECK_EQ(page.header.compressed_page_size, 0);
>   DCHECK_EQ(i, num_data_pages_ - 1);
>   continue;
> }
> {code}
> The first DCHECK asserts that if a page holds no values then its compressed
> size is also 0. This, however, seems to be a false assumption, as the compressed
> output may include metadata, such as length or checksum.
> The GZIP compressor, for example, states that an input of 0 bytes requires 23
> bytes when compressed. The Snappy compressor also mentions storing length
> information in the compressed output. The compressed size estimation in the
> LZ4 compressor also contains a constant part.
> The "Last page might be empty" comment and the second DCHECK also seem to be
> based on a false assumption. If a value doesn't fit on the current page,
> {{AppendRow}} creates a new, possibly bigger page and tries writing the data
> in the new page instead. This means that if the data is bigger than the page
> size, then the current page is finalized and a new page is added, even if the
> original page was empty. In other words, empty pages can occur in the middle
> of the {{pages_}} array as well, not only at the end of it.
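
The scenario in the last paragraph of the description can be reproduced
with a toy model. This is a sketch under stated assumptions, not the
writer from hdfs-parquet-table-writer.cc: ToyColumnWriter, AppendValue,
and the two-step retry (first a fresh default-size page, then a page
sized for the value) are hypothetical and only meant to show how an
empty page can end up between two non-empty ones.

```cpp
#include <cstdint>
#include <vector>

// Toy page record; only the fields needed for the illustration.
struct Page {
  int64_t capacity;
  int64_t bytes_used;
  int64_t num_values;
};

// Hypothetical model of a column writer, not Impala's implementation.
class ToyColumnWriter {
 public:
  explicit ToyColumnWriter(int64_t default_page_size)
      : default_page_size_(default_page_size) {
    NewPage(default_page_size_);
  }

  void AppendValue(int64_t value_size) {
    if (TryAppend(value_size)) return;
    // The value does not fit: finalize the current page and retry on a
    // fresh page of the default size.
    NewPage(default_page_size_);
    if (TryAppend(value_size)) return;
    // Still too big: the fresh page is finalized while it is empty, and
    // a page large enough for the value is opened instead.
    NewPage(value_size);
    TryAppend(value_size);
  }

  const std::vector<Page>& pages() const { return pages_; }

 private:
  // Writes the value to the current page if it fits.
  bool TryAppend(int64_t value_size) {
    Page& cur = pages_.back();
    if (cur.bytes_used + value_size > cur.capacity) return false;
    cur.bytes_used += value_size;
    ++cur.num_values;
    return true;
  }

  // Opening a new page implicitly finalizes the previous one.
  void NewPage(int64_t capacity) { pages_.push_back(Page{capacity, 0, 0}); }

  int64_t default_page_size_;
  std::vector<Page> pages_;
};
```

With a default page size of 10, appending values of size 6, 25, and 3
leaves four pages in this model, and the second one holds no values even
though it is not the last page. A loop over the pages therefore cannot
assume that an empty page is the final entry.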