parquet filtering + projection

Gabor Szadovszky Mon, 13 Jan 2020 05:10:07 -0800

Hi All,

The current Parquet filters handle missing columns (columns that are not in
the file) as if their values were all null. This is completely logical. The
question is how Parquet filtering should handle columns that are in the file
(with real values) but missing from the projection.
While implementing column indexes I thought this situation was clear. The
projection restricts the visible columns to the ones specified by the user,
so columns that are in the file but not in the projection shall be handled
the same way as columns that are not in the file.
This is how column index filtering is implemented. (To guarantee that
only the correct records are retrieved, we need to read the columns in
the filter to check the values one by one.)
The problem is that the other filters (the dictionary and statistics
filters) do not take the projection into account. Because of that,
Parquet 1.11.0 introduced a regression when filtering on columns that are
not in the projection (but are in the file).
I think column index filtering works correctly, but I am curious about
your opinions.
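
To make the disagreement concrete, here is a minimal Python sketch of the two behaviours described above. It is not parquet-mr code: `ColumnStats`, `can_drop_eq`, and the two `drop_*` functions are hypothetical names invented for illustration, and the row group statistics are made up. It only models a statistics-style check for `column == value` with a non-null value.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics of one row group."""
    min_value: int
    max_value: int
    null_count: int
    value_count: int


def can_drop_eq(stats: Optional[ColumnStats], value: int) -> bool:
    """Return True if no row can satisfy `column == value` (value is non-null).

    A column without statistics (stats is None) is treated as missing,
    i.e. as if all of its values were null, so nothing can match.
    """
    if stats is None:
        return True
    if stats.null_count == stats.value_count:
        return True  # the column holds only nulls
    return value < stats.min_value or value > stats.max_value


def drop_ignoring_projection(row_group: Dict[str, ColumnStats],
                             column: str, value: int) -> bool:
    """Projection-unaware check: looks only at what is in the file."""
    return can_drop_eq(row_group.get(column), value)


def drop_respecting_projection(row_group: Dict[str, ColumnStats],
                               projection: Set[str],
                               column: str, value: int) -> bool:
    """Projection-aware check: a column outside the projection is
    handled exactly like a column that is not in the file."""
    visible = row_group.get(column) if column in projection else None
    return can_drop_eq(visible, value)


# Made-up row group: both columns exist in the file with real values.
row_group = {
    "id": ColumnStats(min_value=1, max_value=100, null_count=0, value_count=100),
    "x": ColumnStats(min_value=0, max_value=10, null_count=0, value_count=100),
}
projection = {"id"}  # "x" is in the file but not projected

# Filter: x == 5
ignoring = drop_ignoring_projection(row_group, "x", 5)
respecting = drop_respecting_projection(row_group, projection, "x", 5)
```

With these assumed statistics the projection-unaware check keeps the row group, while the projection-aware check drops it, so the two kinds of filter give different answers for the same file, filter, and projection.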


Thanks a lot,
Gabor
