Hi everyone,

ParquetFileReader handles both the filtering and the projection. Row group
level filtering is done at construction time so the row groups that do not
fulfil the filter requirements are dropped at the very beginning. The
projected schema can be set with the method setRequestedSchema. In this
method we simply replace the original schema with the projected one, so
everything works as if only the columns in the new schema were in the file
(the others are not read at all).
The filtering implementations handle missing columns (the ones that are
specified in the filter but do not exist in the file) as if all the
related values were null.

Because the projection has no influence on the row group level filtering,
we might drop row groups based on a column that we will not read. I think
we should handle the non-projected columns just like we handle the missing
ones.
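
To make the difference concrete, here is a minimal sketch on a toy model.
It does not use the parquet-mr API: RowGroup, mightMatch and keptRowGroups
are made-up names, the statistics are plain min/max pairs, and the only
filter supported is "column == value". It assumes that a missing column is
treated as all-null, so it can never equal a non-null value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ProjectionFilterSketch {

    /** Hypothetical, simplified row group: column name -> {min, max} statistics. */
    public static final class RowGroup {
        final String name;
        final Map<String, long[]> stats;

        public RowGroup(String name, Map<String, long[]> stats) {
            this.name = name;
            this.stats = stats;
        }
    }

    /**
     * Evaluates the filter "column == value" against one row group.
     * A column that is not visible is treated as all-null, so it can
     * never equal a non-null value.
     */
    public static boolean mightMatch(RowGroup rowGroup, String column, long value,
                                     Set<String> visibleColumns) {
        long[] minMax = visibleColumns.contains(column) ? rowGroup.stats.get(column) : null;
        if (minMax == null) {
            return false; // missing column: every value is null
        }
        return minMax[0] <= value && value <= minMax[1];
    }

    /** Returns the names of the row groups that survive the filter. */
    public static List<String> keptRowGroups(List<RowGroup> rowGroups, String column,
                                             long value, Set<String> visibleColumns) {
        List<String> kept = new ArrayList<>();
        for (RowGroup rowGroup : rowGroups) {
            if (mightMatch(rowGroup, column, value, visibleColumns)) {
                kept.add(rowGroup.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<RowGroup> file = List.of(
            new RowGroup("rg0", Map.of("a", new long[] {0, 9}, "b", new long[] {0, 4})),
            new RowGroup("rg1", Map.of("a", new long[] {10, 19}, "b", new long[] {5, 9})));

        Set<String> fileSchema = Set.of("a", "b");
        Set<String> projectedSchema = Set.of("a"); // b is not projected

        // Filtering against the file schema: b's statistics decide which
        // row groups survive, although b itself is never read.
        System.out.println(keptRowGroups(file, "b", 5, fileSchema));      // prints [rg1]

        // Filtering against the projected schema: b is handled like a
        // missing (all-null) column, so no row group can match b == 5.
        System.out.println(keptRowGroups(file, "b", 5, projectedSchema)); // prints []
    }
}
```

The same filter gives two different answers depending on which schema it
is evaluated against. That is the inconsistency described above.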

What makes the situation even worse is the column-index based filtering. As
this filtering works at page level and not at row group level, we run the
related code after the projected schema is set. So column-index filtering
behaves differently from the row group level filters (and I think its
behaviour is the correct one).

I think we should enforce setting the projected schema at construction
time so that all the filtering works consistently. This would break
backward compatibility, though. However, any fix for this issue would break
backward compatibility because column indexes have already been released
as is.

What do you think?

Regards,
Gabor
