Hi everyone,

ParquetFileReader handles both filtering and projection. Row group level filtering is done at construction time, so the row groups that do not satisfy the filter are dropped at the very beginning. The projected schema can be set with the method setRequestedSchema. This method simply replaces the original schema with the projected one, so from then on everything works as if the file contained only the columns in the new schema (the others are not read at all). The filtering implementations handle missing columns (columns that are referenced in the filter but do not exist in the file) as if all of their values were null.
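
To make the order of operations above concrete, here is a toy sketch in plain Java. It does not use the real parquet-mr classes: RowGroup, EqFilter, and ToyReader are made-up stand-ins, the statistics are invented, and only the method name setRequestedSchema is borrowed from ParquetFileReader. It models just two things: row groups are filtered in the constructor, and a column that is absent from the file is treated as all null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are parquet-mr classes.
final class ProjectionSketch {

    /** Row group modelled as column name -> {min, max} statistics. */
    static final class RowGroup {
        final String name;
        final Map<String, long[]> stats = new HashMap<>();

        RowGroup(String name) { this.name = name; }

        RowGroup with(String column, long min, long max) {
            stats.put(column, new long[] {min, max});
            return this;
        }
    }

    /** Hypothetical filter "column == value", evaluated on statistics only. */
    static final class EqFilter {
        final String column;
        final long value;

        EqFilter(String column, long value) { this.column = column; this.value = value; }

        /** True if the row group provably contains no matching row. */
        boolean canDrop(RowGroup rg) {
            long[] minMax = rg.stats.get(column);
            if (minMax == null) {
                return true; // missing column: treated as all null, nothing can match
            }
            return value < minMax[0] || value > minMax[1];
        }
    }

    static final class ToyReader {
        final List<String> keptRowGroups = new ArrayList<>();
        Set<String> requestedColumns; // null means "all columns"

        // Row group filtering happens here, before any projection is known.
        ToyReader(List<RowGroup> rowGroups, EqFilter filter) {
            for (RowGroup rg : rowGroups) {
                if (!filter.canDrop(rg)) keptRowGroups.add(rg.name);
            }
        }

        // Setting the projection later has no effect on the dropped row groups.
        void setRequestedSchema(Set<String> columns) { this.requestedColumns = columns; }
    }

    static ToyReader demo() {
        List<RowGroup> file = List.of(
            new RowGroup("rg0").with("a", 0, 9).with("b", 0, 9),
            new RowGroup("rg1").with("a", 10, 19).with("b", 100, 199));
        ToyReader reader = new ToyReader(file, new EqFilter("b", 150));
        reader.setRequestedSchema(Set.of("a")); // "b" is not projected
        return reader;
    }

    public static void main(String[] args) {
        System.out.println(demo().keptRowGroups);
    }
}
```

In this toy model rg0 is dropped because of the statistics of column b, even though only column a is requested afterwards.
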
Because the projection has no influence on row group level filtering, we might drop row groups based on a column that we would not read. I think we should handle non-projected columns the same way we handle missing ones.

Column-index based filtering makes the situation even worse. Because it works at page level rather than row group level, the related code runs after the projected schema is set. So column-index filtering behaves differently from the row group level filters (and I think its behaviour is the correct one).

I think we should enforce setting the projected schema at construction time so that all filtering works the same way. This would break backward compatibility. However, any fix for this issue would break backward compatibility, because column indexes have already been released as is.

What do you think?

Regards,
Gabor
