gszadovszky commented on pull request #30517:
URL: https://github.com/apache/spark/pull/30517#issuecomment-739824272


   @wangyum, @dongjoon-hyun, parquet-mr filtering handles missing columns (columns that appear in the filter but not in the file) as if they contained only null values. Projection behaves as if the non-projected columns were dropped from the file, so I think we should handle non-projected columns just like non-existing ones.
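   To make the missing-column behaviour concrete, here is a minimal sketch (the column name `not_in_file` is made up; any column that appears in the filter but not in the file behaves the same way):

```java
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;

public class MissingColumnFilterSketch {
  public static void main(String[] args) {
    // Hypothetical column that is referenced by the filter but is not
    // present in the file being read.
    IntColumn missing = FilterApi.intColumn("not_in_file");

    // parquet-mr evaluates the missing column as if it contained only nulls,
    // so an "is null" predicate keeps every row group / page ...
    FilterPredicate keepsEverything = FilterApi.eq(missing, null);
    // ... while "is not null" (or any comparison to a concrete value)
    // can drop all of them.
    FilterPredicate dropsEverything = FilterApi.notEq(missing, null);

    // This is the form that would be handed to the read options / read support.
    FilterCompat.Filter filter = FilterCompat.get(keepsEverything);
    System.out.println(filter + " / " + dropsEverything);
  }
}
```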
   The problem is the following: setRequestedSchema is a public method of ParquetFileReader, so it can be invoked any time after the reader is constructed. All row-group level filtering is done in the constructor, so the projected schema unfortunately has no influence on it. I think this is not correct: if the filter references non-projected columns, the filtering might drop row groups based on them (using statistics or bloom filters) even though we will never read any of their values. Column-index based filtering works differently, since it drops pages rather than row groups: it is executed after the projection schema is set, so the projection does influence it.
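   A minimal sketch of the ordering issue against the raw parquet-mr API (not the Spark code path; the file path, schema string and column names are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.HadoopReadOptions;
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ProjectionOrderingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // hypothetical input file

    ParquetReadOptions options = HadoopReadOptions.builder(conf)
        // The filter references "b", a column that is NOT projected below.
        .withRecordFilter(FilterCompat.get(
            FilterApi.eq(FilterApi.intColumn("b"), 1)))
        .build();

    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(path, conf), options)) {
      // Row-group level filtering (statistics, dictionary, bloom filters)
      // has already run inside open()/the constructor, so it used "b"
      // even though the projection below drops that column.

      // The projection can only be applied after construction:
      MessageType projection = MessageTypeParser.parseMessageType(
          "message projected { required int32 a; }");
      reader.setRequestedSchema(projection);

      // Column-index (page level) filtering runs here, after the projection
      // has been set, so it does see the requested schema.
      PageReadStore pages;
      while ((pages = reader.readNextFilteredRowGroup()) != null) {
        System.out.println("rows after page filtering: " + pages.getRowCount());
      }
    }
  }
}
```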
   PARQUET-1765 was only about correcting the calculation of the row count. Without that fix, the row count might differ from the number of rows actually read when column-index filtering is used together with projected columns.
   
   I'll start a discussion thread about this topic on the parquet dev list. Feel free to join.

