gszadovszky commented on pull request #30517: URL: https://github.com/apache/spark/pull/30517#issuecomment-739824272
@wangyum, @dongjoon-hyun, parquet-mr filtering handles missing columns (columns that are referenced in the filter but do not exist in the file) as if they contained only null values. Projection works as if the non-projected columns were dropped from the file, so I think non-projected columns should be handled the same way as non-existing ones.

The problem is the following: setRequestedSchema is a public method of ParquetFileReader, so it can be invoked at any time after the reader is constructed. All the row-group level filtering is done in the constructor, so the projected schema has no influence on it (unfortunately). I think this is not correct: if the filter references non-projected columns, the filtering might drop row groups based on them (using statistics or bloom filters) even though we will not read any of their values.

Column indexes, meanwhile, work differently, since we drop pages rather than row groups. Column-index based filtering is executed after the projection schema is set, so the projection does have influence on it. PARQUET-1765 was only about correcting the calculation of the row count; without that fix, the row count might differ from the number of rows actually read when column-index filtering is used together with projected columns.

I'll start a discussion on the parquet dev list about this topic. Feel free to join.
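
To make the ordering issue concrete, here is a minimal parquet-mr sketch (not the Spark code path). The file name, schema, and column names are made up for illustration, and it assumes the ParquetReadOptions/ParquetFileReader API of a recent parquet-mr release:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ProjectionVsFilterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The filter references column "b", which is NOT part of the projection below.
    FilterPredicate pred = FilterApi.eq(FilterApi.intColumn("b"), 42);
    ParquetReadOptions options = ParquetReadOptions.builder()
        .withRecordFilter(FilterCompat.get(pred))
        .build();

    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path("data.parquet"), conf), options)) {

      // Row-group level filtering (statistics, dictionary, bloom filters) has
      // already run inside open()/the constructor, against the full file schema.
      // The predicate on "b" may therefore have dropped row groups here, even
      // though "b" will never be read.
      System.out.println("Row groups after construction: " + reader.getRowGroups().size());

      // The projection is only applied afterwards; it affects which pages are
      // read and the column-index based filtering, but not the row-group
      // pruning that already happened above.
      MessageType projection =
          MessageTypeParser.parseMessageType("message spark_schema { required int32 a; }");
      reader.setRequestedSchema(projection);

      // Column-index filtering is applied here, with the projection in effect
      // (and with the row counts corrected by PARQUET-1765).
      while (reader.readNextFilteredRowGroup() != null) {
        // consume pages...
      }
    }
  }
}
```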
