Hi,

It seems that Hive-style partition pruning is not working for the file-based data sources, such as Parquet and ORC. This causes serious performance degradation for non-Hive tables.
The reason is that the FileScan <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala> abstract class, which the v2 file sources use, is not aware of the partition and data filters. Its method for computing the selected partitions calls the FileIndex listFiles method with an empty sequence for both arguments (see here <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L74>), so no partition is ever filtered out.

In the v1 data source, the FileSourceScanExec <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L160> class receives the partition and data filters and uses them to skip unnecessary partitions by passing them to the listFiles function (see here <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L210>).

Are there any plans to add support for this in the v2 file sources?

Thanks,
Guy
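
P.S. To make the difference concrete, here is a rough, self-contained sketch in plain Python. It is not Spark code: `parse_partition_values`, `list_files` and the example paths are hypothetical names made up for illustration. It only models the idea that a file listing given the partition filters can skip whole directories, while a listing called with an empty filter sequence returns every file.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types for this sketch only; they do not mirror Spark's classes.
PartitionValues = Dict[str, str]
PartitionFilter = Callable[[PartitionValues], bool]


def parse_partition_values(path: str) -> PartitionValues:
    """Extract key=value directory segments from a Hive-style path."""
    values: PartitionValues = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def list_files(paths: Sequence[str],
               partition_filters: Sequence[PartitionFilter]) -> List[str]:
    """Keep the paths whose partition values satisfy every filter.

    With an empty filter sequence nothing is pruned, which is the
    behaviour described above for the v2 code path.
    """
    return [
        p for p in paths
        if all(f(parse_partition_values(p)) for f in partition_filters)
    ]


# Made-up example layout.
paths = [
    "/data/events/year=2018/month=12/part-0.parquet",
    "/data/events/year=2019/month=01/part-0.parquet",
    "/data/events/year=2019/month=02/part-0.parquet",
]


def only_2019(values: PartitionValues) -> bool:
    return values.get("year") == "2019"


unpruned = list_files(paths, [])          # filters not passed down
pruned = list_files(paths, [only_2019])   # filters passed down

print(len(unpruned), len(pruned))
```

In this toy layout the filtered call lists only the two `year=2019` files, whereas the call with no filters lists all three, which is the extra I/O I am seeing on real tables.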