Hi Guy,

Thanks for reporting the issue. I am working on it and there will be a PR this week.

Gengliang

On Mon, Dec 30, 2019 at 6:41 AM Guy Khazma <guy.kha...@ibm.com> wrote:

> Hi,
>
> It seems that Hive-style partition pruning is not working for file-based
> data sources such as Parquet and ORC.
> This causes serious performance degradation for non-Hive tables.
>
> The reason is that the FileScan
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala>
> abstract class is not aware of the partition and data filters.
> The method for getting the selectedPartitions calls the FileIndex listFiles
> method with an empty sequence for both - see here
> <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L74>.
>
> In the v1 data source, the FileSourceScanExec
> <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L160>
> class gets the partition and data filters and uses them to filter out
> unnecessary partitions by passing them to the listFiles function - see here
> <https://github.com/apache/spark/blob/5af77410bbb970059d9365b193987e0e44827c20/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L210>.
>
> Are there any ongoing plans to add support for that?
>
> Thanks,
> Guy
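
For anyone following the thread, here is a rough toy model of the behaviour described in the report. It is not Spark code: `ToyFileIndex`, `PartitionDirectory`, `list_files`, and the `/data/year=...` paths are all hypothetical names made up for illustration, and the model only assumes that listing with an empty filter sequence returns every partition, while listing with a partition predicate returns only the matching ones.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-ins for illustration only; none of these are Spark APIs.
PartitionValues = Dict[str, str]
PartitionFilter = Callable[[PartitionValues], bool]


@dataclass(frozen=True)
class PartitionDirectory:
    values: PartitionValues  # e.g. {"year": "2019"} for a year=2019 directory
    files: List[str]


class ToyFileIndex:
    def __init__(self, partitions: Sequence[PartitionDirectory]) -> None:
        self.partitions = list(partitions)

    def list_files(
        self, partition_filters: Sequence[PartitionFilter]
    ) -> List[PartitionDirectory]:
        # all() over an empty sequence is True, so passing no filters
        # keeps every partition: nothing is pruned.
        return [
            p for p in self.partitions
            if all(f(p.values) for f in partition_filters)
        ]


index = ToyFileIndex([
    PartitionDirectory({"year": "2018"}, ["/data/year=2018/part-0.parquet"]),
    PartitionDirectory(
        {"year": "2019"},
        ["/data/year=2019/part-0.parquet", "/data/year=2019/part-1.parquet"],
    ),
])


def year_is_2019(values: PartitionValues) -> bool:
    return values["year"] == "2019"


# Scan that does not know about the filters: lists with an empty sequence.
unpruned = index.list_files([])

# Scan that passes the partition filter through to the listing call.
pruned = index.list_files([year_is_2019])

print(len(unpruned), "partitions listed without filters")
print(len(pruned), "partition(s) listed with the year filter")
```

In this sketch the only difference between the two scans is whether the predicate reaches the listing call, which is the gap the report points at between the v2 and v1 code paths.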