Github user chenghao-intel commented on a diff in the pull request:
https://github.com/apache/spark/pull/8023#discussion_r37258905
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala
---
@@ -183,6 +183,16 @@ private[sql] case class InMemoryRelation(
batchStats).asInstanceOf[this.type]
}
+ private[sql] def withChild(newChild: SparkPlan): this.type = {
--- End diff ---
@yhuai @liancheng After double-checking the source code, the SparkPlan of
`InMemoryRelation` is a `PhysicalRDD`, which holds a data source scanning RDD
instance as its property.
That's what I meant when I said we will not pick up the latest files under the
path when the `recache` method is called: the `RDD` is already materialized and
never changes afterwards. This PR re-creates the SparkPlan from the logical
plan, so `DataSourceStrategy` will rebuild the RDD based on the latest files.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L99
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L312
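The core issue can be illustrated with a minimal, self-contained sketch (hypothetical classes, not Spark's actual ones): a scan that captures a snapshot of the file listing at construction time, like `PhysicalRDD` holding a materialized RDD, versus a scan rebuilt from the logical plan, which re-lists the files and therefore sees new ones:

```scala
// Hypothetical minimal model of the recache problem (not Spark's real API).
object RecacheSketch {
  // Stand-in for the files currently under the data source path.
  var filesUnderPath: List[String] = List("part-00000")

  // Mimics PhysicalRDD: the scan captures the file listing once,
  // at construction time, and is never re-evaluated.
  final case class MaterializedScan(files: List[String])
  val cachedScan: MaterializedScan = MaterializedScan(filesUnderPath)

  // Mimics re-planning from the logical plan: the strategy lists
  // the files again, so new files are picked up.
  def rebuildScan(): MaterializedScan = MaterializedScan(filesUnderPath)

  def main(args: Array[String]): Unit = {
    filesUnderPath = filesUnderPath :+ "part-00001" // a new file appears
    println(cachedScan.files)    // still the old snapshot
    println(rebuildScan().files) // includes the new file
  }
}
```

This is only a model of the behavior, but it shows why calling `recache` on a plan that holds an already-materialized RDD cannot pick up new files, while re-running the planner can.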
I've actually tried some other approaches for the fix:
1. Update the code of `PhysicalRDD` to take an RDD builder instead of the RDD
itself as its property; however, this failed because it would widely impact the
existing code.
2. Create a customized RDD that takes the path as a parameter (instead of
the file statuses); however, this requires lots of interface changes in
`HadoopFsRelation`, since `inputFiles: Array[FileStatus]` is widely used by
`buildScan`, and in particular the partition pruning is done in
`DataSourceStrategy`, not in `HadoopFsRelation`.
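To make the interface tension in approach 2 concrete, here is a hedged sketch (the trait and case class names are hypothetical stand-ins, not the real `HadoopFsRelation` API): a relation whose `buildScan` receives a pre-computed file listing cannot refresh it, while a path-based relation could re-list files itself, but changing to that shape would break every caller that pre-computes and prunes `inputFiles`:

```scala
// Hypothetical stand-in for Hadoop's FileStatus.
final case class FileStatus(path: String)

// Existing shape: the planner lists and prunes the files, then hands
// the concrete listing to buildScan. The relation cannot refresh it.
trait ListingRelation {
  def buildScan(inputFiles: Array[FileStatus]): Seq[String]
}

// Alternative shape: the relation lists files from the path itself,
// so a rebuilt scan would see new files. But partition pruning done
// by the planner over inputFiles no longer has a place to plug in.
trait PathRelation {
  def path: String
  def listFiles(): Array[FileStatus]
  def buildScan(): Seq[String] = listFiles().map(_.path)
}
```

This is why the change ripples through the interface: the pruning logic lives in the planner (`DataSourceStrategy`), not in the relation, so the relation cannot simply take over the file listing without the planner losing its hook.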