rdblue commented on a change in pull request #25955: [SPARK-29277][SQL] Add early DSv2 filter and projection pushdown
URL: https://github.com/apache/spark/pull/25955#discussion_r336271320
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########
@@ -55,7 +56,52 @@ case class DataSourceV2Relation(
}
override def computeStats(): Statistics = {
- val scan = newScanBuilder().build()
+ if (Utils.isTesting) {
+ // when testing, throw an exception if this computeStats method is called because stats should
+ // not be accessed before pushing the projection and filters to create a scan. otherwise, the
+ // stats are not accurate because they are based on a full table scan of all columns.
+ throw new UnsupportedOperationException(
+ s"BUG: computeStats called before pushdown on DSv2 relation: $name")
+ } else {
+ // when not testing, return stats because bad stats are better than failing a query
+ newScanBuilder() match {
+ case r: SupportsReportStatistics =>
+ val statistics = r.estimateStatistics()
+ DataSourceV2Relation.transformV2Stats(statistics, None, conf.defaultSizeInBytes)
+ case _ =>
+ Statistics(sizeInBytes = conf.defaultSizeInBytes)
+ }
+ }
+ }
+
+ override def newInstance(): DataSourceV2Relation = {
+ copy(output = output.map(_.newInstance()))
+ }
+}
+
+/**
+ * A logical plan for a DSv2 table with a scan already created.
+ *
+ * This is used in the optimizer to push filters and projection down before conversion to physical
+ * plan. This ensures that the stats that are used by the optimizer account for the filters and
+ * projection that will be pushed down.
+ *
+ * @param table a DSv2 [[Table]]
+ * @param scan a DSv2 [[Scan]]
+ * @param output the output attributes of this relation
+ */
+case class DataSourceV2ScanRelation(
Review comment:
I did this to simplify the code. Places that use `optimizedPlan` now expect
the scan relation.
There was one minor wrinkle. To address the concern about relations used by
DDL commands like `AlterTable` getting modified, those plans no longer list
the relation as a child, so rules are not automatically run on it. As a
result, for those DDL commands `DataSourceV2Relation` remains in the optimized
plan because it is never converted to a scan relation. I think this is the
correct behavior.
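
To illustrate why a relation that is not listed as a child is left alone, here is a minimal sketch using a toy plan tree. Everything in it (`Plan`, `Relation`, `ScanRelation`, `Filter`, `AlterTable`, `pushdown`, `optimize`) is a hypothetical stand-in written for this example, not Spark's actual classes or rules; it only assumes that a tree rewrite descends through `children`.

```scala
// Toy model only: these types are hypothetical stand-ins, NOT Spark's classes.
object PushdownSketch {
  sealed trait Plan {
    def children: Seq[Plan]
    def withChildren(newChildren: Seq[Plan]): Plan

    // Top-down rewrite: apply the rule to this node, then recurse into the
    // children of the result. Nodes not reachable via `children` are never visited.
    def transformDown(rule: PartialFunction[Plan, Plan]): Plan = {
      val afterRule = rule.applyOrElse(this, identity[Plan])
      afterRule.withChildren(afterRule.children.map(_.transformDown(rule)))
    }
  }

  // Relation before pushdown (leaf node).
  case class Relation(name: String) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }

  // Relation after pushdown, with the filters that were pushed into the scan.
  case class ScanRelation(name: String, pushedFilters: Seq[String]) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }

  case class Filter(condition: String, child: Plan) extends Plan {
    def children: Seq[Plan] = Seq(child)
    def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
  }

  // DDL-like command: keeps the relation as a plain field and reports no
  // children, so tree rewrites never reach it.
  case class AlterTable(relation: Relation, change: String) extends Plan {
    def children: Seq[Plan] = Nil
    def withChildren(newChildren: Seq[Plan]): Plan = this
  }

  // Hypothetical pushdown rule: converts relations into scan relations.
  val pushdown: PartialFunction[Plan, Plan] = {
    case Filter(condition, Relation(name)) => ScanRelation(name, Seq(condition))
    case Relation(name) => ScanRelation(name, Nil)
  }

  def optimize(plan: Plan): Plan = plan.transformDown(pushdown)

  def main(args: Array[String]): Unit = {
    println(optimize(Filter("id > 5", Relation("t"))))
    println(optimize(AlterTable(Relation("t"), "add column c")))
  }
}
```

In this sketch the query plan ends up with a `ScanRelation`, while the `AlterTable` plan comes back unchanged and still holds the original `Relation`, which mirrors the behavior described in the comment above.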