Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/20647#discussion_r170162261
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
---
@@ -35,15 +35,14 @@ case class DataSourceV2Relation(
options: Map[String, String],
projection: Seq[AttributeReference],
filters: Option[Seq[Expression]] = None,
- userSpecifiedSchema: Option[StructType] = None) extends LeafNode with
MultiInstanceRelation {
+ userSpecifiedSchema: Option[StructType] = None)
+ extends LeafNode with MultiInstanceRelation with DataSourceV2QueryPlan {
--- End diff --
Sorry I didn't realize you removed the extending of
`DataSourceReaderHolder`, otherwise I would point it out in your PR. In general
the question is, how should we define the equality of data source relation? It
was defined by `output, reader.getClass, filters` before, and your PR changed
it to the default equality of `DataSourceV2Relation` silently.
The major difference is, should `options` take part in the equality? The
answer is obviously yes, like `path`, so I'll add `options` to the equality.
BTW `DataSourceV2QueryPlan` is needed, as there are 3 plans need to
implement explain: `DataSourceV2Relation`, `StreamingDataSourceV2Relation`,
`DataSourceV2ScanExec`.
> This doesn't use the source name if it is named
I missed that, let me add back.
> doesn't indicate that the source is v2
I'll improve it
> it doesn't show the most important part of the scan
Do you mean the path option? First I don't think showing all the options is
a good idea, as it can be a lot. My future plan is to show these standard
options like path, table, etc. Again this is something added silently, there is
no consensus about how to explain a data source v2 relation, my PR tries to
have people focus on this part and have a consensus.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]