GitHub user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20387#discussion_r166030958
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
    @@ -17,17 +17,151 @@
     
     package org.apache.spark.sql.execution.datasources.v2
     
    +import java.util.UUID
    +
    +import scala.collection.JavaConverters._
    +import scala.collection.mutable
    +
    +import org.apache.spark.sql.{AnalysisException, SaveMode}
    +import org.apache.spark.sql.catalyst.TableIdentifier
     import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation
    -import org.apache.spark.sql.catalyst.expressions.AttributeReference
    -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics}
    -import org.apache.spark.sql.sources.v2.reader._
    +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
    +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
    +import org.apache.spark.sql.execution.datasources.DataSourceStrategy
    +import org.apache.spark.sql.sources.{DataSourceRegister, Filter}
    +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport}
    +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics}
    +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
    +import org.apache.spark.sql.types.StructType
     
     case class DataSourceV2Relation(
    -    fullOutput: Seq[AttributeReference],
    -    reader: DataSourceReader)
    -  extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder {
    +    source: DataSourceV2,
    +    options: Map[String, String],
    +    path: Option[String] = None,
    +    table: Option[TableIdentifier] = None,
    --- End diff --
    
    @cloud-fan, sorry if it was not clear: Yes, I have considered it, and I think it is a bad idea. I also sent a note to the dev list about this issue if you want more context. There are two main reasons:
    
    1. Your proposal creates more places that are responsible for creating a `DataSourceOptions` with the right property names. Every place where we have a `TableIdentifier` and want to convert it to a `DataSourceV2Relation` would need to copy the same logic and worry about using the same properties.
        What you propose is hard to maintain and error-prone: what happens if we decide not to pass the database when it is `None` in the `TableIdentifier`? We would have to validate every place that creates a v2 relation. On the other hand, if we pass `TableIdentifier` here, there is one code path that converts (see the sketch after this list). It is also easier for us to pass `TableIdentifier` through to the data sources if we choose to update the API.
    2. There is no reason to use `DataSourceOptions` outside of v2 at this point. This PR doesn't expose the v2-specific options class to other places in the codebase. Instead, it uses a plain map for generic options and, where possible, classes that can be used in pattern matching. Again, this leaves fewer places that create v2-internal classes, which makes maintenance easier.
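
    To make point 1 concrete, here is a minimal sketch of the single conversion path this argues for. The helper name `toV2Options` and the `path`/`database`/`table` property names are illustrative assumptions, not part of this PR's API:

    ```scala
    import scala.collection.JavaConverters._

    import org.apache.spark.sql.catalyst.TableIdentifier
    import org.apache.spark.sql.sources.v2.DataSourceOptions

    object V2Conversions {
      // Hypothetical helper: the one place that knows which property names
      // (assumed here) the v2 options use for the path and table identifier.
      def toV2Options(
          options: Map[String, String],
          path: Option[String],
          table: Option[TableIdentifier]): DataSourceOptions = {
        val withPath = options ++ path.map("path" -> _)
        val withTable = table match {
          case Some(ident) =>
            withPath ++ ident.database.map("database" -> _) + ("table" -> ident.table)
          case None =>
            withPath
        }
        new DataSourceOptions(withTable.asJava)
      }
    }
    ```

    Every caller that has a `TableIdentifier` then builds the relation without repeating the property-name logic, so changing how a `None` database is handled means editing only this method.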
    
    If you want to add those methods to the options class so that implementations can easily access the path and table name, then we can do that in a follow-up PR.
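
    For reference, a rough sketch of what those accessors could look like, written here as an implicit wrapper since the real change would add the methods to `DataSourceOptions` itself; all method and property names below are assumptions:

    ```scala
    import java.util.Optional

    import org.apache.spark.sql.sources.v2.DataSourceOptions

    object DataSourceOptionsAccessors {
      // Stand-in for convenience methods on DataSourceOptions; "path",
      // "table", and "database" are the assumed property names written by
      // the single conversion path described above.
      implicit class RichOptions(val options: DataSourceOptions) extends AnyVal {
        def path: Optional[String] = options.get("path")
        def tableName: Optional[String] = options.get("table")
        def database: Optional[String] = options.get("database")
      }
    }
    ```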


---
