[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166394747 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,151 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, --- End diff -- I'm not saying that `DataSourceOptions` have to be handled in the relation. Just that the relation should use the same classes to pass data, like `TableIdentifier`, that are used by the rest of the planner. I agree with those benefits of doing this. Is there anything that needs to change in this PR? We can move where the options are created in a follow-up, but let me know if you think this would prevent this from getting merged. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166386661 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,151 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, --- End diff -- I think that `TableIdentifier` and a string-to-string `Map` should be passed to `DataSourceV2Relation` and that either the relation or `DataSourceOptions` should be responsible for creating `DataSourceOptions` with well-defined properties to pass the table information to implementations. This minimizes the number of places that need to handle `DataSourceOptions` (which is specific to v2) and uses `TableIdentifier` and `Map` to match the rest of the planner nodes. For example, other read paths that can create `DataSourceV2Relation`, like resolution rules, use `TableIdentifier`. I'm not currently advocating a position on how to configure `DataFrameReader` or `DataFrameWriter` or how to handle schemas. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
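[Editor's note] A minimal sketch of the shape being argued for above: the relation keeps planner-friendly types (`TableIdentifier` and a string map) and is the single place that translates them into the v2-specific options object. The class name and the `database` key are assumptions for the sketch; only the `path` and `table` keys appear in the quoted diff.

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable

import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2}

// Hypothetical relation sketch: planner code passes TableIdentifier/Map, and
// only this class decides how those become DataSourceOptions property names.
case class V2RelationSketch(
    source: DataSourceV2,
    options: Map[String, String],
    path: Option[String] = None,
    table: Option[TableIdentifier] = None) {

  lazy val v2Options: DataSourceOptions = {
    val updated = new mutable.HashMap[String, String]()
    updated ++= options
    path.foreach(p => updated.put("path", p))                    // "path" key as in the diff above
    table.foreach { ident =>
      updated.put("table", ident.table)                          // "table" key as in the diff above
      ident.database.foreach(db => updated.put("database", db))  // assumed key for the database part
    }
    new DataSourceOptions(updated.asJava)
  }
}
```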
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 @cloud-fan, this is a single commit on purpose because predicate push-down makes plan changes. I think it's best to do these at once to avoid unnecessary work. That's why I started looking more closely at push-down in the first place: updating the other push-down code for immutable plans was a mess. I also think it is unlikely that we will need to revert the push-down changes here. If we end up redesigning push-down, then it is unlikely that the easiest starting point is to roll back this fix. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r166381800 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -117,20 +118,43 @@ object DataWritingSparkTask extends Logging { writeTask: DataWriterFactory[InternalRow], context: TaskContext, iter: Iterator[InternalRow]): WriterCommitMessage = { -val dataWriter = writeTask.createDataWriter(context.partitionId(), context.attemptNumber()) +val stageId = context.stageId() +val partId = context.partitionId() +val attemptId = context.attemptNumber() +val dataWriter = writeTask.createDataWriter(partId, attemptId) // write the data and commit this writer. Utils.tryWithSafeFinallyAndFailureCallbacks(block = { iter.foreach(dataWriter.write) - logInfo(s"Writer for partition ${context.partitionId()} is committing.") - val msg = dataWriter.commit() - logInfo(s"Writer for partition ${context.partitionId()} committed.") + + val msg = if (writeTask.useCommitCoordinator) { +val coordinator = SparkEnv.get.outputCommitCoordinator --- End diff -- What do you have in mind to "introduce the concept"? I'm happy to add more docs. Do you want me to add them to this PR or in a follow-up? Are you targeting this for 2.3.0? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Support commit coordinator fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20490#discussion_r166360278 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -117,20 +118,43 @@ object DataWritingSparkTask extends Logging { writeTask: DataWriterFactory[InternalRow], context: TaskContext, iter: Iterator[InternalRow]): WriterCommitMessage = { -val dataWriter = writeTask.createDataWriter(context.partitionId(), context.attemptNumber()) +val stageId = context.stageId() +val partId = context.partitionId() +val attemptId = context.attemptNumber() +val dataWriter = writeTask.createDataWriter(partId, attemptId) // write the data and commit this writer. Utils.tryWithSafeFinallyAndFailureCallbacks(block = { iter.foreach(dataWriter.write) - logInfo(s"Writer for partition ${context.partitionId()} is committing.") - val msg = dataWriter.commit() - logInfo(s"Writer for partition ${context.partitionId()} committed.") + + val msg = if (writeTask.useCommitCoordinator) { +val coordinator = SparkEnv.get.outputCommitCoordinator --- End diff -- The API is flexible. The problem is that it defaults to no coordination, which causes correctness bugs. The safe option is to coordinate commits by default. If an implementation doesn't change the default, then it at least won't duplicate task outputs in job commit. Worst case is that it takes a little longer for committers that don't need coordination. On the other hand, not making this the default will cause some writers to work most of the time, but duplicate data in some cases. What do you think is the downside to adding this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
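[Editor's note] To make the coordination concrete, a rough sketch of the pattern the diff above introduces, written as a standalone helper rather than the PR's actual code. It assumes the three-argument `canCommit` signature of this era and Spark-internal access to `SparkEnv.get.outputCommitCoordinator`; the real task would raise a dedicated commit-denied failure rather than a plain exception.

```scala
import org.apache.spark.{SparkEnv, TaskContext}

// Sketch only: `commit` and `abort` stand in for DataWriter#commit/abort.
def commitWithCoordination(
    context: TaskContext,
    useCommitCoordinator: Boolean)(commit: () => Unit, abort: () => Unit): Unit = {
  val stageId = context.stageId()
  val partId = context.partitionId()
  val attemptId = context.attemptNumber()
  if (useCommitCoordinator) {
    // Ask the driver-side coordinator whether this attempt may commit. Only one
    // attempt per partition is authorized, so a speculative or retried task
    // cannot commit a duplicate copy of the same partition's output.
    val coordinator = SparkEnv.get.outputCommitCoordinator
    if (coordinator.canCommit(stageId, partId, attemptId)) {
      commit()
    } else {
      // another attempt was authorized to commit; give up cleanly
      abort()
      throw new RuntimeException(
        s"Commit denied for partition $partId (stage $stageId, attempt $attemptId)")
    }
  } else {
    // sources that opt out keep the old behavior: every attempt commits
    commit()
  }
}
```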
[GitHub] spark pull request #20495: [SPARK-23327] [SQL] Update the description and te...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20495#discussion_r166358420 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -1655,15 +1655,17 @@ case class Left(str: Expression, len: Expression, child: Expression) extends Run */ // scalastyle:off line.size.limit @ExpressionDescription( - usage = "_FUNC_(expr) - Returns the character length of `expr` or number of bytes in binary data.", + usage = "_FUNC_(expr) - Returns the character length of `expr` or number of bytes in binary data. " + --- End diff -- +1 for string data / binary data --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166165255 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,151 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, +projection: Option[Seq[AttributeReference]] = None, +filters: Option[Seq[Expression]] = None, +userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation { + + override def simpleString: String = { +"DataSourceV2Relation(" + + s"source=$sourceName${path.orElse(table).map(loc => s"($loc)").getOrElse("")}, " + + s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " + + s"filters=[${pushedFilters.mkString(", ")}] options=$options)" + } + + override lazy val schema: StructType = reader.readSchema() + + override lazy val output: Seq[AttributeReference] = { +projection match { + case Some(attrs) => +// use the projection attributes to avoid assigning new ids. fields that are not projected +// will be assigned new ids, which is okay because they are not projected. +val attrMap = attrs.map(a => a.name -> a).toMap +schema.map(f => attrMap.getOrElse(f.name, + AttributeReference(f.name, f.dataType, f.nullable, f.metadata)())) + case _ => +schema.toAttributes +} + } + + private lazy val v2Options: DataSourceOptions = { +// ensure path and table options are set correctly +val updatedOptions = new mutable.HashMap[String, String] +updatedOptions ++= options + +path match { + case Some(p) => +updatedOptions.put("path", p) + case None => +updatedOptions.remove("path") +} + +table.map { ident => + updatedOptions.put("table", ident.table) --- End diff -- Opened [SPARK-23341](https://issues.apache.org/jira/browse/SPARK-23341) for this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166165005 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,151 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, +projection: Option[Seq[AttributeReference]] = None, +filters: Option[Seq[Expression]] = None, +userSchema: Option[StructType] = None) extends LeafNode with MultiInstanceRelation { + + override def simpleString: String = { +"DataSourceV2Relation(" + + s"source=$sourceName${path.orElse(table).map(loc => s"($loc)").getOrElse("")}, " + + s"schema=[${output.map(a => s"$a ${a.dataType.simpleString}").mkString(", ")}], " + + s"filters=[${pushedFilters.mkString(", ")}] options=$options)" + } + + override lazy val schema: StructType = reader.readSchema() + + override lazy val output: Seq[AttributeReference] = { +projection match { + case Some(attrs) => +// use the projection attributes to avoid assigning new ids. fields that are not projected +// will be assigned new ids, which is okay because they are not projected. +val attrMap = attrs.map(a => a.name -> a).toMap +schema.map(f => attrMap.getOrElse(f.name, + AttributeReference(f.name, f.dataType, f.nullable, f.metadata)())) + case _ => +schema.toAttributes +} + } + + private lazy val v2Options: DataSourceOptions = { +// ensure path and table options are set correctly +val updatedOptions = new mutable.HashMap[String, String] +updatedOptions ++= options + +path match { + case Some(p) => +updatedOptions.put("path", p) + case None => +updatedOptions.remove("path") +} + +table.map { ident => + updatedOptions.put("table", ident.table) --- End diff -- I think we agree here. I want to avoid doing this outside of either `DataSourceOptions` or `DataSourceV2Relation`. If we can create `DataSourceOptions` from a `Option[TableIdentifier]` and add the `getTable` accessor, then that works for me. My main motivation is to avoid having this piece of code copied throughout the SQL planner. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20490: [SPARK-23323][SQL]: Support commit coordinator for DataS...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20490 I've updated this to no longer require #20387. It wasn't relying on those changes at all. @gatorsmile, @cloud-fan, what do you think about getting this into 2.3.0? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166081367 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,151 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, --- End diff -- I should clarify this as well: I like the idea of standardizing how implementations access the table name from `DataSourceOptions`. However, I don't think that is sufficient to remove the table and path here in the definition of `DataSourceV2Relation` for the reasons above. I think we *should* follow this commit with a plan for how implementations access these options. For now, it is good to put the creation of those options in a single place. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387

> For safety, I wanna keep it unchanged, and start something new for data source v2 only.

I disagree.

* **#20476 addresses a bug caused by the new implementation that is not a problem if we reuse the current push-down code.** Using an entirely new implementation to push filters and projection is going to introduce bugs, and that problem demonstrates that it is a real risk.
* **Using unreliable push-down code is going to make it more difficult for anyone to use the v2 API.**
* **This approach throws away work that has accumulated over the past few years that give us confidence in the current push-down code.** The other code paths have push-down tests that will help us catch bugs in the new push-down logic. If we limit the scope of this change to v2, we will not be able to reuse those tests and will have to write entirely new ones that cover all cases.

Lastly, I think it is clear that we need a design for a new push-down mechanism. **Adding this to DataSourceV2 as feature creep is not a good way to redesign it.** I'd like to see a design document that addresses some of the open questions. I'd also prefer that this new implementation be removed from the v2 code path for 2.3.0.

@marmbrus, what do you think?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable l...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r166030958 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,17 +17,151 @@ package org.apache.spark.sql.execution.datasources.v2 +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.MultiInstanceRelation -import org.apache.spark.sql.catalyst.expressions.AttributeReference -import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} -import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} +import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema, WriteSupport} +import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsPushDownCatalystFilters, SupportsPushDownFilters, SupportsPushDownRequiredColumns, SupportsReportStatistics} +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) - extends LeafNode with MultiInstanceRelation with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, --- End diff -- @cloud-fan, sorry if it was not clear: Yes, I have considered it and I think it is a bad idea. I sent a note to the dev list about this issue, as well if you want more context. There are two main reasons: 1. Your proposal creates more places that are responsible for creating a `DataSourceOptions` with the right property names. All of the places where we have a `TableIdentifier` and want to convert to a `DataSourceV2Relation` need to copy the same logic and worry about using the same properties. What you propose is hard to maintain and error prone: what happens if we decide not to pass the database if it is `None` in a `TableIdentifier`? We would have to validate every place that creates a v2 relation. On the other hand, if we pass `TableIdentifier` here, we have one code path that converts. It is also easier for us to pass `TableIdentifier` to the data sources if we choose to update the API. 2. There is no reason to use `DataSourceOptions` outside of v2 at this point. This PR doesn't expose the v2-specific options class to other places in the codebase. Instead, it uses a map for generic options and classes that can be used in pattern matching where possible. And again, this has fewer places that create v2 internal classes, which is easier for maintenance. If you want to add those methods to the options class so that implementations can easily access path and table name, then we can do that in a follow-up PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387

> Why pushdown is happening in logical optimization and not during query planning. My first instinct would be to have the optimizer get operators as close to the leaves as possible and then fuse (or push down) as we convert to physical plan. I'm probably missing something.

I think there are two reasons, but I'm not fully convinced by either one:

* [`computeStats`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L232) is defined on logical plans, so the result of filter push-down needs to be a logical plan if we want to be able to use accurate stats for a scan. I'm interested here to ensure that we correctly produce broadcast relations based on the actual scan stats, not the table-level stats. Maybe there's another way to do this?
* One of the tests for DSv2 ends up invoking the push-down rule twice, which made me think about whether or not that should be valid. I think it probably should be. For example, what if a plan has nodes that can all be pushed, but they aren't in the right order? Or what if a projection wasn't pushed through a filter because of a rule problem, but it can still be pushed down? Incremental fusing during optimization might be an extensible way to handle odd cases, or it may be useless. I'm not quite sure yet.

It would be great to hear your perspective on these.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
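[Editor's note] A small sketch of the first point above: if push-down produces a logical relation whose reader already reflects the pushed filters, the plan's statistics (and hence the broadcast-join decision) can come from the reader rather than from table-level metadata. This mirrors the `computeStats` override in the relation diff quoted earlier in this thread; the standalone-helper form is only for illustration.

```scala
import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, SupportsReportStatistics}

// The reader is created from the relation's pushed filters/projection, so a
// source that reports statistics can return a post-push-down size here.
def scanStats(reader: DataSourceReader, defaultSizeInBytes: Long): Statistics = reader match {
  case r: SupportsReportStatistics =>
    Statistics(sizeInBytes = r.getStatistics.sizeInBytes().orElse(defaultSizeInBytes))
  case _ =>
    Statistics(sizeInBytes = defaultSizeInBytes)
}
```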
[GitHub] spark issue #20490: [SPARK-23323][SQL]: Add support for commit coordinator f...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20490 @dongjoon-hyun, @cloud-fan, @gatorsmile. Once the immutable plan PR is in, this can be reviewed. @steveloughran, I think this is what you were asking for. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20490: [SPARK-23323][SQL]: Add support for commit coordi...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/20490 [SPARK-23323][SQL]: Add support for commit coordinator for DataSourceV2 writes ## What changes were proposed in this pull request? DataSourceV2 batch writes should use the output commit coordinator if it is required by the data source. This adds a new method, `DataWriterFactory#useCommitCoordinator`, that determines whether the coordinator will be used. If the write factory returns true, `WriteToDataSourceV2` will use the coordinator for batch writes. This relies on the commits in #20387. Once that is committed, this will be rebased. Only the last commit is part of this PR. ## How was this patch tested? This relies on existing write tests, which now use the commit coordinator. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rdblue/spark SPARK-23323-add-commit-coordinator Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20490.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20490 commit 62c569672083c0fa633da1d6edaba40d0bb05819 Author: Ryan Blue <blue@...> Date: 2018-01-17T21:58:12Z SPARK-22386: DataSourceV2: Use immutable logical plans. commit f0bd45d3c931941b8092cdac738cb29954e0acdd Author: Ryan Blue <blue@...> Date: 2018-01-24T19:34:42Z SPARK-23203: Fix scala style check. commit 2fdeb4556cd22a092630b341a22a16a59e377183 Author: Ryan Blue <blue@...> Date: 2018-01-24T19:54:10Z SPARK-23203: Fix Kafka tests, use StreamingDataSourceV2Relation. This also removes unused imports. commit ab945a19efe666c41deae9c044002f3455220c1d Author: Ryan Blue <blue@...> Date: 2018-02-02T20:30:33Z SPARK-23204: DataFrameReader: Remove v2 table identifier parsing. commit f1d9872a2699cdbd5c87b02e702dc8103335131d Author: Ryan Blue <blue@...> Date: 2018-02-02T21:48:29Z SPARK-23203: Remove import changes from DataSourceV2Utils. commit 288af6a2729c769e0d4075a8f9190958ab5a211c Author: Ryan Blue <blue@...> Date: 2018-02-02T22:21:48Z SPARK-23323: DataSourceV2: support commit coordinator. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20488: [SPARK-23321][SQL]: Validate datasource v2 writes
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/20488 [SPARK-23321][SQL]: Validate datasource v2 writes ## What changes were proposed in this pull request? DataSourceV2 does not currently apply any validation rules when writing. Other write paths attempt to validate that a data frame can be written to a target table or path and these changes add the same logic to v2. This updates the logical plan to use InsertIntoTable and applies the insert preprocess rules to writes. It also adds a conversion rule from InsertIntoTable to DataSourceV2Write because InsertIntoTable cannot be used in logical plans after analysis. InsertIntoTable is not necessarily the right logical plan. It assumes that the table exists and can report its schema. ## How was this patch tested? Added a test that fails analysis in the preprocess rule. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rdblue/spark SPARK-23321-validate-datasource-v2-writes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20488.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20488 commit 62c569672083c0fa633da1d6edaba40d0bb05819 Author: Ryan Blue <blue@...> Date: 2018-01-17T21:58:12Z SPARK-22386: DataSourceV2: Use immutable logical plans. commit f0bd45d3c931941b8092cdac738cb29954e0acdd Author: Ryan Blue <blue@...> Date: 2018-01-24T19:34:42Z SPARK-23203: Fix scala style check. commit 2fdeb4556cd22a092630b341a22a16a59e377183 Author: Ryan Blue <blue@...> Date: 2018-01-24T19:54:10Z SPARK-23203: Fix Kafka tests, use StreamingDataSourceV2Relation. This also removes unused imports. commit ab945a19efe666c41deae9c044002f3455220c1d Author: Ryan Blue <blue@...> Date: 2018-02-02T20:30:33Z SPARK-23204: DataFrameReader: Remove v2 table identifier parsing. commit 3580daf15497a1d49112a0eddd556f74b9b3e280 Author: Ryan Blue <blue@...> Date: 2018-02-02T19:04:23Z SPARK-23321: Apply preprocess insert rules to DataSourceV2. This updates the DataSourceV2 write path to use DataSourceV2Relation and InsertIntoTable to apply the insert preprocess rules. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SQL]: DataSourceV2: Use immutable logical ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 @cloud-fan, @dongjoon-hyun, @gatorsmile, I've rebased this on master and removed the support for SPARK-23204 that parses table identifiers. If you need other changes to get this in, let me know. As far as I'm aware, this isn't targeting 2.3.0 so it makes sense to keep the `PhysicalOperation` push-down rules. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20477: [SPARK-23303][SQL] improve the explain result for...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20477#discussion_r165728696 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala --- @@ -36,11 +38,14 @@ import org.apache.spark.sql.types.StructType */ case class DataSourceV2ScanExec( fullOutput: Seq[AttributeReference], -@transient reader: DataSourceReader) +@transient reader: DataSourceReader, +@transient sourceClass: Class[_ <: DataSourceV2]) extends LeafExecNode with DataSourceReaderHolder with ColumnarBatchScan { override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2ScanExec] + override def simpleString: String = s"Scan $metadataString" --- End diff -- +1 for overriding nodeName. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20485: [SPARK-23315][SQL] failed to get output from canonicaliz...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20485 Sounds fine to me, then. My focus is on the long-term design issues. I still think that the changes to make plans immutable and to use the existing push-down code as much as possible are the best way to get a reliable 2.3.0, but it is fine if they don't make the release. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 @cloud-fan, I'll update this PR and we can talk about passing configuration on the dev list. And as a reminder, please close #20445. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387

> I tried and can't figure out how to do it with PhysicalOperation, that's why I build something new for data source v2 pushdown.

The problem is that we should get DSv2 working independently of a redesign of the push-down rules. Throwing an untested push-down rule into changes for DSv2 makes the new API less reliable, and hurts people that want to try it out and start using it. There is no benefit to doing this for 2.3.0. I also think a redesign of push-down should be properly designed, thought out, and tested. I'm all for fixing this if you can make the case that we need to, but we shouldn't needlessly mix together major changes.

@cloud-fan, There's more discussion about this on #20476 that I encourage you to read.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20485: [SPARK-23315][SQL] failed to get output from canonicaliz...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20485 To be clear, the purpose of this commit, like #20476, is just to get something working for the 2.3.0 release? I just want to make sure since I think we should be approaching these problems with a better initial design for the integration. I'm fine getting this in to unblock a release, but if it isn't for that purpose then I think we should fix the design problems first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20476: [SPARK-23301][SQL] data source column pruning should wor...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20476 Yeah, I did review it, but at the time I wasn't familiar with how the other code paths worked and assumed that it was necessary to introduce this. I wasn't very familiar with how it *should* work, so I didn't +1 it. There are a few telling comments though: > How do we know that there aren't more cases that need to be supported? > What are the guarantees made by the previous batches in the optimizer? The work done by FilterAndProject seems redundant to me because the optimizer should already push filters below projection. Is that not guaranteed by the time this runs? In any case, I now think that we should not introduce a new push-down design in conjunction with DSv2. Let's get DSv2 working properly and redesign push-down separately. In parallel is fine by me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20476: [SPARK-23301][SQL] data source column pruning should wor...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20476

@gatorsmile, Do you mean this?

> Extensibility is not good and operator push-down capabilities are limited.

If so, that's very open to interpretation. I would assume it means that the V2 interfaces should support more than just projection and filter push-down, but not a redesign of how push-down happens in the optimizer. Even if it is called out as a goal, I now see it as a misguided choice.

But either way, you make a good point about changing things for a release. I'll defer to your judgement about what should be done for the release. But for the long term, I think this issue underscores my point about reusing code that already works. Let's separate DSv2 from a push-down redesign and get it working reliably without introducing more risk and unknown problems.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20476: [SPARK-23301][SQL] data source column pruning should wor...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20476 @gatorsmile, thanks for the context. If we need to redesign push-down, then I think we should do that separately and with a design plan. **I don't think it's a good idea to bundle it into an unrelated API update.** For one thing, we want to be able to use the existing tests for the redesigned push-down strategy, not reimplement them in pieces. We also don't want to conflate the two changes for early adopters of the new API. V2 should be as reliable as possible by minimizing new behavior. This just isn't the right place to test out experimental designs for push-down operations. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20466: [SPARK-23293][SQL] fix data source v2 self join
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20466 +1 Good to get this in before changes to the relation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20476: [SPARK-23301][SQL] data source column pruning should wor...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20476

@cloud-fan, @gatorsmile, this PR demonstrates why we should use PhysicalOperation. I ported the tests from this PR over to our branch and they pass without modifying the push-down code. That's because it reuses code that we already trust. I see no benefit to using a brand new code path for push-down when we can use what is already well tested.

I know you want to push other operations, but I've already raised concerns about the design of this new code: it is brittle because it requires matching specific plan nodes. Push-down should work as it always has: by pushing nodes that are adjacent to relations in the logical plan and relying on the optimizer to push projections and filters down as far as possible. The separation of concerns into simple rules is fundamental to the design of the optimizer. I don't think there is a good argument for new code that breaks how the optimizer is intended to work.

cc @henryr, who might want to chime in.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
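[Editor's note] For readers following along, the `PhysicalOperation`-based approach being advocated looks roughly like the sketch below during physical planning: the extractor collects the projections and filters that the optimizer has already placed next to the relation, so no v2-specific plan walking is needed. The strategy name and the elided scan construction are placeholders, not this PR's code.

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

object V2ScanStrategySketch extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // PhysicalOperation gathers adjacent Project/Filter nodes for us, the same
    // way the existing file-source and JDBC strategies find work to push down.
    case PhysicalOperation(project, filters, relation: DataSourceV2Relation) =>
      // push `filters` into the reader, prune columns to what `project` needs,
      // then build the physical scan node (construction elided in this sketch)
      Nil
    case _ => Nil
  }
}
```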
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 @dongjoon-hyun, @gatorsmile, could you guys weigh in on some of this discussion? I'd like to get additional perspectives on the changes I'm proposing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387

> Let's keep it general and let the data source to interprete it.

I think this is the wrong approach. The reason why we are using a special `DataSourceOptions` object is to ensure that data sources consistently ignore case when reading **their own options**. Consistency across data sources matters and we should be pushing for more consistency, not less.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
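[Editor's note] A concrete illustration of the case-insensitivity point, assuming the 2.3-era `DataSourceOptions` constructor and `get` signatures:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.v2.DataSourceOptions

// Keys are normalized inside DataSourceOptions, so a source reading its own
// options gets case-insensitive lookups that a plain Map would not guarantee.
val opts = new DataSourceOptions(Map("PATH" -> "/data/events").asJava)
assert(opts.get("path").get() == "/data/events")
```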
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 @cloud-fan, to your point about push-down order, I'm not saying that order doesn't matter at all, I'm saying that the push-down can run more than once and it should push the closest operators. That way, if you have a situation where operators can't be reordered but they can all be pushed, they all get pushed through multiple runs of the rule, each one further refining the relation. If we do it this way, then we don't need to traverse the logical plan to find out what to push down. We continue pushing projections until the plan stops changing. This is how the rest of the optimizer works, so I think it is a better approach from a design standpoint. My implementation also reuses more existing code that we have higher confidence in, which is a good thing. We can add things like limit pushdown later, by adding it properly to the existing code. I don't see a compelling reason to toss out the existing implementation, especially without the same level of testing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
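[Editor's note] To illustrate the "run until the plan stops changing" idea using the relation fields from the diffs quoted earlier, a hypothetical fixed-point rule might fold the filter sitting directly on the relation into the relation's pushed state on each pass. This is a sketch of the approach, not the rule in this PR; the name and the guard are invented.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

object PushClosestFilterSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Only handle the Filter immediately above the relation; repeated runs of
    // the batch pick up whatever the optimizer moves next to it later.
    case filter @ Filter(condition, relation: DataSourceV2Relation)
        if !relation.filters.exists(_.contains(condition)) =>
      val pushed = relation.filters.getOrElse(Seq.empty) :+ condition
      // Copying the relation (rather than mutating a reader) is what keeps the
      // plan immutable; the Filter stays on top in case the source only
      // evaluates the predicate partially.
      filter.copy(child = relation.copy(filters = Some(pushed)))
  }
}
```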
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387

> `spark.read.format("iceberg").table("db.table").load()`

I'm fine with this if you think it is confusing to parse the path as a table name in load. I think it is reasonable. I'd still like to keep the `Option[TableIdentifier]` parameter on the relation, so that we can support `table` or `insertInto` on the write path.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 @felixcheung, yes, we do already have a `table` option. That creates an `UnresolvedRelation` with the parsed table name as a `TableIdentifier`, which is not currently compatible with `DataSourceV2` because there is no standard way to pass the identifier's db and table name. Part of the intent here is to add support in `DataSourceV2Relation` for cases where we have a `TableIdentifier`, so that we can add a resolver rule that replaces `UnresolvedRelation` with `DataSourceV2Relation`. This is what we do in our Spark branch. @cloud-fan, what is your objection to support like this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
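[Editor's note] The resolver-rule idea described here could look roughly like the following. `lookupSource` is an invented stand-in for whatever catalog lookup decides that an identifier maps to a v2 source; the point is that the rule hands the `TableIdentifier` to the relation directly instead of going through string options.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
import org.apache.spark.sql.sources.v2.DataSourceV2

class ResolveV2TablesSketch(lookupSource: TableIdentifier => Option[DataSourceV2])
    extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case u @ UnresolvedRelation(ident) =>
      // If the identifier belongs to a v2 source, build the relation from the
      // TableIdentifier; otherwise leave the node for other resolution rules.
      lookupSource(ident)
        .map(source => DataSourceV2Relation(source, options = Map.empty, table = Some(ident)))
        .getOrElse(u)
  }
}
```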
[GitHub] spark issue #20454: [SPARK-23202][SQL] Add new API in DataSourceWriter: onDa...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20454 +1 I'd rather not add features without a known use case, but this implementation looks good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20454: [SPARK-23202][SQL] Add new API in DataSourceWrite...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20454#discussion_r165141464 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java --- @@ -62,6 +62,15 @@ */ DataWriterFactory createWriterFactory(); + /** + * Handles a commit message on receiving from a successful data writer. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination + * is undefined and {@link #abort(WriterCommitMessage[])} may not be able to deal with it. --- End diff -- What does it mean that "the state of the destination is undefined"? I think it is sufficient to say that `abort` will be called and the contract for aborting commits applies. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r165138574 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. * * To support exactly-once processing, writer implementations should ensure that this method is * idempotent. The execution engine may call commit() multiple times for the same epoch --- End diff -- If that's the case, then this interface should be clear about it instead of including wording about exactly-once. For this interface, there is no exactly-once guarantee. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20448: [SPARK-23203][SQL] make DataSourceV2Relation immutable
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20448 @cloud-fan, **please close this PR**. There is already a pull request for these changes, #20387, and ongoing discussion there. If you want the proposed implementation to change, please ask for changes in a review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20448: [SPARK-23203][SQL] make DataSourceV2Relation immu...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20448#discussion_r165132718 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,36 +17,84 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, ReadSupportWithSchema} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.types.StructType +/** + * A logical plan representing a data source relation, which will be planned to a data scan + * operator finally. + * + * @param output The output of this relation. + * @param source The instance of a data source v2 implementation. + * @param options The options specified for this scan, used to create the `DataSourceReader`. + * @param userSpecifiedSchema The user specified schema, used to create the `DataSourceReader`. + * @param filters The predicates which are pushed and handled by this data source. + * @param existingReader A mutable reader carrying some temporary stats during optimization and + * planning. It's always None before optimization, and does not take part in + * the equality of this plan, which means this plan is still immutable. + */ case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends LeafNode with DataSourceReaderHolder { +output: Seq[AttributeReference], +source: DataSourceV2, +options: DataSourceOptions, +userSpecifiedSchema: Option[StructType], +filters: Set[Expression], +existingReader: Option[DataSourceReader]) extends LeafNode with DataSourceV2QueryPlan { + + override def references: AttributeSet = AttributeSet.empty + + override def sourceClass: Class[_ <: DataSourceV2] = source.getClass override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2Relation] + def reader: DataSourceReader = existingReader.getOrElse { +(source, userSpecifiedSchema) match { + case (ds: ReadSupportWithSchema, Some(schema)) => +ds.createReader(schema, options) + + case (ds: ReadSupport, None) => +ds.createReader(options) + + case (ds: ReadSupport, Some(schema)) => +val reader = ds.createReader(options) +// Sanity check, this should be guaranteed by `DataFrameReader.load` +assert(reader.readSchema() == schema) +reader + + case _ => throw new IllegalStateException() +} + } + override def computeStats(): Statistics = reader match { case r: SupportsReportStatistics => Statistics(sizeInBytes = r.getStatistics.sizeInBytes().orElse(conf.defaultSizeInBytes)) case _ => Statistics(sizeInBytes = conf.defaultSizeInBytes) } + + override def simpleString: String = s"Relation $metadataString" } /** * A specialization of DataSourceV2Relation with the streaming bit set to true. Otherwise identical * to the non-streaming relation. */ -class StreamingDataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceReader) extends DataSourceV2Relation(fullOutput, reader) { +case class StreamingDataSourceV2Relation( --- End diff -- Agreed. That was the plan here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r165131292 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. * * To support exactly-once processing, writer implementations should ensure that this method is * idempotent. The execution engine may call commit() multiple times for the same epoch --- End diff -- For a commit interface, I expect the guarantee to be that data is committed exactly once. If commits are idempotent, data may be reprocessed, and commits may happen more than once, then that is not an exactly-once commit: that is an at-least-once commit. I'm not trying to split hairs. My point is that if there's no difference in behavior between exactly-once and at-least-once because the commit must be idempotent, then you don't actually have a exactly-once guarantee. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r165121965 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. * * To support exactly-once processing, writer implementations should ensure that this method is * idempotent. The execution engine may call commit() multiple times for the same epoch --- End diff -- Thanks for this explanation, I think I see what you're saying. But I think your statement that refers to "true" exactly-once gives away the fact that this does not provide exactly-once semantics. Maybe this is a question for the dev list: why the weaker version? Shouldn't this API provide a check to see whether the data was already committed? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
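[Editor's note] For what the requested "check whether the data was already committed" behavior could look like, here is a sketch with an invented `EpochLog` abstraction standing in for whatever transaction log a sink keeps; the point is that a replayed commit for an already-committed epoch becomes a no-op.

```scala
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage

// Invented abstraction for the sink's own metadata/transaction log.
trait EpochLog {
  def isCommitted(epochId: Long): Boolean
  def commit(epochId: Long, messages: Seq[WriterCommitMessage]): Unit
}

// Idempotent epoch commit: calling this twice for the same epoch cannot
// duplicate data, because the second call sees the epoch as committed.
def commitEpoch(log: EpochLog, epochId: Long, messages: Seq[WriterCommitMessage]): Unit = {
  if (!log.isCommitted(epochId)) {
    log.commit(epochId, messages)
  }
}
```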
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r165119560 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java --- @@ -63,32 +68,42 @@ DataWriterFactory createWriterFactory(); /** - * Commits this writing job with a list of commit messages. The commit messages are collected from - * successful data writers and are produced by {@link DataWriter#commit()}. + * Handles a commit message which is collected from a successful data writer. + * + * Note that, implementations might need to cache all commit messages before calling + * {@link #commit()} or {@link #abort()}. * * If this method fails (by throwing an exception), this writing job is considered to to have been - * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination - * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it. + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. + */ + void add(WriterCommitMessage message); --- End diff -- +1 for separating and using another PR. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r165119427 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. --- End diff -- Passing the messages to commit and abort seems simpler and better to me, but that's for the batch side. And, we shouldn't move forward with this unless there's a use case. As for the docs here, what is an implementer intended to understand as a result of this? "The number of data partitions to write" is also misleading: weren't these already written and committed by tasks? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r165117779 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. * * To support exactly-once processing, writer implementations should ensure that this method is * idempotent. The execution engine may call commit() multiple times for the same epoch * in some circumstances. */ - void commit(long epochId, WriterCommitMessage[] messages); + void commit(long epochId); /** - * Aborts this writing job because some data writers are failed and keep failing when retry, or - * the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])} fails. + * Aborts this writing job because some data writers are failed and keep failing when retry, + * or the Spark job fails with some unknown reasons, + * or {@link #commit()} / {@link #add(WriterCommitMessage)} fails * * If this method fails (by throwing an exception), the underlying data source may require manual * cleanup. * - * Unless the abort is triggered by the failure of commit, the given messages should have some - * null slots as there maybe only a few data writers that are committed before the abort - * happens, or some data writers were committed but their commit messages haven't reached the - * driver when the abort is triggered. So this is just a "best effort" for data sources to - * clean up the data left by data writers. + * Unless the abort is triggered by the failure of commit, the number of commit + * messages added by {@link #add(WriterCommitMessage)} should be smaller than the number + * of input data partitions, as there may be only a few data writers that are committed + * before the abort happens, or some data writers were committed but their commit messages + * haven't reached the driver when the abort is triggered. So this is just a "best effort" --- End diff -- Best effort is not just how we describe the behavior, it is a requirement of the contract. Spark should not drop commit messages because it is convenient. Spark knows what tasks succeeded and failed and which ones were authorized to commit. That's enough information to provide the best-effort guarantee. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20386 > I assume this API is necessary . . . it sounds reasonable to provide a callback for task commit. I agree it sounds reasonable, but we shouldn't add methods to a new API blindly and without a use case. The point of a new API, at least in part, is to improve on the old one. If it is never used, then we are carrying support for something that is useless. On the other hand, if it is used we should know what it is needed for so we can design for the use case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20386 @gengliangwang, what is the use case supported by this? In other words, how is `onTaskCommit(taskCommit: TaskCommitMessage)` currently used that requires this change? In general, I'm more concerned with the batch side and I don't have a huge problem with this change. I do want to make sure it is in support of a valid use case. I'd also rather separate the batch and streaming committer APIs because they have so little in common. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164909970 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java --- @@ -63,32 +68,42 @@ DataWriterFactory createWriterFactory(); /** - * Commits this writing job with a list of commit messages. The commit messages are collected from - * successful data writers and are produced by {@link DataWriter#commit()}. + * Handles a commit message which is collected from a successful data writer. + * + * Note that, implementations might need to cache all commit messages before calling + * {@link #commit()} or {@link #abort()}. * * If this method fails (by throwing an exception), this writing job is considered to to have been - * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination - * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it. + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. + */ + void add(WriterCommitMessage message); --- End diff -- This is the only method shared between the stream and batch writers. Why does the streaming interface extend this one? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164909653 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. * * To support exactly-once processing, writer implementations should ensure that this method is * idempotent. The execution engine may call commit() multiple times for the same epoch --- End diff -- I realize this isn't part of this commit, but why would an exactly-once guarantee require idempotent commits? Processing the same data twice with an idempotent guarantee is not the same thing as exactly-once. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164909225 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. * * To support exactly-once processing, writer implementations should ensure that this method is * idempotent. The execution engine may call commit() multiple times for the same epoch * in some circumstances. */ - void commit(long epochId, WriterCommitMessage[] messages); + void commit(long epochId); /** - * Aborts this writing job because some data writers are failed and keep failing when retry, or - * the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])} fails. + * Aborts this writing job because some data writers are failed and keep failing when retry, + * or the Spark job fails with some unknown reasons, + * or {@link #commit()} / {@link #add(WriterCommitMessage)} fails * * If this method fails (by throwing an exception), the underlying data source may require manual * cleanup. * - * Unless the abort is triggered by the failure of commit, the given messages should have some - * null slots as there maybe only a few data writers that are committed before the abort - * happens, or some data writers were committed but their commit messages haven't reached the - * driver when the abort is triggered. So this is just a "best effort" for data sources to - * clean up the data left by data writers. + * Unless the abort is triggered by the failure of commit, the number of commit + * messages added by {@link #add(WriterCommitMessage)} should be smaller than the number + * of input data partitions, as there may be only a few data writers that are committed + * before the abort happens, or some data writers were committed but their commit messages + * haven't reached the driver when the abort is triggered. So this is just a "best effort" --- End diff -- Commit messages in flight should be handled and aborted. Otherwise, this isn't a "best effort". Best effort means that Spark does everything that is feasible to ensure that commit messages are added before aborting, and that should include race conditions from RPC. The case where "best effort" might miss a message is if the message is created, but a node fails before it is sent to the driver. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164908529 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceWriter.java --- @@ -63,32 +68,42 @@ DataWriterFactory createWriterFactory(); /** - * Commits this writing job with a list of commit messages. The commit messages are collected from - * successful data writers and are produced by {@link DataWriter#commit()}. + * Handles a commit message which is collected from a successful data writer. + * + * Note that, implementations might need to cache all commit messages before calling + * {@link #commit()} or {@link #abort()}. --- End diff -- In what case would an implementation not cache and commit all at once? What is the point of a commit if not to make sure all of the data shows up at the same time? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
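For illustration of the cache-then-commit pattern this question is probing, below is a hedged sketch of a writer built against the proposed `add()`/`commit()` split. The message type and the publish/clean-up steps are stand-ins, not the real v2 interfaces.

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of a writer shaped around the proposed add()/commit() split. The message
// type parameter and the publish/clean-up helpers are stand-ins for illustration.
class BufferingWriter[Msg] {
  // Messages must be buffered: nothing should become visible before commit().
  private val pending = ArrayBuffer.empty[Msg]

  def add(message: Msg): Unit = pending += message

  def commit(): Unit = {
    // All buffered task outputs are published together, so readers either see
    // the whole job's data or none of it.
    publishAll(pending.toSeq)
  }

  def abort(): Unit = {
    // Best effort: clean up whatever the buffered messages point at.
    cleanUp(pending.toSeq)
  }

  private def publishAll(messages: Seq[Msg]): Unit = () // sink-specific
  private def cleanUp(messages: Seq[Msg]): Unit = ()    // sink-specific
}
```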
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164907626 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. --- End diff -- What does this mean? It isn't clear to me what "the number of input partitions" means, or why it isn't obvious that it is equal to the number of pending `WriterCommitMessage` instances passed to add. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer....
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20386#discussion_r164907397 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/streaming/writer/StreamWriter.java --- @@ -32,40 +32,44 @@ @InterfaceStability.Evolving public interface StreamWriter extends DataSourceWriter { /** - * Commits this writing job for the specified epoch with a list of commit messages. The commit - * messages are collected from successful data writers and are produced by - * {@link DataWriter#commit()}. + * Commits this writing job for the specified epoch. * - * If this method fails (by throwing an exception), this writing job is considered to have been - * failed, and the execution engine will attempt to call {@link #abort(WriterCommitMessage[])}. + * When this method is called, the number of commit messages added by + * {@link #add(WriterCommitMessage)} equals to the number of input data partitions. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort()} would be called. The state of the destination + * is undefined and @{@link #abort()} may not be able to deal with it. --- End diff -- Nit: javadoc typo. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20386: [SPARK-23202][SQL] Break down DataSourceV2Writer.commit ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20386 @cloud-fan, is the intent to get this into 2.3.0? If so, I'll make time to review it today. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20427: [SPARK-23260][SPARK-23262][SQL] several data sour...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20427#discussion_r164807449 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SessionConfigSupport.java --- @@ -25,7 +25,7 @@ * session. */ @InterfaceStability.Evolving -public interface SessionConfigSupport { +public interface SessionConfigSupport extends DataSourceV2 { --- End diff -- Ping me on the new PR. I'm happy to review it (though it is non-binding). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20427: [SPARK-23260][SPARK-23262][SQL] several data sour...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20427#discussion_r164806676 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SessionConfigSupport.java --- @@ -25,7 +25,7 @@ * session. */ @InterfaceStability.Evolving -public interface SessionConfigSupport { +public interface SessionConfigSupport extends DataSourceV2 { --- End diff -- Mixing large migration commits like this one with unrelated changes makes it harder to pick or revert changes without unintended side-effects. What happens if we realize that this rename was a bad idea? Reverting this commit would also revert the constraint that SessionConfigSupport extends DataSourceV2. Similarly, if we realize that these mix-ins don't need to extend DataSourceV2, then we would have to find and remove them all instead of reverting a commit. That might even sound okay, but when you're picking commits deliberately to patch branches, you need to make as few changes as possible and cherry-pick conflicts make that much harder. The fact that you're rushing to get commits into 2.3 is even more concerning and reason to be careful, not a reason to relax our standards. Please move this to its own PR and fix all of the interfaces at once. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20427: [SPARK-23260][SQL] remove V2 from the class name ...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20427#discussion_r164621094 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SessionConfigSupport.java --- @@ -25,7 +25,7 @@ * session. */ @InterfaceStability.Evolving -public interface SessionConfigSupport { +public interface SessionConfigSupport extends DataSourceV2 { --- End diff -- It's best to keep commits clean and focused. I'd say create a new JIRA for it and do all of the mix-ins at once. +1 when this is separated to its own PR. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 > It's hard to improve PhysicalOperation to support more operators and specific push down orders, so I created the new one I'm concerned about the new one. The projection support seems really brittle because it calls out specific logical nodes and scans the entire plan. If we are doing push-down wrong on the current v1 and Hive code paths, then I'd like to see a proposal for fixing that without these drawbacks. I like that this PR pushes projections and filters just like the other paths. We should start there and add additional push-down as necessary. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 > [The push-down rule may be run more than once if filters are not pushed through projections] looks weird, do you have a query to reproduce this issue? One of the DataSourceV2 tests hit this. I thought it was a good thing to push a single node down at a time and not depend on order. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 > I'd suggest that we just propogate the paths parameter to options, and data source implementations are free to interprete the path option to whatever they want, e.g. table and database names. What about code paths that expect table names? In our branch, we've added support for converting Hive relations (which have a `TableIdentifier`, not a path) and using `insertInto`. Table names and paths are the two main ways to identify tables, and I think both should be supported. This is a new API, so it doesn't matter that `load` and `save` currently use paths. We can easily update that to support tables. If we don't, then there will be no common way to refer to tables: some implementations will use `table`, some will pass `db` separately, and some might use `database`. Standardizing this and adding support in Spark will produce more consistent behavior across data sources. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
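To sketch what "standardizing this" could look like, here is a minimal, hypothetical helper. The option keys and the local `TableIdentifier` stand-in are assumptions for illustration, not keys or classes defined by Spark.

```scala
// Illustrative only: one helper turns a table identifier into well-defined option
// keys, instead of each source inventing its own ("table", "db", "database", ...).
// The key names and this local TableIdentifier stand-in are assumptions.
case class TableIdentifier(table: String, database: Option[String] = None)

object TableOptions {
  def asOptions(ident: TableIdentifier): Map[String, String] =
    Map("table" -> ident.table) ++ ident.database.map("database" -> _)
}

// Usage: the same convention is applied wherever options are built.
// TableOptions.asOptions(TableIdentifier("events", Some("prod")))
//   == Map("table" -> "events", "database" -> "prod")
```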
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 > I'm ok to make it immutable if there is an significant benefit. Mutable nodes violate a basic assumption of catalyst, that trees are immutable. Here's a good quote from the SIGMOD paper (by @rxin, @yhuai, and @marmbrus et al.): > In our experience, functional transformations on immutable trees make the whole optimizer very easy to reason about and debug. They also enable parallelization in the optimizer, although we do not yet exploit this. Mixing mutable nodes into supposedly immutable trees is a bad idea. Other nodes in the tree assume that children do not change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
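A small sketch of the functional-transformation style the quote describes, using simplified stand-in nodes rather than catalyst's real classes: push-down returns a new node via `copy` instead of mutating state held inside an existing one.

```scala
// Simplified illustration of immutable plan nodes. The field names are stand-ins;
// the point is that push-down produces a new relation rather than mutating a
// reader held inside the node.
case class PushableFilter(condition: String)

case class SimpleRelation(
    options: Map[String, String],
    pushedFilters: Seq[PushableFilter] = Nil)

object PushDown {
  def pushFilters(relation: SimpleRelation, filters: Seq[PushableFilter]): SimpleRelation = {
    // The original node is untouched; any parent that already holds a reference
    // to it still sees the same tree, which is what the optimizer relies on.
    relation.copy(pushedFilters = relation.pushedFilters ++ filters)
  }
}
```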
[GitHub] spark pull request #20427: [SPARK-23260][SQL] remove V2 from the class name ...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20427#discussion_r164544288 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SessionConfigSupport.java --- @@ -25,7 +25,7 @@ * session. */ @InterfaceStability.Evolving -public interface SessionConfigSupport { +public interface SessionConfigSupport extends DataSourceV2 { --- End diff -- Why does this need to extend DataSourceV2? Why add this in a commit that appears to be nothing more than a rename? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20427: [SPARK-23260][SQL] remove V2 from the class name ...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20427#discussion_r164543216 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java --- @@ -18,23 +18,23 @@ package org.apache.spark.sql.sources.v2; import org.apache.spark.annotation.InterfaceStability; -import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader; +import org.apache.spark.sql.sources.v2.reader.DataSourceReader; /** * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to * provide data reading ability and scan the data from the data source. */ @InterfaceStability.Evolving -public interface ReadSupport { +public interface ReadSupport extends DataSourceV2 { /** - * Creates a {@link DataSourceV2Reader} to scan the data from this data source. + * Creates a {@link DataSourceReader} to scan the data from this data source. * * If this method fails (by throwing an exception), the action would fail and no Spark job was * submitted. * * @param options the options for the returned data source reader, which is an immutable *case-insensitive string-to-string map. */ - DataSourceV2Reader createReader(DataSourceV2Options options); + DataSourceReader createReader(DataSourceV2Options options); --- End diff -- Why not rename options as well? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20397 This is more confusing, not less. Look at @jiangxb1987's comment above: "We shall create only one DataReaderFactory, and have that create multiple data readers." It is not clear why the API requires a list of factories, instead of using just one. If this is renamed to factory, is it a requirement that the factory can create more than one data reader for the same task? To the point about serializing and sending to executors, "factory" doesn't imply that any more than "task" does. The fact that these are serialized needs to be clear in documentation. The read and write side behave differently. They do not need to mirror one another's naming when that makes names less precise. This isn't forcing users to look at a subtle difference. It is just breaking the (wrong) assumption that both read and write sides have the same behavior. @rxin, any opinion here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20397 > I think the renaming is worth to remove future confusions. What future confusion? I understand that the difference isn't obvious, but making the names less accurate isn't a good fix. The read and write sides don't have to mirror one another. They behave differently and that's okay. Names should be based on what the classes actually do. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20397 One last point: should significant changes to public APIs like this go in just before or just after a release? 2.3.0 candidates have used ReadTask up to now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20397 @cloud-fan, thanks for pinging me on this. -1: I don't think there's a compelling benefit to justify this change, and I think it makes the API more confusing. I think we should revert this. This class doesn't actually behave as a factory and is used more like an Iterable: it is only used to instantiate one DataReader and carries no explicit guarantee that it can be reused. In addition, the piece of work that each one represents is a task, which becomes an actual task when the stage runs. I would much rather keep the ReadTask name to make that connection clear. The write side does behave like a factory, so the name is appropriate there. There is little value to uniform names if the names actually make the API more confusing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
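For readers following the naming debate, here is a simplified sketch of how each planned piece of work is consumed on an executor. The interfaces below are stand-ins, not the exact v2 signatures; the point is that each task object yields exactly one reader and is drained once, which is closer to a task (or Iterable) than to a reusable factory.

```scala
// Simplified stand-ins for the interfaces under discussion.
trait SimpleDataReader[T] {
  def next(): Boolean
  def get(): T
  def close(): Unit
}

trait SimpleReadTask[T] {
  def preferredLocations: Seq[String]
  def createReader(): SimpleDataReader[T]
}

object TaskRunner {
  def runOnce[T](task: SimpleReadTask[T])(process: T => Unit): Unit = {
    val reader = task.createReader() // one reader per task
    try {
      while (reader.next()) {
        process(reader.get())
      }
    } finally {
      reader.close()
    }
  }
}
```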
[GitHub] spark pull request #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Us...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r164169060 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,15 +17,149 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options, ReadSupport, ReadSupportWithSchema, WriteSupport} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceV2Reader) extends LeafNode with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, --- End diff -- That's why these are options. Passing either path or table name is the most common case, which we should have good support for. If tables are identified in other ways, that's supported. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Us...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r163913161 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,15 +17,149 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options, ReadSupport, ReadSupportWithSchema, WriteSupport} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceV2Reader) extends LeafNode with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, --- End diff -- I guess another way to say this is that it's better to set reliable path, database, and table parameters after passing them explicitly, than to require that all the places where DataSourceV2Relations are created do the same thing. Better to standardize passing these options in `v2Options`, and it would be even better to pass these directly to the readers and writers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Us...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r163909751 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,15 +17,149 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options, ReadSupport, ReadSupportWithSchema, WriteSupport} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceV2Reader) extends LeafNode with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, +projection: Option[Seq[AttributeReference]] = None, +filters: Option[Seq[Expression]] = None, --- End diff -- I'm not sure I understand what you mean. When something is pushed, it creates a new immutable relation, so I think it has to be added to the relation. But I'm not sure that many things will be pushed besides the projection and filters. What are you thinking that we would need to add? Fragments of logical plan? Assuming we add the ability to push parts of the logical plan, then this would need to have a reference to the part that was pushed down. I'm not sure that would be this relation class, a subclass, or something else, but I would be fine adding a third push-down option here. The number of things to push down isn't very large, is it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Us...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20387#discussion_r163908263 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,15 +17,149 @@ package org.apache.spark.sql.execution.datasources.v2 -import org.apache.spark.sql.catalyst.expressions.AttributeReference +import java.util.UUID + +import scala.collection.JavaConverters._ +import scala.collection.mutable + +import org.apache.spark.sql.{AnalysisException, SaveMode} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, Statistics} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.sources.v2.{DataSourceV2, DataSourceV2Options, ReadSupport, ReadSupportWithSchema, WriteSupport} import org.apache.spark.sql.sources.v2.reader._ +import org.apache.spark.sql.sources.v2.writer.DataSourceV2Writer +import org.apache.spark.sql.types.StructType case class DataSourceV2Relation( -fullOutput: Seq[AttributeReference], -reader: DataSourceV2Reader) extends LeafNode with DataSourceReaderHolder { +source: DataSourceV2, +options: Map[String, String], +path: Option[String] = None, +table: Option[TableIdentifier] = None, --- End diff -- We could keep these in options, but because they are the main two ways to identify tables, they should be easier to work with. I'd even suggest adding them to the DataSourceV2 read and write APIs. Another benefit of adding these is that it is easier to use DataSourceV2Relation elsewhere. In our Spark build, I've added a rule to convert Hive relations to DataSourceV2Relation based on a table property. That's cleaner because we can pass the TableIdentifier instead of adding options to the map. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
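Below is a self-contained sketch of the kind of conversion rule described here, with stand-in plan nodes rather than Spark's real ones. It is only meant to show why passing the identifier through directly is cleaner than re-encoding it as string options at every place that builds the relation.

```scala
// Stand-in plan nodes, not Spark's real classes; names are illustrative.
case class TableIdentifier(table: String, database: Option[String] = None)

sealed trait Plan
case class CatalogRelation(ident: TableIdentifier, provider: String) extends Plan
case class V2Relation(
    source: String,
    options: Map[String, String],
    table: Option[TableIdentifier] = None) extends Plan

object ConvertToV2 {
  // A resolution-style rewrite: relations whose provider is a v2 source are
  // replaced by a V2Relation that carries the identifier as-is, instead of
  // flattening it into the options map at this call site.
  def apply(plan: Plan): Plan = plan match {
    case CatalogRelation(ident, provider) if provider == "my-v2-source" =>
      V2Relation(provider, options = Map.empty, table = Some(ident))
    case other => other
  }
}
```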
[GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20387 @cloud-fan, please have a look at these changes. This will require follow-up for the Streaming side. I have yet to review the streaming interfaces for `DataSourceV2`, so I haven't made any changes there. In our Spark build, I've also moved the write path to use DataSourceV2Relation, which I intend to do in a follow-up to this issue. @rxin FYI. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20387: SPARK-22386: DataSourceV2: Use immutable logical ...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/20387 SPARK-22386: DataSourceV2: Use immutable logical plans. ## What changes were proposed in this pull request? DataSourceV2 should use immutable catalyst trees instead of wrapping a mutable DataSourceV2Reader. This commit updates DataSourceV2Relation and consolidates much of the DataSourceV2 API requirements for the read path in it. Instead of wrapping a reader that changes, the relation lazily produces a reader from its configuration. This commit also updates the predicate and projection push-down. Instead of the implementation from SPARK-22197, this reuses the rule matching from the Hive and DataSource read paths (using `PhysicalOperation`) and copies most of the implementation of `SparkPlanner.pruneFilterProject`, with updates for DataSourceV2. By reusing the implementation from other read paths, this should have fewer regressions from other read paths and is less code to maintain. The new push-down rules also support the following edge cases: * The output of DataSourceV2Relation should be what is returned by the reader, in case the reader can only partially satisfy the requested schema projection * The requested projection passed to the DataSourceV2Reader should include filter columns * The push-down rule may be run more than once if filters are not pushed through projections ## How was this patch tested? Existing push-down and read tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rdblue/spark SPARK-22386-push-down-immutable-trees Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20387.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20387 commit d3233e1a8b1d4d153146b1a536dee34246920b0d Author: Ryan Blue <blue@...> Date: 2018-01-17T21:58:12Z SPAKR-22386: DataSourceV2: Use immutable logical plans. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
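The "lazily produces a reader from its configuration" idea from the PR description can be outlined roughly as below. This is a sketch under assumed names, not the code in the PR.

```scala
// Simplified outline of a relation that derives its reader on demand from
// immutable configuration. ReaderLike and SourceLike are stand-ins for the
// v2 interfaces; field names are assumptions.
trait ReaderLike { def readSchema(): Seq[String] }
trait SourceLike { def createReader(options: Map[String, String]): ReaderLike }

case class ImmutableRelation(
    source: SourceLike,
    options: Map[String, String],
    projection: Option[Seq[String]] = None,
    filters: Option[Seq[String]] = None) {

  // The reader is not mutable state shared across the plan: it is (re)created
  // from the relation's immutable fields whenever it is needed.
  lazy val reader: ReaderLike = source.createReader(options)

  // The relation reports whatever the reader can actually produce, in case the
  // reader only partially satisfies the requested projection.
  def output: Seq[String] = reader.readSchema()
}
```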
[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19861 Thanks for the example, @cloud-fan. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19861 @jiangxb1987, I understand what this does. I just wanted an example use case where it was necessary. What was the motivating *use case*? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19861 @jiangxb1987, @cloud-fan, what was the use case you needed to add this for? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20201: [SPARK-22389][SQL] data source v2 partitioning reporting...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20201 @cloud-fan, please ping me to review PRs for DataSourceV2. Our new table format uses it and we're preparing some changes, so I want to make sure we're heading in the same direction for this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue closed the pull request at: https://github.com/apache/spark/pull/19568 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
GitHub user rdblue reopened a pull request: https://github.com/apache/spark/pull/19568 SPARK-22345: Fix sort-merge joins with conditions and codegen. ## What changes were proposed in this pull request? This adds a joined row to sort-merge join codegen. That joined row is used to generate code for filter expressions, which may fall back to using the result row. Previously, the right side of the join was used, which is incorrect (the non-codegen implementations use a joined row). ## How was this patch tested? Current tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rdblue/spark SPARK-22345-fix-sort-merge-codegen Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19568.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19568 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19568: SPARK-22345: Fix sort-merge joins with conditions and co...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19568 @gatorsmile, I think it would be better to fix codegen than to prevent it from happening with an assertion. If `CodegenFallback` can produce fallback code, why not allow it to when necessary? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] incubator-toree pull request #147: Support for --spark-context-initializatio...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/incubator-toree/pull/147#discussion_r152119598 --- Diff: kernel/src/main/scala/org/apache/toree/boot/CommandLineOptions.scala --- @@ -193,6 +199,19 @@ class CommandLineOptions(args: Seq[String]) { Some(userDefined.asJava) } + private def onlyPositiveOtherwiseDefault(spec: OptionSpec[Int]): Option[Int] = { --- End diff -- I don't think it is a good idea to silently ignore user input. If the value passed is -1, for example, then the user gets 100ms. If what the user requests isn't possible, then this should fail and print the help message. ---
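A hedged sketch of the fail-fast alternative being suggested: reject non-positive values loudly instead of quietly substituting a default. The option name and message below are invented for illustration.

```scala
// Illustrative only: validate a numeric CLI option and fail loudly on bad input
// rather than silently falling back to a default the user did not ask for.
object CliValidation {
  def requirePositive(name: String, value: Int): Int = {
    if (value <= 0) {
      // Surfacing the problem (ideally alongside the usage/help text) beats
      // quietly handing back 100ms when the user typed -1.
      throw new IllegalArgumentException(
        s"Invalid value for --$name: $value (must be a positive integer)")
    }
    value
  }
}

// CliValidation.requirePositive("some-timeout-option", -1)  // fails fast
```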
[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19623#discussion_r148875607 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java --- @@ -50,28 +53,34 @@ /** * Creates a writer factory which will be serialized and sent to executors. + * + * If this method fails (by throwing an exception), the action would fail and no Spark job was + * submitted. */ DataWriterFactory createWriterFactory(); /** * Commits this writing job with a list of commit messages. The commit messages are collected from - * successful data writers and are produced by {@link DataWriter#commit()}. If this method - * fails(throw exception), this writing job is considered to be failed, and - * {@link #abort(WriterCommitMessage[])} will be called. The written data should only be visible - * to data source readers if this method succeeds. + * successful data writers and are produced by {@link DataWriter#commit()}. + * + * If this method fails (by throwing an exception), this writing job is considered to to have been + * failed, and {@link #abort(WriterCommitMessage[])} would be called. The state of the destination + * is undefined and @{@link #abort(WriterCommitMessage[])} may not be able to deal with it. * * Note that, one partition may have multiple committed data writers because of speculative tasks. * Spark will pick the first successful one and get its commit message. Implementations should be --- End diff -- I haven't read Steve's points here entirely, but I agree that Spark should be primarily responsible for task commit coordination. Most implementations would be fine using the current [output commit coordinator](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala), which does a good job balancing the trade-offs that you've been discussing. It ensures that only one task is authorized to commit and has well-defined failure cases (when a network partition prevents the authorized committer from responding before its commit authorization times out). I think that Spark should use the current commit coordinator unless an implementation opts out of using it (and I'm not sure that opting out is a use case we care to support at this point). It's fine if Spark documents how its coordinator works and there are some drawbacks, but expecting implementations to handle their own commit coordination (which requires RPC for Spark) is, I think, unreasonable. Let's use the one we have by default, however imperfect. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19623#discussion_r148849054 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java --- @@ -36,14 +36,24 @@ /** * The preferred locations where this read task can run faster, but Spark does not guarantee that * this task will always run on these locations. The implementations should make sure that it can - * be run on any location. The location is a string representing the host name of an executor. + * be run on any location. The location is a string representing the host name. + * + * Note that if a host name cannot be recognized by Spark, it will be ignored as it was not in + * the returned locations. By default this method returns empty string, which means this task --- End diff -- This isn't the empty string, it is a 0-length array. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19623#discussion_r148848790 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReader.java --- @@ -34,11 +35,17 @@ /** * Proceed to next record, returns false if there is no more records. + * + * If this method fails (by throwing an exception), the corresponding Spark task would fail and + * get retried until hitting the maximum retry times. */ - boolean next(); + boolean next() throws IOException; --- End diff -- Should clarify when it is okay to throw IOException with `@throws`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19623: [SPARK-22078][SQL] clarify exception behaviors fo...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19623#discussion_r148848619 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ReadSupport.java --- @@ -30,6 +30,9 @@ /** * Creates a {@link DataSourceV2Reader} to scan the data from this data source. * + * If this method fails (by throwing an exception), the action would fail and no Spark job was --- End diff -- Are there recommended exceptions to throw? It would be great if this specified what exception classes implementations should use for problems with `@throws` javadoc entries. For example, `@throws AnalysisException If ...`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
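As a concrete illustration of the `@throws` suggestion in the last two comments, a documented method might look like the sketch below. It is written in Scala/Scaladoc to match the other sketches even though the actual interfaces are Java, and the exception choices here are examples, not a documented contract of the v2 API.

```scala
import java.io.IOException

// Illustrative interface showing the style of @throws documentation being asked for.
trait ExampleReadSupport {
  /**
   * Creates a reader to scan data from this source.
   *
   * @param options an immutable, case-insensitive string-to-string map
   * @return a configured reader
   * @throws IllegalArgumentException if a required option is missing or invalid
   * @throws IOException if the source's metadata cannot be reached
   */
  def createReader(options: Map[String, String]): ExampleReader
}

trait ExampleReader
```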
[GitHub] spark issue #19568: SPARK-22345: Fix sort-merge joins with conditions and co...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19568 @DonnyZone, I don't know of any cases that use codegen after the fix for `CodegenFallback`, but I think this is still a good idea. If Spark is going to generate code, it should generate correct code. That means either we remove the codegen implementation from `CodegenFallback`, or we fix the row passed in by sort-merge join. I'd rather fix sort-merge join because we may want to change the behavior to codegen as much of the condition as possible later on. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13206: [SPARK-15420] [SQL] Add repartition and sort to prepare ...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/13206 We still maintain a version of this for our Spark builds to avoid an extra sort in Hive. If someone is willing to review it, I can probably find the time to rebase it on master. I think the year this sat initially was just because the 2.0 release was happening at the same time and there wasn't much bandwidth for reviews. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19568#discussion_r147471530 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala --- @@ -124,7 +125,8 @@ class InnerJoinSuite extends SparkPlanTest with SharedSQLContext { rightPlan: SparkPlan) = { val sortMergeJoin = joins.SortMergeJoinExec(leftKeys, rightKeys, Inner, boundCondition, leftPlan, rightPlan) - EnsureRequirements(spark.sessionState.conf).apply(sortMergeJoin) + EnsureRequirements(spark.sessionState.conf) + .apply(ProjectExec(sortMergeJoin.output, sortMergeJoin)) --- End diff -- In 2.1.1, an extra project causes `WholeStageCodegenExec` to not detect that the expression contains `CodegenFallback`. This is no longer the case. Like I said, there is no longer a good way to test what happens when `CodegenFallback` generates code. If there were, I'd use that here to test the case. I guess I could add a testing case to `WholeStageCodegenExec` to make sure the code is generated correctly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19568#discussion_r147471054 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala --- @@ -228,6 +230,27 @@ class InnerJoinSuite extends SparkPlanTest with SharedSQLContext { ) ) + testInnerJoin( --- End diff -- This test fails in 2.1.1 and versions before https://github.com/apache/spark/commit/6b6dd682e84d3b03d0b15fbd81a0d16729e521d2. I'm not sure how to exercise the code generated by CodegenFallback with that fix, but this test is valid for the 2.1.1 branch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19568#discussion_r14733 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala --- @@ -585,21 +585,26 @@ case class SortMergeJoinExec( val iterator = ctx.freshName("iterator") val numOutput = metricTerm(ctx, "numOutputRows") +val joinedRow = ctx.freshName("joined") val (beforeLoop, condCheck) = if (condition.isDefined) { // Split the code of creating variables based on whether it's used by condition or not. val loaded = ctx.freshName("loaded") val (leftBefore, leftAfter) = splitVarsByCondition(left.output, leftVars) val (rightBefore, rightAfter) = splitVarsByCondition(right.output, rightVars) + // Generate code for condition + ctx.INPUT_ROW = joinedRow --- End diff -- Fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19568#discussion_r146975643 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala --- @@ -585,21 +585,26 @@ case class SortMergeJoinExec( val iterator = ctx.freshName("iterator") val numOutput = metricTerm(ctx, "numOutputRows") +val joinedRow = ctx.freshName("joined") --- End diff -- The second problem was fixed in this commit: https://github.com/apache/spark/commit/6b6dd682e84d3b03d0b15fbd81a0d16729e521d2 I still think that the codegen problem should be fixed. Detection of `CodegenFallback` is imperfect, and Spark will still generate that code and run it. I think we should either remove codegen from `CodegenFallback` or add this fix to ensure that code works, even if we don't expect to run it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19568#discussion_r146974005 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala --- @@ -585,21 +585,26 @@ case class SortMergeJoinExec( val iterator = ctx.freshName("iterator") val numOutput = metricTerm(ctx, "numOutputRows") +val joinedRow = ctx.freshName("joined") --- End diff -- It ended up being a bit more complicated. There are two problems. The first is what this fixes: the INPUT_ROW in the codegen context points to the wrong row. This is fixed and now has a test that fails if you uncomment the line that sets INPUT_ROW. The second problem is that the check for `CodegenFallback` fails to consider whether the condition supports codegen in some plans. To get the test to fail, I had to add a projection to exercise the [path where this happens](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L524). I'll add a second commit for this problem. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
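For illustration, a minimal sketch of the fix pattern for the first problem, simplified from the diff above; the surrounding names (`ctx`, `condition`, `left`, `right`) are assumed from the enclosing `SortMergeJoinExec` code-generation method and are not reproduced exactly from the patch:

```scala
import org.apache.spark.sql.catalyst.expressions.BindReferences

// Sketch only: point the codegen context's INPUT_ROW at a joined row *before*
// generating code for the join condition, so that any CodegenFallback expression
// in the condition evaluates against (left ++ right) rather than whichever row
// INPUT_ROW happened to reference previously.
val joinedRow = ctx.freshName("joined")   // name of the JoinedRow variable in generated code
ctx.INPUT_ROW = joinedRow                 // fallback eval(...) calls will read from this row
val boundCondition = BindReferences.bindReference(condition.get, left.output ++ right.output)
val conditionCode = boundCondition.genCode(ctx)  // generated code now uses the joined row
```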
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19568#discussion_r146970055 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala --- @@ -615,6 +620,7 @@ case class SortMergeJoinExec( } s""" + |$joinedRow = new JoinedRow(); --- End diff -- Fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17813: [SPARK-20540][CORE] Fix unstable executor request...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/17813#discussion_r146939350 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala --- @@ -589,8 +605,18 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp // take into account executors that are pending to be added or removed. val adjustTotalExecutors = if (!replace) { - doRequestTotalExecutors( -numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size) + requestedTotalExecutors = math.max(requestedTotalExecutors - executorsToKill.size, 0) + if (requestedTotalExecutors != --- End diff -- This is just informational. The problem is that the state of the allocation manager isn't synced with the scheduler. Instead, the allocator sends messages to try to control the scheduler backend to get the same state. For example, instead of telling the scheduler backend that the desired number of executors is 10, the allocator sends a message to add 2 executors. When this gets out of sync because of failures or network delay, you end up with these messages. When you see these, make sure you're just out of sync (and will eventually get back in sync), and not in a state where the scheduler and allocator can't reconcile the required number of executors. That's what this PR tried to fix. The long-term solution is to update the communication so that the allocator requests its ideal state, always telling the scheduler backend how many executors it currently needs, instead of killing or requesting more. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
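To illustrate the difference between the two communication styles described above, here is a small sketch; the trait and method names are illustrative stand-ins, not the actual Spark interfaces:

```scala
// Declarative vs. delta-based executor requests (illustrative sketch).
trait SchedulerBackendClient {
  def requestTotalExecutors(total: Int): Boolean // declarative: "I need N in total"
  def requestExecutors(additional: Int): Boolean // delta-based: "add N more"
}

class AllocatorSketch(client: SchedulerBackendClient) {
  // Declarative: recompute the target and resend it on every sync. A lost or
  // delayed message is harmless because the next sync converges on the target.
  def syncTarget(pendingTasks: Int, tasksPerExecutor: Int): Unit = {
    val target = math.ceil(pendingTasks.toDouble / tasksPerExecutor).toInt
    client.requestTotalExecutors(target)
  }

  // Delta-based: correctness depends on both sides agreeing on the current
  // count, so failures or reordering leave the allocator and the scheduler
  // backend out of sync, which produces the warnings discussed above.
  def addExecutors(delta: Int): Unit = client.requestExecutors(delta)
}
```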
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19568#discussion_r146928894 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala --- @@ -585,21 +585,26 @@ case class SortMergeJoinExec( val iterator = ctx.freshName("iterator") val numOutput = metricTerm(ctx, "numOutputRows") +val joinedRow = ctx.freshName("joined") --- End diff -- The joined row should always be used for correctness. We don't know what code the expression will generate, so we should plan on always passing the correct input row. Setting left and right on a joined row is a cheap operation, so I'd rather do it correctly than rely on something brittle like `isInstanceOf[CodegenFallback]`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
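As a rough illustration of why reusing one joined row is cheap (sketch only; `condition` here is a stand-in for whatever predicate implementation evaluates the join condition, interpreted or generated):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow

// One JoinedRow per task: withLeft/withRight only swap references, so the
// condition always sees both sides of the join no matter how it is evaluated.
val joined = new JoinedRow()

def evalCondition(left: InternalRow, right: InternalRow,
                  condition: InternalRow => Boolean): Boolean = {
  condition(joined.withLeft(left).withRight(right))
}
```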
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19568#discussion_r146928318 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala --- @@ -615,6 +620,7 @@ case class SortMergeJoinExec( } s""" + |$joinedRow = new JoinedRow(); --- End diff -- Yeah, that's causing the test failures. This is a typo from some restructuring I did to get this upstream. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19568: SPARK-22345: Fix sort-merge joins with conditions and co...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19568 @dongjoon-hyun, yes, I'm currently working on it. I just wanted to get the rest up. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19568: SPARK-22345: Fix sort-merge joins with conditions...
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/19568 SPARK-22345: Fix sort-merge joins with conditions and codegen. ## What changes were proposed in this pull request? This adds a joined row to sort-merge join codegen. That joined row is used to generate code for filter expressions, which may fall back to using the result row. Previously, the right side of the join was used, which is incorrect (the non-codegen implementations use a joined row). ## How was this patch tested? Current tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rdblue/spark SPARK-22345-fix-sort-merge-codegen Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19568.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19568 commit 4afb088a4fa2127cab7467cc56f58cd77bd8c251 Author: Ryan Blue <b...@apache.org> Date: 2017-10-24T20:21:50Z SPARK-22345: Fix sort-merge joins with conditions and codegen. Code for the condition was generated to depend on the right row instead of the joined row. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19448 I have a lot of sympathy for the argument that infrastructure software shouldn't have too many backports and that those should generally be bug fixes. But, if I were working on a Spark distribution at a vendor, this is something I would definitely include because it's such a useful feature. I think that by not backporting this, we're just pushing that work downstream. Plus, the risk of adding this is low: the main behavior change is that users can specify a previously-banned committer for Parquet writes. Is it a bug fix? Probably not. But it fixes a big blocker. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r144388430 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,10 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +require(!conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) --- End diff -- I think once per write operation is fine. It's not like it is once per file. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/19448 Still +1 from me as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19448#discussion_r144331909 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -138,6 +138,10 @@ class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +require(!conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) --- End diff -- I think I'd prefer the warn & continue option. It does little good to fail so late in a job, when the caller has already indicated that they want to use a different committer. Let them write the data out since this isn't a correctness issue, and they can add a summary file later if they want. Basically, there's less annoyance and interruption by not writing a summary file than by failing a job and forcing the user to re-run near the end. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
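A minimal sketch of the warn-and-continue alternative described above; the object, method, and parameter names are assumptions for illustration, not the actual patch:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.{ParquetOutputCommitter, ParquetOutputFormat}
import org.slf4j.LoggerFactory

object SummaryFileCheck {
  private val log = LoggerFactory.getLogger(getClass)

  // If summary files were requested but the configured committer cannot write
  // them, log a warning and keep writing instead of failing the job late.
  def warnIfSummariesUnsupported(conf: Configuration, committerClass: Class[_]): Unit = {
    val summaryRequested = conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
    if (summaryRequested && !classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
      log.warn(s"Committer $committerClass is not a ParquetOutputCommitter; " +
        "no summary (_metadata) files will be written for this job.")
    }
  }
}
```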
[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r144097061 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.apache.spark.sql.sources.v2.DataSourceV2Options; +import org.apache.spark.sql.sources.v2.WriteSupport; +import org.apache.spark.sql.types.StructType; + +/** + * A data source writer that is returned by + * {@link WriteSupport#createWriter(StructType, SaveMode, DataSourceV2Options)}. + * It can mix in various writing optimization interfaces to speed up the data saving. The actual + * writing logic is delegated to {@link DataWriter}. + * + * The writing procedure is: + * 1. Create a writer factory by {@link #createWriterFactory()}, serialize and send it to all the + * partitions of the input data(RDD). + * 2. For each partition, create the data writer, and write the data of the partition with this + * writer. If all the data are written successfully, call {@link DataWriter#commit()}. If + * exception happens during the writing, call {@link DataWriter#abort()}. + * 3. If all writers are successfully committed, call {@link #commit(WriterCommitMessage[])}. If + * some writers are aborted, or the job failed with an unknown reason, call + * {@link #abort(WriterCommitMessage[])}. + * + * Spark won't retry failed writing jobs, users should do it manually in their Spark applications if + * they want to retry. + * + * Please refer to the document of commit/abort methods for detailed specifications. + * + * Note that, this interface provides a protocol between Spark and data sources for transactional + * data writing, but the transaction here is Spark-level transaction, which may not be the + * underlying storage transaction. For example, Spark successfully writes data to a Cassandra data + * source, but Cassandra may need some more time to reach consistency at storage level. + */ +@InterfaceStability.Evolving +public interface DataSourceV2Writer { + + /** + * Creates a writer factory which will be serialized and sent to executors. + */ + DataWriterFactory createWriterFactory(); + + /** + * Commits this writing job with a list of commit messages. The commit messages are collected from + * successful data writers and are produced by {@link DataWriter#commit()}. If this method + * fails(throw exception), this writing job is considered to be failed, and + * {@link #abort(WriterCommitMessage[])} will be called. 
The written data should only be visible + * to data source readers if this method successes. + * + * Note that, one partition may have multiple committed data writers because of speculative tasks. + * Spark will pick the first successful one and get its commit message. Implementations should be + * aware of this and handle it correctly, e.g., have a mechanism to make sure only one data writer + * can commit successfully, or have a way to clean up the data of already-committed writers. + */ + void commit(WriterCommitMessage[] messages); + + /** + * Aborts this writing job because some data writers are failed to write the records and aborted, + * or the Spark job fails with some unknown reasons, or {@link #commit(WriterCommitMessage[])} + * fails. If this method fails(throw exception), the underlying data source may have garbage that + * need to be cleaned manually, but these garbage should not be visible to data source readers. + * + * Unless the abortion is triggered by the failure of commit, the given messages should have s