Michael Armbrust created SPARK-2094:
---------------------------------------
Summary: Ensure exactly once semantics for DDL / Commands
Key: SPARK-2094
URL: https://issues.apache.org/jira/browse/SPARK-2094
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Fix For: 1.1.0
From [~lian cheng]...
The constraints presented here are:
* The side effect of a command SchemaRDD should take place eagerly;
* The side effect of a command SchemaRDD should take place once and only once;
* When the .collect() method is called, something meaningful, usually the output
message lines of the command, should be returned.
Then how about adding a lazy field inside all the physical command nodes to
wrap up the side effect and hold the command output? Take the
SetCommandPhysical as an example:
{code}
// Scala 2 traits cannot take constructor parameters, and an abstract member
// cannot be declared lazy, so the trait only declares the accessor.
trait PhysicalCommand {
  // Implementations override this with a lazy val that wraps the side effect.
  def commandOutput: Any
}

case class SetCommandPhysical(
    key: Option[String], value: Option[String], output: Seq[Attribute])(
    @transient context: SQLContext)
  extends PhysicalCommand {

  override lazy val commandOutput: Any = {
    // Perform the side effect, and record appropriate output
    ???
  }

  def execute(): RDD[Row] = {
    val row = new GenericRow(Array[Any](commandOutput))
    // parallelize expects a Seq, so wrap the single row
    context.sparkContext.parallelize(Seq(row), 1)
  }
}
{code}
In this way, all the constraints are met:
* Eager evaluation: done by the toRdd call in SchemaRDDLike (PR #948),
* Side effect should take place once and only once: ensured by the lazy
commandOutput field,
* Present meaningful output as RDD contents: command output is held by
commandOutput and returned in execute().
An additional benefit is that the side-effect logic of each command can be
implemented within its own physical command node, instead of adding special
cases inside SQLContext.toRdd and/or HiveContext.toRdd.
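
To illustrate the "once and only once" property without depending on Spark, here is a minimal, self-contained sketch. The names Command, SetCommand, sideEffectCount and the settings map are hypothetical stand-ins, not part of the proposal or of Spark's API; execute() returns a plain Seq instead of an RDD.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a physical command node.
trait Command {
  // Overridden by a lazy val so the side effect runs at most once.
  def commandOutput: Seq[String]

  // Stand-in for execute(): every call returns the cached output.
  def execute(): Seq[String] = commandOutput
}

// Hypothetical stand-in for a SET command that mutates a settings map.
class SetCommand(key: String, value: String, settings: mutable.Map[String, String])
    extends Command {

  // Counts how many times the side effect actually ran (for illustration only).
  var sideEffectCount = 0

  override lazy val commandOutput: Seq[String] = {
    sideEffectCount += 1
    settings(key) = value   // the side effect
    Seq(s"$key=$value")     // the recorded output
  }
}

object ExactlyOnceDemo {
  def main(args: Array[String]): Unit = {
    val settings = mutable.Map.empty[String, String]
    val cmd = new SetCommand("k", "v", settings)

    cmd.execute() // first call forces the lazy val and performs the side effect
    cmd.execute() // later calls reuse the cached output

    println(s"side effect ran ${cmd.sideEffectCount} time(s)")
  }
}
```

In this sketch, the first execute() call plays the role of the eager toRdd call, and any later .collect()-style access would only read the cached output. One caveat worth keeping in mind: if the initializer of a Scala lazy val throws, the next access evaluates it again, so a failing command could be retried.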