Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16481#discussion_r95905076

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -413,10 +413,85 @@ case class DataSource(
           relation
         }
     
    -  /** Writes the given [[DataFrame]] out to this [[DataSource]]. */
    -  def write(
    -      mode: SaveMode,
    -      data: DataFrame): BaseRelation = {
    +  /**
    +   * Writes the given [[DataFrame]] out in this [[FileFormat]].
    +   */
    +  private def writeInFileFormat(format: FileFormat, mode: SaveMode, data: DataFrame): Unit = {
    +    // Don't glob path for the write path. The contracts here are:
    +    // 1. Only one output path can be specified on the write path;
    +    // 2. Output path must be a legal HDFS style file system path;
    +    // 3. It's OK that the output path doesn't exist yet;
    +    val allPaths = paths ++ caseInsensitiveOptions.get("path")
    +    val outputPath = if (allPaths.length == 1) {
    +      val path = new Path(allPaths.head)
    +      val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
    +      path.makeQualified(fs.getUri, fs.getWorkingDirectory)
    +    } else {
    +      throw new IllegalArgumentException("Expected exactly one path to be specified, but " +
    +        s"got: ${allPaths.mkString(", ")}")
    +    }
    +
    +    val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
    +    PartitioningUtils.validatePartitionColumn(data.schema, partitionColumns, caseSensitive)
    +
    +    // If we are appending to a table that already exists, make sure the partitioning matches
    +    // up. If we fail to load the table for whatever reason, ignore the check.
    +    if (mode == SaveMode.Append) {
    +      val existingPartitionColumns = Try {
    +        getOrInferFileFormatSchema(format, justPartitioning = true)._2.fieldNames.toList
    +      }.getOrElse(Seq.empty[String])
    +      // TODO: Case sensitivity.
    +      val sameColumns =
    +        existingPartitionColumns.map(_.toLowerCase()) == partitionColumns.map(_.toLowerCase())
    +      if (existingPartitionColumns.nonEmpty && !sameColumns) {
    +        throw new AnalysisException(
    +          s"""Requested partitioning does not match existing partitioning.
    +             |Existing partitioning columns:
    +             |  ${existingPartitionColumns.mkString(", ")}
    +             |Requested partitioning columns:
    +             |  ${partitionColumns.mkString(", ")}
    +             |""".stripMargin)
    +      }
    +    }
    +
    +    // SPARK-17230: Resolve the partition columns so InsertIntoHadoopFsRelationCommand does
    +    // not need to have the query as child, to avoid to analyze an optimized query,
    +    // because InsertIntoHadoopFsRelationCommand will be optimized first.
    +    val partitionAttributes = partitionColumns.map { name =>
    +      val plan = data.logicalPlan
    +      plan.resolve(name :: Nil, data.sparkSession.sessionState.analyzer.resolver).getOrElse {
    +        throw new AnalysisException(
    +          s"Unable to resolve $name given [${plan.output.map(_.name).mkString(", ")}]")
    +      }.asInstanceOf[Attribute]
    +    }
    +    val fileIndex = catalogTable.map(_.identifier).map { tableIdent =>
    +      sparkSession.table(tableIdent).queryExecution.analyzed.collect {
    +        case LogicalRelation(t: HadoopFsRelation, _, _) => t.location
    +      }.head
    +    }
    +    // For partitioned relation r, r.schema's column ordering can be different from the column
    +    // ordering of data.logicalPlan (partition columns are all moved after data column). This
    +    // will be adjusted within InsertIntoHadoopFsRelation.
    +    val plan =
    +      InsertIntoHadoopFsRelationCommand(
    +        outputPath = outputPath,
    +        staticPartitions = Map.empty,
    +        partitionColumns = partitionAttributes,
    +        bucketSpec = bucketSpec,
    +        fileFormat = format,
    +        options = options,
    +        query = data.logicalPlan,
    +        mode = mode,
    +        catalogTable = catalogTable,
    +        fileIndex = fileIndex)
    +    sparkSession.sessionState.executePlan(plan).toRdd
    --- End diff --

    To the reviewers: the code moved into `writeInFileFormat` contains no changes; it was extracted verbatim from the old `write` method.
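
    For readers skimming the diff, here is a minimal, self-contained sketch of what the `makeQualified(fs.getUri, fs.getWorkingDirectory)` step above does. This is an illustration, not code from the PR: it assumes only hadoop-common on the classpath, the local file system stands in for HDFS, and the `output/parquet` path is made up.

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    object QualifyPathExample {
      def main(args: Array[String]): Unit = {
        // A relative, scheme-less path, as a user might pass to save().
        val raw = new Path("output/parquet")
        // With a default Configuration this resolves to the local file system.
        val fs = raw.getFileSystem(new Configuration())
        // The same call the diff uses: attach the file system's scheme and
        // authority, and resolve against the working directory. The path does
        // not need to exist yet (contract 3 above).
        val qualified = raw.makeQualified(fs.getUri, fs.getWorkingDirectory)
        println(qualified) // e.g. file:/home/user/output/parquet
      }
    }
    ```

    This is also why the write path never globs: exactly one output path is required, and it is fully qualified once, up front, before any job is planned.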
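
    Similarly, a hedged, user-level illustration of the append-mode check (again not from the PR; the output path and column names are hypothetical): appending with partition columns that differ from the existing layout should end in the `AnalysisException` constructed above.

    ```scala
    import org.apache.spark.sql.{AnalysisException, SaveMode, SparkSession}

    object AppendPartitionMismatch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("demo").getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a", "2017"), (2, "b", "2018")).toDF("id", "name", "year")
        val out = "/tmp/partition-mismatch-demo" // hypothetical path

        // The first write establishes the table's partitioning: by `year`.
        df.write.partitionBy("year").parquet(out)

        try {
          // Appending with a different partitioning hits the mismatch branch.
          df.write.mode(SaveMode.Append).partitionBy("name").parquet(out)
        } catch {
          case e: AnalysisException =>
            println(s"Rejected as expected: ${e.getMessage}")
        }
        spark.stop()
      }
    }
    ```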