ulysses-you commented on code in PR #38939:
URL: https://github.com/apache/spark/pull/38939#discussion_r1042862929


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala:
##########
@@ -223,6 +224,19 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable
     doExecuteColumnar()
   }
 
+  /**
+   * Returns the result of writes as an RDD[WriterCommitMessage] variable by delegating to
+   * `doExecuteWrite` after preparations.
+   *
+   * Concrete implementations of SparkPlan should override `doExecuteWrite`.
+   */
+  def executeWrite(writeSpec: WriteSpec): RDD[WriterCommitMessage] = executeQuery {

Review Comment:
   Let me list what the current v1 write-files path requires (a simplified sketch follows the list):
   
   - `WriteJobDescription`: Hadoop job (Hadoop conf), fileFormat, outputSpec, partitionColumns, bucketSpec, options, statsTrackers
   - `FileCommitProtocol`: output path, dynamic-partition-overwrite flag
   - `ConcurrentOutputWriterSpec`: requiredOrdering, bucketSpec, physical sort plan
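   
   To make those three concrete, here is a much-simplified sketch. The `Simple*` names are mine and the field lists are abbreviated to the ones above; they are stand-ins for the real classes, not their exact definitions:
   ```scala
   import org.apache.hadoop.mapreduce.Job
   import org.apache.spark.sql.catalyst.catalog.BucketSpec
   import org.apache.spark.sql.catalyst.expressions.{Attribute, SortOrder}
   import org.apache.spark.sql.execution.SparkPlan
   import org.apache.spark.sql.execution.datasources.{FileFormat, WriteJobStatsTracker}
   
   // Stand-in for WriteJobDescription: the per-job write configuration.
   case class SimpleWriteJobDescription(
       hadoopJob: Job,                   // carries the Hadoop conf
       fileFormat: FileFormat,
       outputPath: String,               // stand-in for OutputSpec
       partitionColumns: Seq[Attribute],
       bucketSpec: Option[BucketSpec],
       options: Map[String, String],
       statsTrackers: Seq[WriteJobStatsTracker])
   
   // FileCommitProtocol itself already carries the output path and the
   // dynamic-partition-overwrite flag (both are passed to
   // FileCommitProtocol.instantiate).
   
   // Stand-in for ConcurrentOutputWriterSpec: it needs the physical sort
   // plan, which is why it cannot live on a logical node.
   case class SimpleConcurrentOutputWriterSpec(
       requiredOrdering: Seq[SortOrder],
       bucketSpec: Option[BucketSpec],
       sortPlan: SparkPlan)
   ```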
   
   Judging from the existing datasource v1 write command, `WriteFiles` should hold at least: FileFormat, OutputSpec, partitionColumns, bucketSpec, options, and requiredOrdering.
   ```scala
   case class InsertIntoHadoopFsRelationCommand(
       outputPath: Path,
       staticPartitions: TablePartitionSpec,
       ifPartitionNotExists: Boolean,
       partitionColumns: Seq[Attribute],
       bucketSpec: Option[BucketSpec],
       fileFormat: FileFormat,
       options: Map[String, String],
       query: LogicalPlan,
       mode: SaveMode,
       catalogTable: Option[CatalogTable],
       fileIndex: Option[FileIndex],
       outputColumnNames: Seq[String])
   ```
   
   Since we cannot get the physical plan on the logical side, and `ConcurrentOutputWriterSpec` depends on the physical plan, it should be held in `WriteFilesSpec`.
   
   `FileCommitProtocol` should also be held in `WriteFilesSpec`, because `WriteFiles` only does the task-side part of the pipeline `setup job -> setup task -> commit task -> commit job` (see the sketch below). The same reasoning applies to statsTrackers.
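   
   To make the split explicit, here is a schematic sketch. The `CommitPipelineSketch` object and its `runJob`/`runTask` helpers are hypothetical, but the committer hooks are the real `FileCommitProtocol` methods:
   ```scala
   import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
   import org.apache.spark.internal.io.FileCommitProtocol
   import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
   
   object CommitPipelineSketch {
     // Driver side: the write command owns the job-level half of the
     // pipeline, so it must own the committer.
     def runJob(
         committer: FileCommitProtocol,
         job: Job,
         runTasks: () => Seq[TaskCommitMessage]): Unit = {
       committer.setupJob(job)              // driver, before any task starts
       try {
         val messages = runTasks()          // each task runs runTask below
         committer.commitJob(job, messages) // driver, after all tasks succeed
       } catch {
         case t: Throwable =>
           committer.abortJob(job)
           throw t
       }
     }
   
     // Task side: this is the only part `WriteFiles` would run.
     def runTask(
         committer: FileCommitProtocol,
         taskContext: TaskAttemptContext): TaskCommitMessage = {
       committer.setupTask(taskContext)
       // ... write rows via the OutputWriter, feeding the statsTrackers ...
       committer.commitTask(taskContext)    // this task's commit message
     }
   }
   ```
   The point is that `commitJob` needs all the task commit messages, so the committer's lifetime spans the whole job, not a single `WriteFiles` task.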
   
   Given how the Hadoop job is used (`FileCommitProtocol.setupJob` takes it), I tend to make `WriteFilesSpec` hold the Hadoop job and the Hadoop conf.
   
   -------------
   In sum:
   
   `WriteFiles`: FileFormat, OutputSpec, partitionColumns, bucketSpec, options and requiredOrdering.
   `WriteFilesSpec`: FileCommitProtocol, statsTrackers, ConcurrentOutputWriterSpec, hadoop job and hadoop conf.
   
   Note: the above does not consider implementation difficulty; it is just the split at the semantic level. A rough sketch of that split follows.
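   
   Put into code, the split above would look roughly like this. This is a sketch only: the names come from this discussion, the exact field types are my assumptions, and in the real change `WriteFiles` would extend a logical `UnaryNode`:
   ```scala
   import org.apache.hadoop.mapreduce.Job
   import org.apache.spark.internal.io.FileCommitProtocol
   import org.apache.spark.sql.catalyst.catalog.BucketSpec
   import org.apache.spark.sql.catalyst.expressions.{Attribute, SortOrder}
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   import org.apache.spark.sql.execution.datasources.{FileFormat, WriteJobStatsTracker}
   import org.apache.spark.sql.execution.datasources.FileFormatWriter.{ConcurrentOutputWriterSpec, OutputSpec}
   
   // Everything known at logical planning time lives on the node.
   case class WriteFiles(
       query: LogicalPlan,
       fileFormat: FileFormat,
       outputSpec: OutputSpec,
       partitionColumns: Seq[Attribute],
       bucketSpec: Option[BucketSpec],
       options: Map[String, String],
       requiredOrdering: Seq[SortOrder])
   
   // Job-level and physical-plan-dependent pieces are passed down at
   // execution time instead of being stored on the logical node.
   case class WriteFilesSpec(
       committer: FileCommitProtocol,
       statsTrackers: Seq[WriteJobStatsTracker],
       concurrentWriterSpec: Option[ConcurrentOutputWriterSpec],
       hadoopJob: Job) // the Job object carries the Hadoop Configuration
   ```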
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
