[GitHub] spark pull request #21200: [SPARK-24039][SS] Do continuous processing writes...

jose-torres Tue, 01 May 2018 21:40:09 -0700

Github user jose-torres commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21200#discussion_r185391845
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousWriteRDD.scala ---
    @@ -0,0 +1,80 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.streaming.continuous
    +
    +import org.apache.spark.{Partition, SparkEnv, TaskContext}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask.{logError, logInfo}
    +import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory, WriterCommitMessage}
    +import org.apache.spark.util.Utils
    +
    +class ContinuousWriteRDD(var prev: RDD[InternalRow], writeTask: DataWriterFactory[InternalRow])
    +    extends RDD[Unit](prev) {
    +
    +  override val partitioner = prev.partitioner
    +
    +  override def getPartitions: Array[Partition] = prev.partitions
    +
    +  override def compute(split: Partition, context: TaskContext): Iterator[Unit] = {
    +    val epochCoordinator = EpochCoordinatorRef.get(
    +      context.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
    +      SparkEnv.get)
    +    var currentEpoch = context.getLocalProperty(ContinuousExecution.START_EPOCH_KEY).toLong
    +
    +    do {
    +      var dataWriter: DataWriter[InternalRow] = null
    +      // write the data and commit this writer.
    +      Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
    +        try {
    +          val dataIterator = prev.compute(split, context)
    +          dataWriter = writeTask.createDataWriter(
    +            context.partitionId(), context.attemptNumber(), currentEpoch)
    +          while (dataIterator.hasNext) {
    +            dataWriter.write(dataIterator.next())
    +          }
    +          logInfo(s"Writer for partition ${context.partitionId()} " +
    +            s"in epoch $currentEpoch is committing.")
    +          val msg = dataWriter.commit()
    +          epochCoordinator.send(
    +            CommitPartitionEpoch(context.partitionId(), currentEpoch, msg)
    +          )
    +          logInfo(s"Writer for partition ${context.partitionId()} " +
    +            s"in epoch $currentEpoch committed.")
    +          currentEpoch += 1
    --- End diff --
    
    Both nodes track currentEpoch independently. This is required because eventually they won't always be on the same machine.
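
    To illustrate the point, here is a minimal sketch of independent epoch tracking. It assumes both sides start from the same start epoch and observe the same epoch boundaries. Every name in it (Event, DataRow, EpochMarker, EpochTracker, IndependentEpochs) is hypothetical and not taken from the Spark codebase.

```scala
// Hypothetical sketch: these types are illustrative, not Spark APIs.
sealed trait Event
final case class DataRow(value: Int) extends Event
case object EpochMarker extends Event

// Each node owns a private counter seeded from the same start epoch.
// No state is shared between trackers, so they can live on different machines.
final class EpochTracker(startEpoch: Long) {
  private var current: Long = startEpoch

  def currentEpoch: Long = current

  // Advance the local counter only when an epoch boundary is observed.
  def observe(event: Event): Unit = event match {
    case EpochMarker => current += 1
    case _: DataRow  => ()
  }
}

object IndependentEpochs {
  // Feed the same event stream to two separate trackers and return
  // the epoch each one ends up at.
  def run(startEpoch: Long, stream: Seq[Event]): (Long, Long) = {
    val firstNode = new EpochTracker(startEpoch)
    val secondNode = new EpochTracker(startEpoch)
    stream.foreach { e =>
      firstNode.observe(e)
      secondNode.observe(e)
    }
    (firstNode.currentEpoch, secondNode.currentEpoch)
  }
}
```

    Because each tracker derives its epoch only from the start value and the boundaries it sees, the two counters agree without any communication between them.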
