[GitHub] spark pull request: [SPARK-4964][Streaming][Kafka] More updates to...

tdas Sun, 08 Feb 2015 03:11:47 -0800

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4384#discussion_r24272277
  
    --- Diff: 
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala 
---
    @@ -179,121 +182,190 @@ object KafkaUtils {
           errs => throw new SparkException(errs.mkString("\n")),
           ok => ok
         )
    -    new KafkaRDD[K, V, U, T, (K, V)](sc, kafkaParams, offsetRanges, 
leaders, messageHandler)
    +    new KafkaRDD[K, V, KD, VD, (K, V)](sc, kafkaParams, offsetRanges, 
leaders, messageHandler)
       }
     
    -  /** A batch-oriented interface for consuming from Kafka.
    -   * Starting and ending offsets are specified in advance,
    -   * so that you can control exactly-once semantics.
    +  /**
    +   * :: Experimental ::
    +   * Create a RDD from Kafka using offset ranges for each topic and 
partition. This allows you
    +   * specify the Kafka leader to connect to (to optimize fetching) and 
access the message as well
    +   * as the metadata.
    +   *
        * @param sc SparkContext object
        * @param kafkaParams Kafka <a 
href="http://kafka.apache.org/documentation.html#configuration";>
    -   * configuration parameters</a>.
    -   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set 
with Kafka broker(s),
    -   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
    +   *    configuration parameters</a>. Requires "metadata.broker.list" or 
"bootstrap.servers"
    +   *    to be set with Kafka broker(s) (NOT zookeeper servers) specified in
    +   *    host1:port1,host2:port2 form.
        * @param offsetRanges Each OffsetRange in the batch corresponds to a
        *   range of offsets for a given Kafka topic/partition
        * @param leaders Kafka leaders for each offset range in batch
    -   * @param messageHandler function for translating each message into the 
desired type
    +   * @param messageHandler function for translating each message and 
metadata into the desired type
        */
       @Experimental
       def createRDD[
         K: ClassTag,
         V: ClassTag,
    -    U <: Decoder[_]: ClassTag,
    -    T <: Decoder[_]: ClassTag,
    -    R: ClassTag] (
    +    KD <: Decoder[K]: ClassTag,
    +    VD <: Decoder[V]: ClassTag,
    +    R: ClassTag](
           sc: SparkContext,
           kafkaParams: Map[String, String],
           offsetRanges: Array[OffsetRange],
           leaders: Array[Leader],
           messageHandler: MessageAndMetadata[K, V] => R
    -  ): RDD[R] = {
    -
    +    ): RDD[R] = {
         val leaderMap = leaders
           .map(l => TopicAndPartition(l.topic, l.partition) -> (l.host, 
l.port))
           .toMap
    -    new KafkaRDD[K, V, U, T, R](sc, kafkaParams, offsetRanges, leaderMap, 
messageHandler)
    +    new KafkaRDD[K, V, KD, VD, R](sc, kafkaParams, offsetRanges, 
leaderMap, messageHandler)
       }
     
    +
       /**
    -   * This stream can guarantee that each message from Kafka is included in 
transformations
    -   * (as opposed to output actions) exactly once, even in most failure 
situations.
    +   * Create a RDD from Kafka using offset ranges for each topic and 
partition.
        *
    -   * Points to note:
    -   *
    -   * Failure Recovery - You must checkpoint this stream, or save offsets 
yourself and provide them
    -   * as the fromOffsets parameter on restart.
    -   * Kafka must have sufficient log retention to obtain messages after 
failure.
    -   *
    -   * Getting offsets from the stream - see programming guide
    +   * @param jsc JavaSparkContext object
    +   * @param kafkaParams Kafka <a 
href="http://kafka.apache.org/documentation.html#configuration";>
    +   *    configuration parameters</a>. Requires "metadata.broker.list" or 
"bootstrap.servers"
    +   *    to be set with Kafka broker(s) (NOT zookeeper servers) specified in
    +   *    host1:port1,host2:port2 form.
    +   * @param offsetRanges Each OffsetRange in the batch corresponds to a
    +   *   range of offsets for a given Kafka topic/partition
    +   */
    +  @Experimental
    +  def createRDD[K, V, KD <: Decoder[K], VD <: Decoder[V]](
    +      jsc: JavaSparkContext,
    +      keyClass: Class[K],
    +      valueClass: Class[V],
    +      keyDecoderClass: Class[KD],
    +      valueDecoderClass: Class[VD],
    +      kafkaParams: JMap[String, String],
    +      offsetRanges: Array[OffsetRange]
    +    ): JavaPairRDD[K, V] = {
    +    implicit val keyCmt: ClassTag[K] = ClassTag(keyClass)
    +    implicit val valueCmt: ClassTag[V] = ClassTag(valueClass)
    +    implicit val keyDecoderCmt: ClassTag[KD] = ClassTag(keyDecoderClass)
    +    implicit val valueDecoderCmt: ClassTag[VD] = 
ClassTag(valueDecoderClass)
    +    new JavaPairRDD(createRDD[K, V, KD, VD](
    +      jsc.sc, Map(kafkaParams.toSeq: _*), offsetRanges))
    +  }
    +
    +  /**
    +   * :: Experimental ::
    +   * Create a RDD from Kafka using offset ranges for each topic and 
partition. This allows you
    +   * specify the Kafka leader to connect to (to optimize fetching) and 
access the message as well
    +   * as the metadata.
        *
    -.  * Zookeeper - This does not use Zookeeper to store offsets.  For 
interop with Kafka monitors
    -   * that depend on Zookeeper, you must store offsets in ZK yourself.
    +   * @param jsc JavaSparkContext object
    +   * @param kafkaParams Kafka <a 
href="http://kafka.apache.org/documentation.html#configuration";>
    +   *    configuration parameters</a>. Requires "metadata.broker.list" or 
"bootstrap.servers"
    +   *    to be set with Kafka broker(s) (NOT zookeeper servers) specified in
    +   *    host1:port1,host2:port2 form.
    +   * @param offsetRanges Each OffsetRange in the batch corresponds to a
    +   *   range of offsets for a given Kafka topic/partition
    +   * @param leaders Kafka leaders for each offset range in batch
    +   * @param messageHandler function for translating each message and 
metadata into the desired type
    +   */
    +  @Experimental
    +  def createRDD[K, V, KD <: Decoder[K], VD <: Decoder[V], R](
    +      jsc: JavaSparkContext,
    +      keyClass: Class[K],
    +      valueClass: Class[V],
    +      keyDecoderClass: Class[KD],
    +      valueDecoderClass: Class[VD],
    +      recordClass: Class[R],
    +      kafkaParams: JMap[String, String],
    +      offsetRanges: Array[OffsetRange],
    +      leaders: Array[Leader],
    +      messageHandler: JFunction[MessageAndMetadata[K, V], R]
    +    ): JavaRDD[R] = {
    +    implicit val keyCmt: ClassTag[K] = ClassTag(keyClass)
    +    implicit val valueCmt: ClassTag[V] = ClassTag(valueClass)
    +    implicit val keyDecoderCmt: ClassTag[KD] = ClassTag(keyDecoderClass)
    +    implicit val valueDecoderCmt: ClassTag[VD] = 
ClassTag(valueDecoderClass)
    +    implicit val recordCmt: ClassTag[R] = ClassTag(recordClass)
    +    createRDD[K, V, KD, VD, R](
    +      jsc.sc, Map(kafkaParams.toSeq: _*), offsetRanges, leaders, 
messageHandler.call _)
    +  }
    +
    +  /**
    +   * :: Experimental ::
    +   * Create an input stream that pulls messages from a Kafka Broker. This 
stream can guarantee
    +   * that each message from Kafka is included in transformations exactly 
once (see points below).
        *
    -   * End-to-end semantics - This does not guarantee that any output 
operation will push each record
    -   * exactly once. To ensure end-to-end exactly-once semantics (that is, 
receiving exactly once and
    -   * outputting exactly once), you have to either ensure that the output 
operation is
    -   * idempotent, or transactionally store offsets with the output. See the 
programming guide for
    -   * more details.
    +   * Points to note:
    +   *  - No receivers: This stream does not use any receiver. It directly 
queries Kafka
    +   *  - Offsets: This does not use Zookeeper to store offsets. The 
consumed offsets are tracked
    +   *    by the stream itself. For interoperability with Kafka monitoring 
tools that depend on 
    +   *    Zookeeper, you have to update Kafka/Zookeeper yourself from the 
streaming application.
    +   *  - Failure Recovery: To recover from driver failures, you have to 
enable checkpointing
    --- End diff --
    
    Of course not! You can do all windowing (except reduceByKeyAndWindow with 
inverse functions) without checkpointing enabled. And lots of people use Spark 
Streaming without checkpointing, but use windows. 
    
    It is a small nuance, but here is how i reason it in my head. If you care 
about recovering stuff through driver failures, you HAVE to enable 
checkpointing first. Only then can you get driver fault tolerance for all 
operations, as long as your code does not change. So you get recoverability but 
not upgradability. If you also want upgradability across code and Spark 
versions, you can save offsets somewhere else and recover from that. Yes, for 
some limited cases, you can get recoverability using that mechanism. But that 
is delving in to too much details for any reader. So if there is ONE thing that 
a developer needs to know to get driver fault-tolerance, it is "enable 
checkpointing", and "not saving offsets somewhere else". 
    
    Does that make sense?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-4964][Streaming][Kafka] More updates to...

Reply via email to