Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1434#discussion_r15564189
  
    --- Diff: extras/spark-kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.streaming.kinesis
    +
    +import java.net.InetAddress
    +import java.nio.ByteBuffer
    +import java.util.UUID
    +
    +import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
    +import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor
    +import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorFactory
    +import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    +import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
    +import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.streaming.receiver.Receiver
    +import org.apache.spark.streaming.util.SystemClock
    +
    +/**
    + * Custom AWS Kinesis-specific implementation of Spark Streaming's Receiver.
    + * This implementation relies on the Kinesis Client Library (KCL) Worker as described here:
    + * https://github.com/awslabs/amazon-kinesis-client
    + * This is a custom receiver used with StreamingContext.receiverStream(Receiver) as described here:
    + * http://spark.apache.org/docs/latest/streaming-custom-receivers.html
    + * Instances of this class will get shipped to the Spark Streaming Workers to run within a Spark Executor.
    + *
    + * @param app Kinesis application name
    + * @param stream Kinesis stream name
    + * @param endpoint url of the Kinesis service endpoint
    + * @param checkpointIntervalMillis interval (millis) for Kinesis checkpointing (not Spark checkpointing).
    + *   See the Kinesis Spark Streaming documentation for more details on the different types of checkpoints.
    + * @param initialPositionInStream in the absence of Kinesis checkpoint info, the worker's initial
    + *   starting position in the stream. The values are either the beginning of the stream per Kinesis'
    + *   limit of 24 hours (InitialPositionInStream.TRIM_HORIZON) or the tip of the stream using
    + *   InitialPositionInStream.LATEST.
    + * @param storageLevel persistence strategy for RDDs and DStreams.
    + */
    +private[streaming] class KinesisReceiver(
    +  app: String,
    +  stream: String,
    +  endpoint: String,
    +  checkpointIntervalMillis: Long,
    +  initialPositionInStream: InitialPositionInStream,
    +  storageLevel: StorageLevel)
    +  extends Receiver[Array[Byte]](storageLevel) with Logging { receiver =>
    +
    +  /**
    +   * The lazy vals below will get instantiated in the remote Executor after the closure is
    +   * shipped to the Spark Worker. These are all lazy because they come from third-party Amazon
    +   * libraries and are not Serializable. If they're not marked lazy, they will cause
    +   * NotSerializableExceptions when they're shipped to the Spark Worker.
    +   */
    +
    +  /**
    +   * workerId is lazy because we want the address of the actual Worker where the code runs -
    +   * not the Driver's ip address. This makes a difference when running in a cluster.
    +   */
    +  lazy val workerId = InetAddress.getLocalHost.getHostAddress() + ":" + UUID.randomUUID()
    +
    +  /**
    +   * This impl uses the DefaultAWSCredentialsProviderChain per the following url:
    +   * http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html
    +   * and searches for credentials in the following order of precedence:
    +   *   1) Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
    +   *   2) Java System Properties - aws.accessKeyId and aws.secretKey
    +   *   3) Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
    +   *   4) Instance profile credentials delivered through the Amazon EC2 metadata service
    +   */
    +  lazy val credentialsProvider = new DefaultAWSCredentialsProviderChain()
    +
    +  /** Create a KCL config instance. */
    +  lazy val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(app, stream, credentialsProvider, workerId)
    +    .withKinesisEndpoint(endpoint)
    +    .withInitialPositionInStream(initialPositionInStream)
    +    .withTaskBackoffTimeMillis(500)
    --- End diff --
    
    Hey, so this is probably not properly documented, but receivers can be started and stopped multiple times (for example, when receiver.restart(<Error>) is called). So creating all of these vals once using lazy, and then calling run/shutdown on them multiple times, is not the right approach. Instead, they should be created from scratch every time onStart() is called.
    
    See the pattern followed by [FlumeReceiver](https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala#L139).
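    
    To make that concrete, here's a rough, untested sketch of what that lifecycle could look like for this receiver. The `recordProcessorFactory` below is just a hypothetical stand-in for whatever IRecordProcessorFactory this PR ends up defining; everything else uses the KCL Worker API already imported in this file.
    
    ```scala
    private[streaming] class KinesisReceiver(
        app: String,
        stream: String,
        endpoint: String,
        checkpointIntervalMillis: Long,
        initialPositionInStream: InitialPositionInStream,
        storageLevel: StorageLevel)
      extends Receiver[Array[Byte]](storageLevel) with Logging {
    
      // Plain vars, not lazy vals: onStart() rebuilds these from scratch on
      // every (re)start, and onStop() tears them down. They are only ever
      // touched on the executor, so nothing here needs to be Serializable.
      @volatile private var worker: Worker = null
      @volatile private var workerThread: Thread = null
    
      // Hypothetical stand-in for the record-processor factory this PR defines.
      private def recordProcessorFactory: IRecordProcessorFactory = ???
    
      override def onStart() {
        val workerId = InetAddress.getLocalHost.getHostAddress() + ":" + UUID.randomUUID()
        val credentialsProvider = new DefaultAWSCredentialsProviderChain()
        val config = new KinesisClientLibConfiguration(app, stream, credentialsProvider, workerId)
          .withKinesisEndpoint(endpoint)
          .withInitialPositionInStream(initialPositionInStream)
          .withTaskBackoffTimeMillis(500)
        worker = new Worker(recordProcessorFactory, config)
        // Worker.run() blocks, so run it on its own thread.
        workerThread = new Thread() {
          override def run() { worker.run() }
        }
        workerThread.start()
      }
    
      override def onStop() {
        // Shut the KCL Worker down so the next onStart() gets a fresh one.
        if (worker != null) {
          worker.shutdown()
          worker = null
        }
        workerThread = null
      }
    }
    ```
    
    With this shape, no state survives across onStart()/onStop() cycles, so restart() is safe - same idea as the FlumeReceiver link above.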
    


