Spark Mesos Dispatcher

2015-07-19 Thread Jahagirdar, Madhu
All,

Can we run different versions of Spark using the same Mesos Dispatcher? For
example, can we run drivers with Spark 1.3 and Spark 1.4 at the same time?

Regards,
Madhu Jahagirdar


The information contained in this message may be confidential and legally 
protected under applicable law. The message is intended solely for the 
addressee(s). If you are not the intended recipient, you are hereby notified 
that any use, forwarding, dissemination, or reproduction of this message is 
strictly prohibited and may be unlawful. If you are not the intended recipient, 
please contact the sender by return e-mail and destroy all copies of the 
original message.


RE: Spark Mesos Dispatcher

2015-07-19 Thread Jahagirdar, Madhu
1.3 does not have the MesosDispatcher and does not support Mesos cluster
mode. Is it still possible to create a dispatcher using 1.4 and run 1.3
drivers through that dispatcher?

From: Jerry Lam [chiling...@gmail.com]
Sent: Monday, July 20, 2015 8:27 AM
To: Jahagirdar, Madhu
Cc: user; d...@spark.apache.org
Subject: Re: Spark Mesos Dispatcher

Yes.

Sent from my iPhone



Spark Drill 1.2.1 - error

2015-02-26 Thread Jahagirdar, Madhu
All,

We are getting the error below when using the Drill JDBC driver with Spark.
Please let us know what the issue could be.


java.lang.IllegalAccessError: class io.netty.buffer.UnsafeDirectLittleEndian cannot access its superclass io.netty.buffer.WrappedByteBuf
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:43)
        at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:68)
        at org.apache.drill.jdbc.DrillConnectionImpl.<init>(DrillConnectionImpl.java:91)
        at org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Connection.<init>(DrillJdbc41Factory.java:88)
        at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:57)
        at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:43)
        at org.apache.drill.jdbc.DrillFactory.newConnection(DrillFactory.java:51)
        at net.hydromatic.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:126)
        at java.sql.DriverManager.getConnection(DriverManager.java:571)
        at java.sql.DriverManager.getConnection(DriverManager.java:233)
        at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
        at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
        at org.apache.spark.rdd.JdbcRDD$$anon$1.<init>(JdbcRDD.scala:76)
        at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:73)
        at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:53)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
15/02/26 10:16:03 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalAccessError: class io.netty.buffer.UnsafeDirectLittleEndian cannot access its superclass io.netty.buffer.WrappedByteBuf
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:43)
        at org.apache.drill.exec.memory.TopLevelAllocator.<init>(TopLevelAllocator.java:68)
        at org.apache.drill.jdbc.DrillConnectionImpl.<init>(DrillConnectionImpl.java:91)
        at org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Connection.<init>(DrillJdbc41Factory.java:88)
        at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:57)
        at org.apache.drill.jdbc.DrillJdbc41Factory.newDrillConnection(DrillJdbc41Factory.java:43)
        at org.apache.drill.jdbc.DrillFactory.newConnection(DrillFactory.java:51)
        at net.hydromatic.avatica.UnregisteredDriver.connect(UnregisteredDriver.java:126)
        at java.sql.DriverManager.getConnection(DriverManager.java:571)
        at java.sql.DriverManager.getConnection(DriverManager.java:233)
        at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
        at $line13.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:17)
        at org.apache.spark.rdd.JdbcRDD$$anon$1.<init>(JdbcRDD.scala:76)
        at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:73)
        at org.apache.spark.rdd.JdbcRDD.compute(JdbcRDD.scala:53)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.scheduler.
Regards,
Madhu Jahagirdar



Re: Can we say 1 RDD is generated every batch interval?

2014-12-30 Thread Jahagirdar, Madhu
foreachRDD is applied once to the RDD generated for each batch, and the
operations inside it are executed for each of that RDD's partitions, I guess.
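
To make that concrete, here is a minimal sketch (not from this thread; the local master and the socket source on localhost:9999 are placeholder assumptions): one RDD is generated per batch interval, the foreachRDD body runs once per batch on the driver, and the per-partition work runs on the executors.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OneRddPerBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("OneRddPerBatch")
    val ssc = new StreamingContext(conf, Seconds(10))  // one RDD every 10 seconds

    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Invoked once for the single RDD generated in this batch interval.
      println("Batch RDD with " + rdd.partitions.length + " partitions")
      rdd.foreachPartition { partition =>
        // Runs once per partition, on the executors.
        partition.foreach(record => println(record))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}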

 On 29-Dec-2014, at 10:19 pm, SamyaMaiti samya.maiti2...@gmail.com wrote:

 Hi All,

 Please clarify.

 Can we say 1 RDD is generated every batch interval?

 If the above is true, then is the foreachRDD() operator executed one and only once for each batch?

 Regards,
 Sam



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-say-1-RDD-is-generated-every-batch-interval-tp20885.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.








RE: CheckPoint Issue with JsonRDD

2014-11-07 Thread Jahagirdar, Madhu
Michael, any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 2:36 PM
To: mich...@databricks.com; user
Subject: CheckPoint Issue with JsonRDD


CheckPoint Issue with JsonRDD

2014-11-06 Thread Jahagirdar, Madhu
When we enable checkpointing and use jsonRDD we get the following error. Is
this a bug?


Exception in thread "main" java.lang.NullPointerException
        at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
        at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
        at org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
        at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
        at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
        at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

=

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}


object SparkStreamingToParquet extends Logging {


  /**
   *
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    if (args.length < 4) {
      logInfo("Please provide valid parameters: hdfsFilesLocation: " +
        "hdfs://ip:8020/user/hdfs/--/ IMPALAtableloc hdfs://ip:8020/user/hive/--/ " +
        "tablename checkpointDir")
      logInfo("make sure you give the full folder path with '/' at the end, i.e. /user/hdfs/abc/")
      System.exit(1)
    }
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)

    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })

    jssc.start()
    jssc.awaitTermination()
  }


  def createContext(args: Array[String]): StreamingContext = {

    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)

    val sparkConf: SparkConf = new SparkConf()
      .setAppName("Json to Parquet")
      .set("spark.cores.max", "3")

    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))

    val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)

    hivecontext
      .createParquetFile[Person](IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf)
      .registerTempTable(TEMP_TABLE_NAME)

    val schemaString = "name age"
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)

    textFileStream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val schRdd = hivecontext.jsonRDD(rdd, schema)
        logInfo("inserting into table: " + TEMP_TABLE_NAME)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}



case class Person(name:String, age:String) extends Serializable
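
A pattern sometimes suggested for this kind of NullPointerException after a restart from a checkpoint (a hedged sketch, not a confirmed fix from this thread) is to avoid closing over a HiveContext created in createContext and instead build one lazily, inside foreachRDD, from the RDD's own SparkContext. LazyHiveContext and JsonInsert are invented names, and jsonRDD is called here without an explicit schema only to keep the sketch short.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.dstream.DStream

object LazyHiveContext {
  // Created on first use after a (re)start, so nothing stale is captured
  // in the checkpointed DStream graph.
  @transient private var instance: HiveContext = _

  def getOrCreate(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

object JsonInsert {
  // The foreachRDD body obtains the context from the RDD's own
  // SparkContext instead of closing over one built on the driver.
  def insertJson(stream: DStream[String], tableName: String): Unit = {
    stream.foreachRDD { rdd =>
      if (rdd != null && rdd.count() > 0) {
        val hc = LazyHiveContext.getOrCreate(rdd.sparkContext)
        hc.jsonRDD(rdd).insertInto(tableName)
      }
    }
  }
}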

Regards,
Madhu jahagirdar




Dynamically InferSchema From Hive and Create parquet file

2014-11-05 Thread Jahagirdar, Madhu
Currently the createParquetFile method needs a bean class as one of the parameters:


javahiveContext.createParquetFile(XBean.class,
        IMPALA_TABLE_LOC, true, new Configuration())
    .registerTempTable(TEMP_TABLE_NAME);


Is it possible to dynamically infer the schema from Hive using the hive
context and the table name, and then pass that schema instead?
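
Not an answer from the list, but a hedged sketch of one way this might be approached (assuming a Spark version in which SchemaRDD exposes its schema as a StructType, and a Hive-registered table whose name is passed in as tableName): select zero rows from the table and reuse the resulting schema.

import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.types.StructType
import org.apache.spark.sql.hive.HiveContext

object HiveSchemaLookup {
  // Derive a StructType from an existing Hive table; it can then be fed
  // to schema-taking methods such as applySchema or jsonRDD.
  def schemaOf(sc: SparkContext, tableName: String): StructType = {
    val hiveContext = new HiveContext(sc)
    // LIMIT 0 returns no data but still carries the table's schema.
    hiveContext.sql("SELECT * FROM " + tableName + " LIMIT 0").schema
  }
}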


Regards.

Madhu Jahagirdar









Issue with Spark Twitter Streaming

2014-10-13 Thread Jahagirdar, Madhu
All,

We are using Spark Streaming to receive data from the Twitter stream. This is
running behind a proxy. We have made the following configuration inside Spark
Streaming for twitter4j to work behind the proxy.

def main(args: Array[String]) {
    val filters = Array("Modi")

    System.setProperty("twitter4j.oauth.consumerKey", "*")
    System.setProperty("twitter4j.oauth.consumerSecret", "*")
    System.setProperty("twitter4j.oauth.accessToken", "*")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "*")
    System.setProperty("twitter4j.http.proxyHost", "X.X.X.X")
    System.setProperty("twitter4j.http.proxyPort", "")  // port value missing from the archived post
    System.setProperty("twitter4j.http.useSSL", "true")

    val conf = new SparkConf().setAppName("TwitterPopularTags")

    val ssc = new StreamingContext(conf, Seconds(60))
    val stream = TwitterUtils.createStream(ssc, None, filters)

    stream.print()

    ssc.start()
    ssc.awaitTermination()
  }

spark-streaming-twitter_2.10-1.1.0
twitter4j-core-3.0.3.jar
twitter4j-stream-3.0.3.jar


When the Spark job is run with local[2], on a single node and not on a
cluster, it is able to pull the data with the same settings above, and it
works like a charm behind the proxy.

The same code, when run on a cluster (below) on the same network with the
above settings, throws the error below. Not sure what is going wrong; any
help is appreciated. We checked the environment variables of the executors,
and all of the above system properties are set.

bin/spark-submit --class SparkTwitter2Kafka --master spark://IPADDRESS:7077 spark-twitter.jar

14/10/13 14:00:10 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error receiving tweets - connect timed out
Relevant discussions can be found on the Internet at:
http://www.google.co.jp/search?q=944a924a or
http://www.google.co.jp/search?q=24fd66dc
TwitterException{exceptionCode=[944a924a-24fd66dc 944a924a-24fd66b2], statusCode=-1, message=null, code=-1, retryAfter=-1, rateLimitStatus=null, version=3.0.5}
        at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:177)
        at twitter4j.internal.http.HttpClientWrapper.request(HttpClientWrapper.java:61)
        at twitter4j.internal.http.HttpClientWrapper.post(HttpClientWrapper.java:98)
        at twitter4j.TwitterStreamImpl.getFilterStream(TwitterStreamImpl.java:304)
        at twitter4j.TwitterStreamImpl$7.getStream(TwitterStreamImpl.java:292)
        at twitter4j.TwitterStreamImpl$TwitterStreamConsumer.run(TwitterStreamImpl.java:462)
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
        at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
        at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
        at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:250)
        at twitter4j.internal.http.HttpClientImpl.request(HttpClientImpl.java:135)
        ... 5 more
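
A possible difference between local[2] and the cluster (a guess, not a diagnosis confirmed in this thread) is that System.setProperty only affects the driver JVM, while the Twitter receiver runs inside an executor. Below is a hedged sketch of forwarding the same twitter4j properties to the executor JVMs through spark.executor.extraJavaOptions; the proxy host and port are placeholders, and the OAuth properties would need similar handling.

import org.apache.spark.SparkConf

object TwitterProxyConf {
  // Build a SparkConf whose executors start with the twitter4j proxy
  // settings passed as -D system properties.
  def withProxy(host: String, port: Int): SparkConf =
    new SparkConf()
      .setAppName("TwitterPopularTags")
      .set("spark.executor.extraJavaOptions",
        "-Dtwitter4j.http.proxyHost=" + host +
          " -Dtwitter4j.http.proxyPort=" + port +
          " -Dtwitter4j.http.useSSL=true")
}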








RE: Dstream Transformations

2014-10-06 Thread Jahagirdar, Madhu
Given that I have multiple worker nodes, when Spark schedules the job again
on the worker nodes that are alive, does it then again store the data in
Elasticsearch and then Flume, or does it only run the function that stores to Flume?

Regards,
Madhu Jahagirdar


From: Akhil Das [ak...@sigmoidanalytics.com]
Sent: Monday, October 06, 2014 1:20 PM
To: Jahagirdar, Madhu
Cc: user
Subject: Re: Dstream Transformations

AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker
nodes, and in that case, if one worker node goes down, Spark will try to
recompute the lost RDDs on the workers that are still alive.

Thanks
Best Regards

On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu
madhu.jahagir...@philips.com wrote:
In my Spark Streaming program I have created Kafka utils to receive data and
store it in Elasticsearch and in Flume. The storing functions are applied on
the same DStream. My question: what is the behavior of Spark if, after storing
data in Elasticsearch, the worker node dies before storing in Flume? Does it
restart the worker and then again store the data in Elasticsearch and then
Flume, or does it only run the function that stores to Flume?

Regards
Madhu Jahagirdar






RE: Dstream Transformations

2014-10-06 Thread Jahagirdar, Madhu
Doesn't Spark keep track of the DAG lineage and start from where it stopped?
Does it always have to start from the beginning of the lineage when the job
fails?


From: Massimiliano Tomassi [max.toma...@gmail.com]
Sent: Monday, October 06, 2014 2:40 PM
To: Jahagirdar, Madhu
Cc: Akhil Das; user
Subject: Re: Dstream Transformations

From the Spark Streaming Programming Guide 
(http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node):

...output operations (like foreachRDD) have at-least once semantics, that is, 
the transformed data may get written to an external entity more than once in 
the event of a worker failure.

I think that when a worker fails, the entire graph of transformations/actions
will be reapplied to that RDD. This means that, in your case, both storing
operations will be executed again. For this reason, in a video I watched on
YouTube, they suggest making all output operations idempotent. Obviously this
is not always possible, unfortunately: e.g. when you are building an analytics
system and you need to increment counters.

This is what I've got so far; anyone have a different point of view?
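
To illustrate the idempotency point (a sketch, not code from this thread): if every record carries a stable unique id and the sink offers put/upsert semantics keyed on that id, then replaying a batch after a worker failure rewrites the same keys instead of appending duplicates. Event, eventId and KeyValueSink are invented names.

import org.apache.spark.streaming.dstream.DStream

case class Event(eventId: String, payload: String)

// Stand-in for an external store with upsert-by-key semantics.
trait KeyValueSink extends Serializable {
  def put(key: String, value: String): Unit
}

object IdempotentOutput {
  def write(events: DStream[Event], sink: KeyValueSink): Unit = {
    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Re-running this after a failure writes the same keys again,
        // so the at-least-once replay does not duplicate data.
        partition.foreach(e => sink.put(e.eventId, e.payload))
      }
    }
  }
}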


--

Massimiliano Tomassi

web: http://about.me/maxtomassi
e-mail: max.toma...@gmail.com



hdfs short circuit

2014-07-03 Thread Jahagirdar, Madhu
Can I enable Spark to use the dfs.client.read.shortcircuit property to improve
performance and read natively on local nodes instead of going through the HDFS API?
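
As far as I know, Spark reads HDFS through the standard Hadoop client, so short-circuit local reads are an HDFS client/DataNode configuration rather than a Spark feature; usually they are simply enabled in the hdfs-site.xml visible under HADOOP_CONF_DIR. A hedged sketch of setting the client-side properties programmatically (the domain socket path and the input path are placeholders, and the DataNodes must be configured to match):

import org.apache.spark.{SparkConf, SparkContext}

object ShortCircuitReads {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShortCircuitReads"))

    // Client-side switches for HDFS short-circuit local reads; they only
    // take effect if the DataNodes expose the same domain socket.
    sc.hadoopConfiguration.set("dfs.client.read.shortcircuit", "true")
    sc.hadoopConfiguration.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket")

    // Reads issued by Hadoop-based RDDs can now short-circuit whenever a
    // block is local to the executor reading it.
    println(sc.textFile("hdfs:///some/path").count())
    sc.stop()
  }
}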

