Debugging which RDD has which replication level

2016-01-23 Thread gaurav sharma
Hi All, I have enabled replication for my RDDs. On the Storage tab of the Spark UI I can see the replication level, 2x or 1x, but the names shown are mappedRDD, shuffledRDD, so I am not able to tell which of my RDDs is 2x replicated and which one is 1x. Please help. Regards, Gaurav
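
A minimal sketch of how to make the Storage tab identifiable: RDD.setName replaces the generic class-name label (mappedRDD, shuffledRDD) with your own, and the StorageLevel chosen at persist time determines the 1x/2x shown there. The input path and transformations below are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object NamedRddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("named-rdds").setMaster("local[2]"))

    // setName makes the Storage tab show this label instead of the
    // generic mappedRDD / shuffledRDD class name.
    val words = sc.textFile("input.txt")            // hypothetical input path
      .flatMap(_.split("\\s+"))
      .setName("words")
      .persist(StorageLevel.MEMORY_ONLY_2)          // shown as 2x replicated

    val counts = words.map((_, 1)).reduceByKey(_ + _)
      .setName("wordCounts")
      .persist(StorageLevel.MEMORY_ONLY)            // shown as 1x

    counts.count()                                  // force materialisation
    sc.stop()
  }
}
```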

Spark RDD DAG behaviour understanding in case of checkpointing

2016-01-23 Thread gaurav sharma
Hi Tathagata/Cody, I am facing a challenge in production with DAG behaviour during checkpointing in Spark Streaming. Step 1: read data from Kafka every 15 min; call this KafkaStreamRDD, ~100 GB of data. Step 2: repartition KafkaStreamRdd from 5 to 100 partitions to parallelise processing …
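
A sketch of the pipeline being described, written against the Spark 1.x Kafka 0.8 direct API; the broker, topic, and checkpoint path are hypothetical:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-15min")
val ssc = new StreamingContext(conf, Minutes(15))        // 15-minute batches
ssc.checkpoint("hdfs:///checkpoints/kafka-15min")        // hypothetical path

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))                       // hypothetical topic

// Step 2: widen from the topic's 5 partitions to 100 to parallelise processing.
val repartitioned = stream.repartition(100)
repartitioned.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()
```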

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-23 Thread gaurav sharma
… AM, Augustus Hong wrote: > How did you specify the number of cores each executor can use? > > Be sure to use this when submitting jobs with spark-submit: *--total-executor-cores 100.* > > Other options won't work, in my experience. > > On Fri, Oct 23, 2015 at …
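
The flag caps the total cores an application takes from a standalone cluster, which is what lets a second job be scheduled at all. The same cap can be set programmatically via spark.cores.max, the config behind --total-executor-cores; a minimal sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to `spark-submit --total-executor-cores 4`:
// cap the cores this app grabs so a second app can still be scheduled.
val conf = new SparkConf()
  .setAppName("job-a")
  .set("spark.cores.max", "4")   // leave the remaining cores free
val sc = new SparkContext(conf)
```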

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-23 Thread gaurav sharma
Hi, I created 2 workers on the same machine, each with 4 cores and 6 GB RAM. I submitted the first job, and it allocated 2 cores on each of the worker processes and utilized the full 4 GB RAM of each executor process. When I submit my second job, it always stays in WAITING state. Cheers!! On Tue, Oct 20, 20…

Re: Worker Machine running out of disk for Long running Streaming process

2015-09-15 Thread gaurav sharma
…periodically (say every 10 mins) run System.gc() on the driver. >> The cleaning up of shuffles is tied to garbage collection. >> >> On Fri, Aug 21, 2015 at 2:59 AM, gaurav sharma wrote: >>> Hi All, >>> >>> I have a…
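
A sketch of the quoted advice for a long-running driver: Spark's ContextCleaner deletes shuffle files only after the driver garbage-collects the objects referencing them, so a periodic System.gc() forces that cleanup to happen:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// A long-running driver with little allocation pressure may rarely GC,
// so shuffle files pile up on the workers. Workaround sketch: force a
// GC on the driver every 10 minutes.
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(
  new Runnable { def run(): Unit = System.gc() },
  10, 10, TimeUnit.MINUTES)
```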

Worker Machine running out of disk for Long running Streaming process

2015-08-21 Thread gaurav sharma
Hi All, I have a 24x7 running streaming process, which runs on 2-hour windowed data. The issue I am facing is that my worker machines are running OUT OF DISK space. I checked that the SHUFFLE FILES are not getting cleaned up: /log/spark-2b875d98-1101-4e61-86b4-67c9e71954cc/executor-5bbb53c1-cee9-4438…

Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread gaurav sharma
Ideally the 2 messages read from Kafka must differ on some parameter at least, or else they are logically the same. As a solution to your problem, if the message content is the same, you could create a new field, a UUID, which might play the role of partition key while inserting the 2 messages in Cassandra. Msg1 -…
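
A sketch of the suggested fix using the DataStax spark-cassandra-connector; the keyspace, table, and schema here are hypothetical:

```scala
import java.util.UUID
import com.datastax.spark.connector._            // spark-cassandra-connector
import org.apache.spark.streaming.dstream.DStream

case class Event(id: UUID, payload: String)      // hypothetical schema

// Attach a fresh UUID to each message before the write, so two
// otherwise-identical messages land as distinct rows instead of the
// second upsert overwriting the first.
def save(stream: DStream[String]): Unit =
  stream.map(msg => Event(UUID.randomUUID(), msg))
    .foreachRDD(_.saveToCassandra("ks", "events", SomeColumns("id", "payload")))
```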

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread gaurav sharma
I have run into similar exceptions: ERROR DirectKafkaInputDStream: ArrayBuffer(java.net.SocketTimeoutException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([AdServe,1])) and the issue happened on the Kafka side, where my broker offsets go out of sync, or do not return l…
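
One common mitigation (not this thread's resolution, just a standard setting): when saved offsets fall outside what the broker still retains, the 0.8 direct stream's starting point can be controlled through auto.offset.reset in the Kafka params:

```scala
// Hedged sketch: when offsets go out of range (e.g. retention deleted
// them), auto.offset.reset controls where the direct stream restarts.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092",  // hypothetical brokers
  "auto.offset.reset"    -> "smallest")                   // restart from earliest retained offset
```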

Re: createDirectStream and Stats

2015-07-12 Thread gaurav sharma
Hi guys, I too am facing a similar challenge with the direct stream. I have 3 Kafka partitions and am running Spark on 18 cores, with parallelism level set to 48. I am running a simple map-reduce job on the incoming stream. Though the reduce stage takes milliseconds to seconds for around 15 million packets, the…
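
Worth noting for this setup: the direct stream creates exactly one RDD partition per Kafka partition, so only 3 tasks read the data regardless of the 48-way parallelism setting; a repartition after the read is what spreads the work. Per-batch stats can be read straight from the offset ranges, as in this sketch (`stream` is a direct stream as returned by KafkaUtils.createDirectStream):

```scala
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Offset ranges must be read from the first RDD of each batch,
// before any repartition/shuffle discards them.
def logBatchStats(stream: InputDStream[(String, String)]): Unit =
  stream.foreachRDD { rdd =>
    val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    ranges.foreach { r =>
      println(s"${r.topic}/partition ${r.partition}: ${r.untilOffset - r.fromOffset} records")
    }
  }
```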

Re: Worker dies with java.io.IOException: Stream closed

2015-07-12 Thread gaurav sharma
…write to /opt/ on that machine as it's the one machine always > throwing up. > > Thanks > Best Regards > > On Sat, Jul 11, 2015 at 11:18 PM, gaurav sharma > wrote: >> Hi All, >> >> I am facing this issue in my production environment. >> >> My worke…

Worker dies with java.io.IOException: Stream closed

2015-07-11 Thread gaurav sharma
Hi All, I am facing this issue in my production environment. My worker dies by throwing this exception, but I see that space is available on all the partitions of my disk. I did NOT see any abrupt increase in disk IO which might have choked the executor writing to the stderr file. But still m…

Re: How does one decide no of executors/cores/memory allocation?

2015-06-15 Thread gaurav sharma
When you submit a job, Spark breaks it down into stages, as per the DAG. The stages run transformations or actions on the RDDs. Each RDD consists of N partitions. The tasks created by Spark to execute a stage equal the number of partitions. Every task is executed on the cores utilized by…
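
A minimal illustration of that relationship, with arbitrary numbers:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partitions-demo").setMaster("local[4]"))

// 8 partitions -> the count() stage runs as 8 tasks, scheduled across
// however many cores the executors offer (here 4, so two waves).
val rdd = sc.parallelize(1 to 1000, numSlices = 8)
println(rdd.partitions.length)   // 8
rdd.count()
```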

Spark Streaming - Can i BIND Spark Executor to Kafka Partition Leader

2015-06-12 Thread gaurav sharma
Hi, I am using a Kafka/Spark cluster for a real-time aggregation analytics use case in production. *Cluster details* *6 nodes*, each node running 1 Spark and 1 Kafka process. Node1 -> 1 Master, 1 Worker, 1 Driver, 1 Kafka process; Nodes 2,3,4,5,6 -> 1 Worker process each

Re: How to pass arguments dynamically, that needs to be used in executors

2015-06-12 Thread gaurav sharma
…reference the broadcast variable in your workers. It will get > shipped once to all nodes in the cluster and can be referenced by them. > > HTH. > > -Todd > > On Thu, Jun 11, 2015 at 8:23 AM, gaurav sharma > wrote: >> Hi, >> >> I am using Kafka Spark cluste…
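
A minimal broadcast sketch along the lines of Todd's advice; the lookup map is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-demo").setMaster("local[2]"))

// The lookup table is shipped to each node once and then referenced
// read-only inside closures, instead of being serialised with every task.
val rates = sc.broadcast(Map("INR" -> 0.012, "EUR" -> 1.09))   // hypothetical lookup data

val amounts = sc.parallelize(Seq(("INR", 100.0), ("EUR", 20.0)))
val inUsd = amounts.map { case (ccy, amt) => amt * rates.value.getOrElse(ccy, 1.0) }
inUsd.collect().foreach(println)
```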

How to pass arguments dynamically, that needs to be used in executors

2015-06-11 Thread gaurav sharma
Hi, I am using a Kafka/Spark cluster for a real-time aggregation analytics use case in production. Cluster details: 6 nodes, each node running 1 Spark and 1 Kafka process. Node1 -> 1 Master, 1 Worker, 1 Driver, 1 Kafka process; Nodes 2,3,4,5,6 -> 1 Worker process each