Checkpointing calls the job twice?

2015-10-17 Thread jatinganhotra
Hi, I noticed that when you checkpoint a given RDD, the action ends up being performed twice; I can see 2 jobs being executed in the Spark UI. Example:
val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line =>
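That is the documented behavior for RDD checkpointing: the checkpoint is written by a second job that runs after the first action, and it recomputes the lineage unless the RDD is persisted first. A minimal sketch, reusing the paths above and assuming a count() as the action:

val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
logData.cache()           // persist first, so the checkpoint job reads cached partitions
logData.checkpoint()      // marks the RDD; the checkpoint is written after the first action
println(logData.count())  // job 1 runs the count; job 2 writes the checkpoint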

Re: HBase Spark Streaming giving error after restore

2015-10-17 Thread Amit Hora
Hi, Regrets for the delayed response; please find the full stack trace below:
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to org.apache.hadoop.hbase.client.Mutation at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:85) at

Re: HBase Spark Streaming giving error after restore

2015-10-17 Thread Aniket Bhatnagar
Can you try changing classOf[OutputFormat[String, BoxedUnit]] to classOf[OutputFormat[String, Put]] while configuring hconf? On Sat, Oct 17, 2015, 11:44 AM Amit Hora wrote: > Hi, > > Regrets for the delayed response > please find the full stack trace below > >
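A sketch of what the fixed configuration might look like, assuming the job was set up via setClass on the Hadoop configuration (the table name is a placeholder). TableOutputFormat writes Mutation values, so the declared value type must be Put (a Mutation) rather than BoxedUnit:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.OutputFormat

val hconf = HBaseConfiguration.create()
hconf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
hconf.setClass("mapreduce.job.outputformat.class",
  classOf[TableOutputFormat[String]],
  classOf[OutputFormat[String, Put]])   // was OutputFormat[String, BoxedUnit]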

repartition vs partitionBy

2015-10-17 Thread shahid qadri
Hi folks, I need to repartition a large set of data (around 300G). As I see it, some portions have large data (data skew). I have pairRDDs [({},{}),({},{}),({},{})]. What is the best way to solve the problem?

PySpark: breakdown application execution time and fine-tuning

2015-10-17 Thread saluc
Hello, I am using PySpark to develop my big-data application. I have the impression that most of my application's execution time is spent on infrastructure (distributing the code and the data in the cluster, IPC between the Python processes and the JVM) rather than on the computation itself.

Re: Spark on Mesos / Executor Memory

2015-10-17 Thread Bharath Ravi Kumar
David, Tom, Thanks for the explanation. This confirms my suspicion that the executor was holding on to memory regardless of tasks in execution once it expands to occupy memory in keeping with spark.executor.memory. There certainly is scope for improvement here, though I realize there will

Can I use Spark as an alternative to GemFire cache?

2015-10-17 Thread kali.tumm...@gmail.com
Hi All, can Spark be used as an alternative to GemFire cache? We use GemFire cache to save (cache) dimension data in memory, which is later used by our custom-made Java ETL tool. Can I do something like below? Can I cache an RDD in memory for a whole day? As far as I know, an RDD will get empty once
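On the lifetime question: a cached RDD does not empty on its own; it stays available for as long as the owning SparkContext lives, or until unpersist() is called. A minimal sketch, assuming a long-running driver process and a placeholder path:

import org.apache.spark.storage.StorageLevel

val dimensions = sc.textFile("/data/dimensions").persist(StorageLevel.MEMORY_ONLY)
dimensions.count()   // materialize the cache once, up front
// serve lookups from `dimensions` all day; memory is freed on unpersist() or app exit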

Re: How to have a single reference of a class in Spark Streaming?

2015-10-17 Thread Deenar Toraskar
Swetha, look at http://spark.apache.org/docs/latest/programming-guide.html#shared-variables. Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works
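A sketch of the broadcast-variable pattern from that section of the guide (the Lookup class is hypothetical): Spark ships one read-only copy to each executor rather than one copy per task, which gives a single reference per JVM.

case class Lookup(table: Map[String, String])        // hypothetical class to share
val lookup = sc.broadcast(Lookup(Map("a" -> "1")))   // one copy per executor
val resolved = sc.parallelize(Seq("a", "b")).map(k => lookup.value.table.get(k))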

Re: Spark on Mesos / Executor Memory

2015-10-17 Thread Bharath Ravi Kumar
To be precise, the MesosExecutorBackend's Xms & Xmx equal spark.executor.memory. So there's no question of expanding or contracting the memory held by the executor. On Sat, Oct 17, 2015 at 5:38 PM, Bharath Ravi Kumar wrote: > David, Tom, > > Thanks for the explanation. This

Re: repartition vs partitionBy

2015-10-17 Thread Raghavendra Pandey
You can use the coalesce function if you want to reduce the number of partitions. This one minimizes the data shuffle. -Raghav On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri wrote: > Hi folks > > I need to repartition a large set of data (around 300G). As I see it, some portions >
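For illustration, a sketch of the difference (partition counts are arbitrary):

val rdd = sc.parallelize(1 to 1000000, 1000)
val fewer = rdd.coalesce(100)          // narrows to fewer partitions without a full shuffle
val rebalanced = rdd.repartition(400)  // always shuffles; can redistribute skewed data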

Re: s3a file system and spark deployment mode

2015-10-17 Thread Raghavendra Pandey
You can add classpath info in the hadoop env file... Add the following line to your $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Add the following line to $SPARK_HOME/conf/spark-env.sh:
export

Re: Complex transformation on a dataframe column

2015-10-17 Thread Raghavendra Pandey
Here is a quick code sample I can come up with:
case class Input(ID: String, Name: String, PhoneNumber: String, Address: String)
val df = sc.parallelize(Seq(Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip"))).toDF()
val formatAddress = udf { (s: String) =>
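A possible completion of that sample, assuming the field order from the "houseNo:StreetNo:City:State:Zip" literal and the usual import sqlContext.implicits._ for toDF(); the output format is arbitrary:

import org.apache.spark.sql.functions.udf

case class Input(ID: String, Name: String, PhoneNumber: String, Address: String)
val df = sc.parallelize(Seq(
  Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip"))).toDF()

val formatAddress = udf { (s: String) =>
  val Array(houseNo, streetNo, city, state, zip) = s.split(":")
  s"$houseNo $streetNo, $city, $state $zip"   // reassemble in the desired layout
}
val formatted = df.withColumn("FormattedAddress", formatAddress(df("Address")))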

Re: repartition vs partitionBy

2015-10-17 Thread shahid ashraf
Yes, I know about that; it's for the case of reducing partitions. The point here is that the data is skewed to a few partitions. On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey <raghavendra.pan...@gmail.com> wrote: > You can use the coalesce function if you want to reduce the number of > partitions. This one

Re: Can I use Spark as an alternative to GemFire cache?

2015-10-17 Thread Ndjido Ardo Bar
Hi Kali, if I understand you well, Tachyon (http://tachyon-project.org) can be a good alternative. You can use the Spark API to load and persist data into Tachyon. Hope that helps. Ardo > On 17 Oct 2015, at 15:28, "kali.tumm...@gmail.com" > wrote: > > Hi All, >
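A sketch of what that could look like on Spark 1.x, where StorageLevel.OFF_HEAP was backed by Tachyon (the master URL and path are placeholders, and the exact config key varies across 1.x releases):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("tachyon-cache")
  .set("spark.externalBlockStore.url", "tachyon://tachyon-master:19998")
val sc = new SparkContext(conf)
val dims = sc.textFile("/data/dimensions").persist(StorageLevel.OFF_HEAP)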

Output println info in LogMessage Info?

2015-10-17 Thread kali.tumm...@gmail.com
Hi All, in Unix I can print a warning or info using LogMessage WARN "Hi All" or LogMessage INFO "Hello World". Is there a similar thing in Spark? Imagine I want to print the count of an RDD in the logs instead of using println. Thanks, Sri
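A minimal sketch of the usual approach, using the log4j API that Spark 1.x ships with (names are illustrative); messages land in the driver log at the chosen level instead of on stdout:

import org.apache.log4j.Logger

object MyJob {
  @transient lazy val log = Logger.getLogger(getClass.getName)

  def run(sc: org.apache.spark.SparkContext): Unit = {
    val rdd = sc.parallelize(1 to 100)
    log.info(s"RDD count: ${rdd.count()}")   // instead of println
    log.warn("Hi All")
  }
}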

Spark Streaming scheduler delay vs driver.cores

2015-10-17 Thread Adrian Tanase
Hi, I've recently bumped up the resources for a Spark Streaming job, and the performance started to degrade over time. It was running fine on 7 nodes with 14 executor cores each (via YARN) until I bumped executor.cores to 22 cores/node (out of 32 on AWS c3.xlarge, 24 for YARN). The driver has

Re: repartition vs partitionBy

2015-10-17 Thread Adrian Tanase
If the dataset allows it, you can try to write a custom partitioner to help Spark distribute the data more uniformly. On 17 Oct 2015, at 16:14, shahid ashraf wrote: Yes, I know about that; it's for the case of reducing partitions. The
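One way such a partitioner could look, as a sketch: pre-salt a known hot key so that identical keys spread across several partitions, and route everything else through a plain HashPartitioner ("hotkey", the salt width of 8, and the 400 partitions are placeholders; pairRDD is assumed to be an RDD[(String, V)] like the one in the original question).

import org.apache.spark.{HashPartitioner, Partitioner}

class SkewAwarePartitioner(partitions: Int) extends Partitioner {
  private val fallback = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case s: String if s.startsWith("hotkey#") =>   // salted form: "hotkey#<n>"
      s.split("#")(1).toInt % partitions           // deterministic per salted key
    case other => fallback.getPartition(other)
  }
}

val salted = pairRDD.map { case (k, v) =>
  (if (k == "hotkey") s"hotkey#${scala.util.Random.nextInt(8)}" else k, v)
}.partitionBy(new SkewAwarePartitioner(400))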

Should I convert JSON into Parquet?

2015-10-17 Thread Gavin Yue
I have JSON files containing timestamped events, each associated with a user id. Now I want to group by user id, converting from Event1 -> UserIDA; Event2 -> UserIDA; Event3 -> UserIDB; to intermediate storage: UserIDA -> (Event1, Event2...), UserIDB -> (Event3...). Then I will label
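A sketch of the conversion with Spark 1.5-era DataFrame APIs (paths and the userId column name are assumed): writing Parquet partitioned by user id co-locates each user's events on disk, so the later per-user grouping reads far less data than it would from raw JSON.

val events = sqlContext.read.json("/data/events")   // schema inferred from the JSON
events.write
  .partitionBy("userId")                            // one output directory per user id
  .parquet("/data/events_parquet")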

Re: Problem installing Spark on Windows 8

2015-10-17 Thread Marco Mistroni
Hi, still having issues installing Spark on Windows 8. The Spark web console runs successfully; I can run the Spark Pi example. However, when I run spark-shell I am getting the following exception: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should