Setting up Spark with YARN on EC2 cluster

2015-02-25 Thread Harika
Hi, I want to set up a Spark cluster with the YARN dependency on Amazon EC2. I was reading this document and I understand that Hadoop has to be set up for running Spark with YARN. My questions - 1. Do we have to set up a Hadoop cluster on EC2

What is the best way to run a Spark job in "yarn-cluster" mode from a Java program (servlet container) and NOT using the spark-submit command.

2015-02-25 Thread kshekhram
Hello Spark experts, I have tried reading the Spark documentation and searched many posts in this forum, but I couldn't find a satisfactory answer to my question. I have recently started using Spark, so I may be missing something, and that's why I am looking for your guidance here. I have a situation

Re: Spark cluster set up on EC2 customization

2015-02-25 Thread Akhil Das
You can easily add a function (say setup_pig) inside the function setup_cluster in this script Thanks Best Regards On Thu, Feb 26, 2015 at 7:08 AM, Sameer Tilak wrote: > Hi, > > I was looking at the documentation for deploying

RE: group by order by fails

2015-02-25 Thread Tridib Samanta
Actually I just realized, I am using 1.2.0. Thanks Tridib Date: Thu, 26 Feb 2015 12:37:06 +0530 Subject: Re: group by order by fails From: ak...@sigmoidanalytics.com To: tridib.sama...@live.com CC: user@spark.apache.org Which version of Spark are you using? It seems there was a similar Jira

Re: Number of parallel tasks

2015-02-25 Thread Akhil Das
Did you try setting .set("spark.cores.max", "20")? Thanks Best Regards On Wed, Feb 25, 2015 at 10:21 PM, Akshat Aranya wrote: > I have Spark running in standalone mode with 4 executors, and each > executor with 5 cores (spark.executor.cores=5). However, when I'm > processing an RDD with ~9
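A minimal sketch of the setting being suggested, assuming a standalone-mode application built around a SparkConf (the app name and values are illustrative only):

    import org.apache.spark.{SparkConf, SparkContext}

    // Ask the standalone master for up to 20 cores in total,
    // with 5 cores per executor as in the original question.
    val conf = new SparkConf()
      .setAppName("parallel-tasks-example")
      .set("spark.executor.cores", "5")
      .set("spark.cores.max", "20")
    val sc = new SparkContext(conf)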

Re: Scheduler hang?

2015-02-25 Thread Akhil Das
What operation are you trying to do and how big is the data that you are operating on? Here are a few things which you can try: - Repartition the RDD to a higher number than 222 - Specify the master as local[*] or local[10] - Use Kryo Serializer (.set("spark.serializer", "org.apache.spark.serialize
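A rough sketch of those suggestions in one place, assuming a SparkConf-based setup (the partition count, input path, and app name are illustrative, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("scheduler-hang-example")
      .setMaster("local[*]")  // or local[10]
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Repartition to more partitions than the original 222.
    val data = sc.textFile("input.txt").repartition(500)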

Re: group by order by fails

2015-02-25 Thread Akhil Das
Which version of Spark are you using? It seems there was a similar Jira https://issues.apache.org/jira/browse/SPARK-2474 Thanks Best Regards On Thu, Feb 26, 2015 at 12:03 PM, tridib wrote: > Hi, > I need to find the top 10 best-selling samples. So the query looks like: > select s.name, count(s.name)

Re: NegativeArraySizeException when doing joins on skewed data

2015-02-25 Thread Tristan Blakers
I get the same exception simply by doing a large broadcast of about 6GB. Note that I’m broadcasting a small number (~3m) of fat objects. There’s plenty of free RAM. This and related kryo exceptions seem to crop up whenever an object graph of more than a couple of GB gets passed around. at

Re: Re: Many Receiver vs. Many threads per Receiver

2015-02-25 Thread Tathagata Das
Spark Streaming has a new Kafka direct stream, to be released as an experimental feature in 1.3. It uses a low-level consumer. Not sure if it satisfies your purpose. If you want more control, it's best to create your own Receiver with the low-level Kafka API. TD On Tue, Feb 24, 2015 at 12:09 AM, b

group by order by fails

2015-02-25 Thread tridib
Hi, I need to find the top 10 best-selling samples, so the query looks like: select s.name, count(s.name) from sample s group by s.name order by count(s.name) This query fails with the following error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree: Sort [COUNT(name#0) ASC], true
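The replies point to SPARK-2474 for this failure. If a workaround is needed, one hypothetical rewrite (untested against 1.2.0; it assumes an existing SparkContext sc and a registered "sample" table, and a HiveContext may parse it more reliably) is to alias the aggregate and order/limit on the alias:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Order by the alias instead of repeating count(s.name) in the ORDER BY clause.
    val top10 = sqlContext.sql(
      """SELECT s.name, COUNT(s.name) AS cnt
        |FROM sample s
        |GROUP BY s.name
        |ORDER BY cnt DESC
        |LIMIT 10""".stripMargin)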

Re: Fwd: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Jim Kleckner
I created an issue and pull request. Discussion can continue there: https://issues.apache.org/jira/browse/SPARK-6029

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread anamika gupta
I am now getting the following error. I cross-checked my types and corrected three of them i.e. r26-->String, r27-->Timestamp, r28-->Timestamp. This error still persists. scala> sc.textFile("/home/cdhuser/Desktop/Sdp_d.csv").map(_.split(",")).map { r => | val upto_time = sdf.parse(r(23).trim)

Re: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Ted Yu
Maybe drop the exclusion for the parquet-provided profile? Cheers On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner wrote: > Inline > > On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu wrote: > >> Interesting. Looking at SparkConf.scala: >> >> val configs = Seq( >> DeprecatedConfig("spark.files.use

Fwd: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Jim Kleckner
Forwarding conversation below that didn't make it to the list. -- Forwarded message -- From: Jim Kleckner Date: Wed, Feb 25, 2015 at 8:42 PM Subject: Re: Spark excludes "fastutil" dependencies we need To: Ted Yu Cc: Sean Owen, user Inline On Wed, Feb 25, 2015 at 1:53 PM, Ted

Scheduler hang?

2015-02-25 Thread Victor Tso-Guillen
I'm getting this really reliably on Spark 1.2.1. Basically I'm in local mode with parallelism at 8. I have 222 tasks and I never seem to get far past 40. Usually in the 20s to 30s it will just hang. The last logging is below, and a screenshot of the UI. 2015-02-25 20:39:55.779 GMT-0800 INFO [task

Re: Executor lost with too many temp files

2015-02-25 Thread Raghavendra Pandey
Can you try increasing the ulimit -n on your machine? On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier wrote: > Hi Sameer, > > I’m still using Spark 1.1.1, I think the default is hash shuffle. No > external shuffle service. > > We are processing gzipped JSON files, the partitions are the amount

Re: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Jim Kleckner
Inline On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu wrote: > Interesting. Looking at SparkConf.scala : > > val configs = Seq( > DeprecatedConfig("spark.files.userClassPathFirst", > "spark.executor.userClassPathFirst", > "1.3"), > DeprecatedConfig("spark.yarn.user.classpath.fir

Help me understand the partition, parallelism in Spark

2015-02-25 Thread java8964
Hi, Sparkers: I come from the Hadoop MapReduce world, and am trying to understand some internal details of Spark. From the web and this list, I keep seeing people talk about increasing the parallelism if you get an OOM error. I have tried to read the documentation as much as possible to understand the RDD pa

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Tobias Pfeiffer
Hi, On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore < thanigai.vell...@gmail.com> wrote: > It appears that the function immediately returns even before the > foreachrdd stage is executed. Is that possible? > Sure, that's exactly what happens. foreachRDD() schedules a computation, it does not p

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Hi Reza, With 40 nodes and shuffle space managed by YARN over the HDFS usercache, we could run the similarity job without doing any thresholding... We used hash-based shuffle, and sort-based shuffle will hopefully improve it further... Note that this job was almost 6M x 1.5M. We will go towards 50M x ~3M columns and

Re: spark standalone with multiple executors in one work node

2015-02-25 Thread bit1...@163.com
My understanding is that if you run multiple applications on the worker node, then each application will have an ExecutorBackend process and an executor as well. bit1...@163.com From: Judy Nash Date: 2015-02-26 09:58 To: user@spark.apache.org Subject: spark standalone with multiple executors in one wor

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Thanigai Vellore
I didn't include the complete driver code but I do run the streaming context from the main program which calls this function. Again, I can print the RDD elements within the foreachRDD block but the array that is returned is always empty. It appears that the function immediately returns even before

spark standalone with multiple executors in one work node

2015-02-25 Thread Judy Nash
Hello, Does Spark standalone support running multiple executors in one worker node? It seems YARN has the parameter --num-executors to set the number of executors to deploy, but I do not find the equivalent parameter in Spark standalone. Thanks, Judy

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Tathagata Das
You are just setting up the computation here using foreachRDD. You have not even run the streaming context to get any data. On Wed, Feb 25, 2015 at 2:21 PM, Thanigai Vellore < thanigai.vell...@gmail.com> wrote: > I have this function in the driver program which collects the result from > rdds (in
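A sketch of the pattern being described, assuming an existing SparkContext sc, a 1-second batch interval, and a socket source (all names and the source are illustrative): foreachRDD only registers the computation, so nothing lands in the buffer until the context is started.

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import scala.collection.mutable.ArrayBuffer

    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // This only registers the per-batch computation; nothing runs yet.
    val results = ArrayBuffer[String]()
    lines.foreachRDD { rdd =>
      // Runs on the driver for each batch once the context is started.
      results.synchronized { results ++= rdd.collect() }
    }

    // Only after start() do batches execute and fill the buffer.
    ssc.start()
    ssc.awaitTermination()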

Re: throughput in the web console?

2015-02-25 Thread Tathagata Das
Yes. # tuples processed in a batch = sum of all the tuples received by all the receivers. In the screenshot, there was a batch with 69.9K records, and there was a batch which took 1 s 473 ms. These two batches may be the same batch or different batches. TD On Wed, Feb 25, 2015 at 10:11 AM, Josh J w

Spark cluster set up on EC2 customization

2015-02-25 Thread Sameer Tilak
Hi, I was looking at the documentation for deploying a Spark cluster on EC2. http://spark.apache.org/docs/latest/ec2-scripts.html We are using Pig to build the data pipeline and then use MLlib for analytics. I was wondering if someone has any experience including additional tools/services such

Re: Upgrade to Spark 1.2.1 using Guava

2015-02-25 Thread Marcelo Vanzin
Guava is not in Spark. (Well, long version: it's in Spark but it's relocated to a different package except for some special classes leaked through the public API.) If your app needs Guava, it needs to package Guava with it (e.g. by using maven-shade-plugin, or using "--jars" if only executors use
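A sketch of the packaging approach described above, using sbt-assembly as the sbt analogue of maven-shade-plugin (plugin and Guava versions, and the relocation target package, are illustrative assumptions, not from the thread):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")

    // build.sbt: bundle your own Guava and relocate it so it cannot
    // clash with the copy Spark relocates internally.
    libraryDependencies += "com.google.guava" % "guava" % "16.0.1"

    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
    )

Alternatively, if only the executors need Guava, passing the jar via --jars to spark-submit (as Marcelo notes) avoids re-packaging the application entirely.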

Upgrade to Spark 1.2.1 using Guava

2015-02-25 Thread Pat Ferrel
The root Spark pom has guava set at a certain version number. It’s very hard to read the shading xml. Someone suggested that I try using userClassPathFirst but that sounds too heavy-handed since I don’t really care which version of Guava I get; I'm not picky. When I set my project to use the same

RE: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive

2015-02-25 Thread Cheng, Hao
If that is the case, I suggest you use “order by” instead of “sort by” in Spark SQL if the sort result is important to you. If that is not the case (reducer count > 1), I don't see any reason that Spark SQL should output the same result as Hive does, as they have totally differe
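A short illustration of the distinction (assuming a HiveContext-backed sqlContext and a table t; names are made up): SORT BY only orders rows within each partition/reducer, while ORDER BY produces one total ordering.

    // Sorted within each partition only; the interleaving of partitions is
    // not defined, so results can legitimately differ from Hive's output.
    val partial = sqlContext.sql("SELECT key, value FROM t SORT BY key")

    // Total ordering across the whole result set.
    val total = sqlContext.sql("SELECT key, value FROM t ORDER BY key")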

RE: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive

2015-02-25 Thread Cheng, Hao
How many reducers did you set for Hive? With a small data set, Hive will run in local mode, which always sets the reducer count to 1. From: Kannan Rajah [mailto:kra...@maprtech.com] Sent: Thursday, February 26, 2015 3:02 AM To: Cheng Lian Cc: user@spark.apache.org Subject: Re: Spark-SQL 1.2.0 "sort

Error when running the terasort branch in a cluster

2015-02-25 Thread Tom
Not sure if this is the place to ask, but I am using the terasort branch of Spark for benchmarking, as found on https://github.com/ehiggs/spark/tree/terasort, and I get the error below when running on two machines (one machine works just fine). When looking at the code, listed below the error mess

Re: upgrade to Spark 1.2.1

2015-02-25 Thread Pat Ferrel
I pass in my own dependencies jar with the class in it when creating the context. I’ve verified that the jar is in the list and checked inside the jar to find Guava. This should work, right? So I must have made a mistake in my checking. On Feb 25, 2015, at 3:40 PM, Ted Yu wrote: Could this be caus

Re: upgrade to Spark 1.2.1

2015-02-25 Thread Ted Yu
Could this be caused by Spark using the shaded Guava jar? Cheers On Wed, Feb 25, 2015 at 3:26 PM, Pat Ferrel wrote: > Getting an error that confuses me. Running a largish app on a standalone > cluster on my laptop. The app uses a guava HashBiMap as a broadcast value. > With Spark 1.1.0 I simply re

Re: Standalone spark

2015-02-25 Thread boci
Thanks dude... I think I will spin up a docker container for integration testing -- Skype: boci13, Hangout: boci.b...@gmail.com On Thu, Feb 26, 2015 at 12:22 AM, Sean Owen

RE: spark sql: join sql fails after sqlCtx.cacheTable()

2015-02-25 Thread tridib
Using HiveContext solved it.

upgrade to Spark 1.2.1

2015-02-25 Thread Pat Ferrel
Getting an error that confuses me. Running a largish app on a standalone cluster on my laptop. The app uses a guava HashBiMap as a broadcast value. With Spark 1.1.0 I simply registered the class and its serializer with kryo like this: kryo.register(classOf[com.google.common.collect.HashBiMap
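For reference, a sketch of how that registration might look with the registerKryoClasses helper available since Spark 1.2 (this illustrates the registration only and does not by itself address the Guava shading issue discussed in the replies):

    import org.apache.spark.SparkConf
    import com.google.common.collect.HashBiMap

    val conf = new SparkConf()
      .setAppName("guava-broadcast-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the Guava class with Kryo so broadcasts of it serialize cleanly.
      .registerKryoClasses(Array[Class[_]](classOf[HashBiMap[String, String]]))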

Re: Standalone spark

2015-02-25 Thread Sean Owen
Yes, been on the books for a while ... https://issues.apache.org/jira/browse/SPARK-2356 That one just may always be a known 'gotcha' in Windows; it's kind of a Hadoop gotcha. I don't know that Spark 100% works on Windows and it isn't tested on Windows. On Wed, Feb 25, 2015 at 11:05 PM, boci wrote

Re: Standalone spark

2015-02-25 Thread boci
Thanks for your fast answer... on Windows it's not working, because Hadoop (surprise surprise) needs winutils.exe. Without this it's not working, but if you don't set the hadoop directory you simply get 15/02/26 00:03:16 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.I

job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-25 Thread Darin McBeath
I'm using Spark 1.2 on a stand-alone cluster on EC2. I have a cluster of 8 r3.8xlarge machines but limit the job to only 128 cores. I have also tried other things such as setting 4 workers per r3.8xlarge with 67GB each, but this made no difference. The job frequently fails at the end in this step (sa

Considering Spark for large data elements

2015-02-25 Thread Rob Sargent
I have an application which might benefit from Spark's distribution/analysis, but I'm worried about the size and structure of my data set. I need to perform several thousand simulations on a rather large data set and I need access to all the generated simulations. The data element is largely in

Re: Standalone spark

2015-02-25 Thread Sean Owen
Spark and Hadoop should be listed as 'provided' dependencies in your Maven or SBT build. But that should still make them available at compile time. On Wed, Feb 25, 2015 at 10:42 PM, boci wrote: > Hi, > > I have a little question. I want to develop a spark based application, but > spark depend to hadoop-cli
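A sketch of what the 'provided' scoping might look like in build.sbt (the artifact version is illustrative):

    // Spark (and the hadoop-client it pulls in) are supplied by the cluster
    // at runtime, but remain on the compile classpath.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"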

Standalone spark

2015-02-25 Thread boci
Hi, I have a little question. I want to develop a Spark-based application, but Spark depends on the hadoop-client library. I think it's not necessary (Spark standalone), so I excluded it from the sbt file... the result is interesting. My trait where I create the Spark context does not compile. The error: ... scal

Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Thanigai Vellore
I have this function in the driver program which collects the result from RDDs (in a stream) into an array and returns it. However, even though the RDDs (in the dstream) have data, the function is returning an empty array... What am I doing wrong? I can print the RDD values inside the foreachRDD call b

How to pass a org.apache.spark.rdd.RDD in a recursive function

2015-02-25 Thread dritanbleco
Hello, I am trying to pass an org.apache.spark.rdd.RDD table as a parameter to a recursive function. This table should be changed at each step of the recursion and cannot just be a global var. Need help :) Thank you
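A hedged sketch of one way such a recursion can look (the transformation and stopping condition are invented for illustration): passing the RDD as a parameter works because each transformation just builds a new RDD, and caching each step keeps the lineage from growing unbounded.

    import org.apache.spark.rdd.RDD

    // Recursively halve the values until the maximum drops below a threshold
    // or a maximum depth is reached.
    def refine(rdd: RDD[Double], depth: Int): RDD[Double] = {
      if (depth == 0 || rdd.max() < 1.0) rdd
      else {
        val next = rdd.map(_ / 2.0).cache()
        refine(next, depth - 1)
      }
    }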

Re: NullPointerException in ApplicationMaster

2015-02-25 Thread Zhan Zhang
Look at the trace again. It is a very weird error. The SparkSubmit is running on the client side, but YarnClusterSchedulerBackend is supposed to run in the YARN AM. I suspect you are running the cluster in yarn-client mode, but in JavaSparkContext you set "yarn-cluster". As a result, the spark contex

Re: Filter data from one RDD based on data from another RDD

2015-02-25 Thread Himanish Kushary
Hello Imran, Thanks for your response. I noticed the "intersection" and "subtract" methods on an RDD; do they work based on a hash of all the fields in an RDD record? - Himanish On Thu, Feb 19, 2015 at 6:11 PM, Imran Rashid wrote: > the more scalable alternative is to do a join (or a variant
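A small sketch of both flavours, assuming an existing SparkContext sc (the data is made up): subtract compares whole records via their equals/hashCode, while the key-based variant only compares the chosen key.

    val a = sc.parallelize(Seq(("x", 1), ("y", 2), ("z", 3)))
    val b = sc.parallelize(Seq(("y", 2)))

    // Whole-record comparison: keeps elements of a that do not appear in b.
    val diffWholeRecord = a.subtract(b)

    // Key-only comparison: drops records of a whose key appears in b,
    // regardless of the value.
    val diffByKey = a.subtractByKey(b)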

Re: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Ted Yu
Interesting. Looking at SparkConf.scala : val configs = Seq( DeprecatedConfig("spark.files.userClassPathFirst", "spark.executor.userClassPathFirst", "1.3"), DeprecatedConfig("spark.yarn.user.classpath.first", null, "1.3", "Use spark.{driver,executor}.userClassPathFi

Re: NullPointerException in ApplicationMaster

2015-02-25 Thread Zhan Zhang
Hi Mate, When you initialize the JavaSparkContext, you don’t need to specify the mode “yarn-cluster”. I suspect that is the root cause. Thanks. Zhan Zhang On Feb 25, 2015, at 10:12 AM, gulyasm wrote: JavaSparkContext.

Re: Hamburg Apache Spark Meetup

2015-02-25 Thread Petar Zecevic
Please add the Zagreb Meetup group, too. http://www.meetup.com/Apache-Spark-Zagreb-Meetup/ Thanks! On 18.2.2015. 19:46, Johan Beisser wrote: If you could also add the Hamburg Apache Spark Meetup, I'd appreciate it. http://www.meetup.com/Hamburg-Apache-Spark-Meetup/ On Tue, Feb 17, 2015 at 5

Re: throughput in the web console?

2015-02-25 Thread Otis Gospodnetic
Hi Josh, SPM will show you this info. I see you use Kafka, too, whose numerous metrics you can also see in SPM side by side with your Spark metrics. Sounds like trends is what you are after, so I hope this helps. See http://sematext.com/spm Otis > On Feb 24, 2015, at 11:59, Josh J wrote:

Re: Unable to run hive queries inside spark

2015-02-25 Thread Michael Armbrust
It looks like that is getting interpreted as a local path. Are you missing a core-site.xml file to configure hdfs? On Tue, Feb 24, 2015 at 10:40 PM, kundan kumar wrote: > Hi Denny, > > yes the user has all the rights to HDFS. I am running all the spark > operations with this user. > > and my hi

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Slim Baltagi
Hi all Here is another Spark talk (a vendor-independent one!) that you might have missed: 'The Future of Apache Hadoop' track: How Spark and Flink are shaping the future of Hadoop? https://hadoopsummit.uservoice.com/forums/283266-the-future-of-apache-hadoop/suggestions/7074410 Regards, Slim B

Re: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive

2015-02-25 Thread Kannan Rajah
Cheng, We tried this setting and it still did not help. This was on Spark 1.2.0. -- Kannan On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian wrote: > (Move to user list.) > > Hi Kannan, > > You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this > line of code >

Re: Getting to proto buff classes in Spark Context

2015-02-25 Thread necro351 .
Thanks for your response and suggestion, Sean. Setting "spark.files.userClassPathFirst" didn't fix the problem for me. I am not very familiar with the Spark and Scala environment, so please correct any incorrect assumptions or statements I make. However, I don't believe this to be a classpath visi

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Xiangrui Meng
Made 3 votes for each of the talks. Looking forward to seeing them at the Hadoop Summit :) -Xiangrui On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin wrote: > Hi all, > > The Hadoop Summit uses community choice voting to decide which talks to > feature. It would be great if the community could help vote for S

NullPointerException in ApplicationMaster

2015-02-25 Thread gulyasm
Hi all, I am trying to run a Spark Java application on EMR, but I keep getting NullPointerException from the Application master (spark version on EMR: 1.2). The stacktrace is below. I also tried to run the application on Hortonworks Sandbox (2.2) with spark 1.2, following the blogpost (http://hort

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Is the threshold valid only for tall skinny matrices? Mine is 6M x 1.5M and I made the sparsity pattern 100:1.5M... we would like to increase the sparsity pattern to 1000:1.5M. I am running 1.1 stable and I get random shuffle failures... maybe the 1.2 sort shuffle will help. I read in Reza's paper that ov

Re: Brodcast Variable updated from one transformation and used from another

2015-02-25 Thread Imran Rashid
Hi Yiannis, Broadcast variables are meant for *immutable* data. They are not meant for data structures that you intend to update. (It might *happen* to work when running in local mode, though I doubt it, and it would probably be a bug if it did. It will certainly not work when running on a cluster

Spark Standard Application to Test

2015-02-25 Thread danilopds
Hello, I am preparing some tests to execute in Spark in order to manipulate properties and check the variations in results. For this, I need to use a standard application in my environment, like the well-known apps for Hadoop: Terasort

Number of parallel tasks

2015-02-25 Thread Akshat Aranya
I have Spark running in standalone mode with 4 executors, each with 5 cores (spark.executor.cores=5). However, when I'm processing an RDD with ~90,000 partitions, I only get 4 parallel tasks. Shouldn't I be getting 4x5=20 parallel task executions?

NegativeArraySizeException when doing joins on skewed data

2015-02-25 Thread soila
I have been running into NegativeArraySizeExceptions when doing joins on data with very skewed key distributions in Spark 1.2.0. I found a previous post that mentioned that this exception arises when the size of the blocks spilled during the shuffle exceeds 2GB. The post recommended increasing the

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
By throughput, do you mean the number of events processed, etc.? The Streaming tab already has these statistics. Thanks Best Regards On Wed, Feb 25, 2015 at 9:59 PM, Josh J wrote: > > On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das > wrote: > >> For SparkStreaming applications, there

Re: throughput in the web console?

2015-02-25 Thread Josh J
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das wrote: > For SparkStreaming applications, there is already a tab called "Streaming" > which displays the basic statistics. Would I just need to extend this tab to add the throughput?

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
For SparkStreaming applications, there is already a tab called "Streaming" which displays the basic statistics. Thanks Best Regards On Wed, Feb 25, 2015 at 8:55 PM, Josh J wrote: > Let me ask like this, what would be the easiest way to display the > throughput in the web console? Would I need t

Re: throughput in the web console?

2015-02-25 Thread Josh J
Let me ask it like this: what would be the easiest way to display the throughput in the web console? Would I need to create a new tab and add the metrics? Any good or simple examples showing how this can be done? On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das wrote: > Did you have a look at > > > http

Re: Running multiple threads with same Spark Context

2015-02-25 Thread Yana Kadiyska
I am not sure if your issue is setting the FAIR mode correctly or something else, so let's start with the FAIR mode. Do you see the scheduler mode actually being set to FAIR? I have this line in spark-defaults.conf: spark.scheduler.allocation.file=/spark/conf/fairscheduler.xml Then, when I start my ap
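A sketch of the pieces involved, using the allocation file path from the thread; the app and pool names are illustrative and the pool must exist in fairscheduler.xml:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-scheduling-example")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/spark/conf/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread are assigned to the named pool.
    sc.setLocalProperty("spark.scheduler.pool", "pool1")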

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-25 Thread Mukesh Jha
My application runs fine for ~3/4 hours and then hits this issue. On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha wrote: > Hi Experts, > > My Spark Job is failing with below error. > > From the logs I can see that input-3-1424842351600 was added at 5:32:32 > and was never purged out of memory. Also

Re: Brodcast Variable updated from one transformation and used from another

2015-02-25 Thread Yiannis Gkoufas
What I think is happening is that the map operations are executed concurrently and the map operation in rdd2 has the initial copy of myObjectBroadcasted. Is there a way to apply the transformations sequentially? First materialize rdd1 and then rdd2. Thanks a lot! On 24 February 2015 at 18:49, Yiannis
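A hedged sketch of the sequential pattern being asked about, consistent with the reply that broadcasts are immutable (rdd1/rdd2 follow the thread; the reduce step assumes rdd1 is an RDD[String] and is purely illustrative): force rdd1 with an action, then create a fresh broadcast for rdd2 instead of mutating one in place.

    // Materialize rdd1 and compute whatever state rdd2 needs, on the driver.
    val stats = rdd1.map(_.length).reduce(_ + _)

    // Broadcast the *result* as a new, immutable broadcast variable.
    val statsBroadcast = sc.broadcast(stats)

    // rdd2's transformation reads the already-computed value.
    val rdd2Result = rdd2.map(x => (x, statsBroadcast.value))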

Number of Executors per worker process

2015-02-25 Thread Spico Florin
Hello! I've read the documentation about the Spark architecture, and I have the following questions: 1. How many executors can there be in a single worker process (JVM)? 2. Should I think of an executor like a Java thread executor where the pool size is equal to the number of given cores (set up by the SPARK

Re: method newAPIHadoopFile

2015-02-25 Thread patcharee
I tried val pairVarOriRDD = sc.newAPIHadoopFile(path, classOf[NetCDFFileInputFormat].asSubclass( classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[WRFIndex,WRFVariable]]), classOf[WRFIndex], classOf[WRFVariable], jobConf) The compiler does not compla

Spark NullPointerException

2015-02-25 Thread Máté Gulyás
Hi all, I am trying to run a Spark Java application on EMR, but I keep getting NullPointerException from the Application master (spark version on EMR: 1.2). The stacktrace is below. I also tried to run the application on Hortonworks Sandbox (2.2) with spark 1.2, following the blogpost (http://hort

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Sean Owen
Then every worker would have to hold the whole RDD in memory. That's got some significant drawbacks. As long as you are able to execute all tasks locally to their partition, any additional copies of the data don't help locality. And you need far less than N copies of the data for that in general.

How to efficiently control concurrent Spark jobs

2015-02-25 Thread Staffan
Hi, Is there a good (recommended) way to control and run multiple Spark jobs within the same application? My application is as follows: 1) Run one Spark job on a 'full' dataset, which then creates a few thousand RDDs containing sub-datasets from the complete dataset. Each of the sub-datas

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Yes. Effectively, could it avoid network transfers? Or put differently, would an option like persist(MEMORY_ALL) improve job speed by caching an RDD on every worker? > On 25.02.2015, at 11:42, Sean Owen wrote: > > If you mean, can both copies of the blocks be used for computations? > yes they

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Sean Owen
If you mean, can both copies of the blocks be used for computations? Yes, they can. On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier wrote: > Hi, > > just a quick question about calling persist with the _2 option. Is the 2x > replication only useful for fault tolerance, or will it also increase j

Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it also increase job speed by avoiding network transfers? Assuming I’m doing joins or other shuffle operations. Thanks --

Re: method newAPIHadoopFile

2015-02-25 Thread Sean Owen
OK, from the declaration you sent me separately: public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat public abstract class ArrayBasedFileInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat It looks like you do not declare any generic types that FileInputForm
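A sketch of the direction suggested here, assuming the user's classes (NetCDFFileInputFormat, WRFIndex, WRFVariable) and the existing path, jobConf, and sc from the thread; once the InputFormat hierarchy declares its key/value types, the asSubclass cast is no longer needed:

    // Assuming the Java InputFormat hierarchy declares its generic types, e.g.
    //   public abstract class ArrayBasedFileInputFormat
    //       extends FileInputFormat<WRFIndex, WRFVariable> { ... }
    // the Scala call can then state the key/value classes directly:
    val pairVarOriRDD = sc.newAPIHadoopFile(
      path,
      classOf[NetCDFFileInputFormat],
      classOf[WRFIndex],
      classOf[WRFVariable],
      jobConf)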

Re: method newAPIHadoopFile

2015-02-25 Thread patcharee
This is the declaration of my custom inputformat public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat public abstract class ArrayBasedFileInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat Best, Patcharee On 25. feb. 2015 10:15, patcharee wrote: Hi, I

Re: No executors allocated on yarn with latest master branch

2015-02-25 Thread Anders Arpteg
We're using the capacity scheduler, to the best of my knowledge. Unsure if multi resource scheduling is used, but if you know of an easy way to figure that out, then let me know. Thanks, Anders On Sat, Feb 21, 2015 at 12:05 AM, Sandy Ryza wrote: > Are you using the capacity scheduler or fifo sc

method newAPIHadoopFile

2015-02-25 Thread patcharee
Hi, I am new to Spark and Scala. I have a custom InputFormat (used before with MapReduce) and I am trying to use it in Spark. In the Java API (the syntax is correct): JavaPairRDD pairVarOriRDD = sc.newAPIHadoopFile( path, NetCDFFileInputFormat.class, WRFIndex.c

Re: Running multiple threads with same Spark Context

2015-02-25 Thread Harika Matha
Hi Yana, I tried running the program after setting the property "spark.scheduler.mode" to FAIR. But the result is the same as before. Are there any other properties that have to be set? On Tue, Feb 24, 2015 at 10:26 PM, Yana Kadiyska wrote: > It's hard to tell. I have not run this on EC2 but thi

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread Akhil Das
It says sdp_d not found; since it is a class, you need to instantiate it once, like: sc.textFile("derby.log").map(_.split(",")).map( r => { val upto_time = sdf.parse(r(23).trim); calendar.setTime(upto_time); val r23 = new java.sql.Timestamp(upto_time.getTime);

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread anamika gupta
The link has proved helpful. I have been able to load data, register it as a table and perform simple queries. Thanks Akhil!! Though I still look forward to knowing where I was going wrong with my previous technique of extending the Product interface to overcome the case class limit of 22 fields.

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread Petar Zecevic
I believe your class needs to be defined as a case class (as I answered on SO). On 25.2.2015. 5:15, anamika gupta wrote: Hi Akhil, I guess it skipped my attention. I would definitely give it a try. While I would still like to know what the issue is with the way I have created the schema? On

Re: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Sean Owen
No, we should not add fastutil back. It's up to the app to bring dependencies it needs, and that's how I understand this issue. The question is really, how to get the classloader visibility right. It depends on where you need these classes. Have you looked into spark.files.userClassPathFirst and sp

Re: spark streaming: stderr does not roll

2015-02-25 Thread Sean Owen
These settings don't control what happens to stderr, right? stderr is up to the process that invoked the driver to control. You may wish to configure log4j to log to files instead. On Wed, Nov 12, 2014 at 8:15 PM, Nguyen, Duc wrote: > I've also tried setting the aforementioned properties using >

RE: used cores are less than total no. of cores

2015-02-25 Thread Somnath Pandeya
Thanks Akhil, it was a simple fix which you suggested... I missed it... ☺ From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, February 25, 2015 12:48 PM To: Somnath Pandeya Cc: user@spark.apache.org Subject: Re: used cores are less then total no. of core You can set the following in

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
Did you have a look at https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener And for Streaming: https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener Thanks Best Regards On Tue, Feb 24, 2015
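A sketch of a listener built on the StreamingListener API linked above; the fields used (numRecords, processingDelay, batchTime) follow the 1.x BatchInfo API, but treat the exact metric names as an assumption to verify against those docs:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class ThroughputListener extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        val records = info.numRecords
        val millis = info.processingDelay.getOrElse(0L)
        if (millis > 0) {
          // Records per second for this batch, as a rough throughput figure.
          println(s"Batch ${info.batchTime}: $records records, ${records * 1000.0 / millis} records/sec")
        }
      }
    }

    // Register with the streaming context before starting it:
    // ssc.addStreamingListener(new ThroughputListener)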