Re: How to stop a running SparkContext in the proper way?

2014-06-04 Thread Akhil Das
ctrl + z will stop the job from being executed (if you do a fg/bg you can resume the job). You need to press ctrl + c to terminate the job! Thanks Best Regards On Wed, Jun 4, 2014 at 10:24 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I want to know how I can stop a running
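
For a context created programmatically, the clean shutdown is SparkContext.stop(); a minimal Scala sketch (master and app name are illustrative):

    val sc = new org.apache.spark.SparkContext("local", "example")
    // ... run jobs ...
    sc.stop()   // releases executors and shuts the context down cleanly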

Re: ZeroMQ Stream - stack guard problem and no data

2014-06-04 Thread Prashant Sharma
Hi, What is your ZeroMQ version? It is known to work well with 2.2; an output of `sudo ldconfig -v | grep zmq` would be helpful in this regard. Thanks Prashant Sharma On Wed, Jun 4, 2014 at 11:40 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I am trying to use Spark Streaming (1.0.0)

SocketException when reading from S3 (s3n format)

2014-06-04 Thread yuzeh
Hi all, I've set up a 4-node spark cluster (the nodes are r3.large) with the spark-ec2 script. I've been trying to run a job on this cluster, and I'm trying to figure out why I get the following exception: java.net.SocketException: Connection reset at

Re: SocketException when reading from S3 (s3n format)

2014-06-04 Thread yuzeh
I should add that I'm using spark 0.9.1. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SocketException-when-reading-from-S3-s3n-format-tp6889p6890.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: mounting SSD devices of EC2 r3.8xlarge instances

2014-06-04 Thread Han JU
For SSDs in r3, maybe it's better to mount with the `discard` option since it supports TRIM. What I did for r3.large: echo '/dev/xvdb /mnt ext4 defaults,noatime,nodiratime,discard 0 0' >> /etc/fstab; mkfs.ext4 /dev/xvdb; mount /dev/xvdb 2014-06-03 19:15 GMT+02:00 Matei Zaharia

IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread bluejoe2008
what does this exception mean? 14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6 java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require(Predef.scala:221) at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271) at

Problem understanding log message in SparkStreaming

2014-06-04 Thread nilmish
I wanted to know the meaning of the following log message when running a spark streaming job : [spark-akka.actor.default-dispatcher-18] INFO org.apache.spark.streaming.scheduler.JobScheduler - Total delay: 5.432 s for time 1401870454500 ms (execution: 0.593 s) According to my understanding,

How to change default storage levels

2014-06-04 Thread Salih Kardan
Hi, I'm using Spark 0.9.1 and Shark 0.9.1. My dataset does not fit into the memory I have in my cluster setup, so I want to also use disk for caching. I guess MEMORY_ONLY is the default storage level in Spark. If that's the case, how could I change the storage level to MEMORY_AND_DISK in Spark?

executor idle during task schedule

2014-06-04 Thread wxhsdp
Hi all, I've observed that sometimes when the executor finishes one task, it will wait about 5 seconds to get another task to work on; during the 5 seconds, the executor does nothing: CPU idle, no disk access, no network transfer. Is that normal for Spark? Thanks! -- View this message in

compile spark 1.0.0 error

2014-06-04 Thread ch huang
Hi, mailing list: I tried to compile Spark, but it failed. Here is my compile command and compile output: # SPARK_HADOOP_VERSION=2.0.0-cdh4.4.0 SPARK_YARN=true sbt/sbt assembly [warn] 18 warnings found [info] Compiling 53 Scala sources and 1 Java source to

Re: IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread Xiangrui Meng
Could you check whether the vectors have the same size? -Xiangrui On Wed, Jun 4, 2014 at 1:43 AM, bluejoe2008 bluejoe2...@gmail.com wrote: what does this exception mean? 14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6 java.lang.IllegalArgumentException: requirement failed

Re: ZeroMQ Stream - stack guard problem and no data

2014-06-04 Thread Sean Owen
It's complaining about the native library shipped with ZeroMQ, right? That message is the JVM complaining about how it was compiled. If so, I think it's a question for ZeroMQ? On Wed, Jun 4, 2014 at 7:10 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I am trying to use Spark Streaming

Re: RDD with a Map

2014-06-04 Thread Oleg Proudnikov
Just a thought... Are you trying to use the RDD as a Map? On 3 June 2014 23:14, Doris Xin doris.s@gmail.com wrote: Hey Amit, You might want to check out PairRDDFunctions http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions. For your use

Re: Spark not working with mesos

2014-06-04 Thread praveshjain1991
Thanks for the reply Akhil. I created a tar.gz of created by make-distribution.sh which is accessible from all the slaves (I checked it using hadoop fs -ls /path/). Also there are no worker logs printed in $SPARK_HOME/work/ directory on the workers (which are otherwise printed if i run without

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Sean Owen
I think Mayur meant that Spark doesn't necessarily clean the closure under Java 7 -- is that true though? I didn't know of an issue there. Some anonymous class in your (?) OptimisingSort class is getting serialized, which may be fine and intentional, but it is not serializable. You haven't posted

Re: compile spark 1.0.0 error

2014-06-04 Thread Sean Owen
I am not sure if it is exposed in the SBT build, but you may need the equivalent of the 'yarn-alpha' profile from the Maven build. This older build of CDH predates the newer YARN APIs. See also https://groups.google.com/forum/#!msg/spark-users/T1soH67C5M4/CmGYV8kfRkcJ Or, use a later CDH. In

Re: Spark not working with mesos

2014-06-04 Thread Akhil Das
http://spark.apache.org/docs/latest/running-on-mesos.html#troubleshooting-and-debugging If you are not able to find the logs in /var/log/mesos, do check in /tmp/mesos/ and you can see your application's id and all, just like in the $SPARK_HOME/work directory. Thanks Best Regards On Wed,

Re: Error related to serialisation in spark streaming

2014-06-04 Thread nilmish
The error is resolved. I was using a comparator which was not serialisable, because of which it was throwing the error. I have now switched to the kryo serializer as it is faster than the java serializer. I have set the required config conf.set(spark.serializer,
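
A minimal sketch of that setting (property and class names as in Spark 0.9/1.0):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new org.apache.spark.SparkContext(conf)
    // note: objects captured in closures (e.g. a comparator) must still be serializable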

Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread lmk
Hi, I am a new Spark user. Please let me know how to handle the following scenario: I have a data set with the following fields: 1. DeviceId 2. latitude 3. longitude 4. ip address 5. Datetime 6. Mobile application name With the above data, I would like to perform the following steps: 1. Collect all

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Mayur Rustagi
I had issues around embedded functions; here's what I have figured out. Every inner class actually contains a field referencing the outer class. The anonymous class actually has a this$0 field referencing the outer class, which is why Spark is trying to serialize the outer class. In the Scala API, the

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Sean Owen
static inner classes do not refer to the outer class. Often people declare them non-static by default when it's unnecessary -- a Comparator class is typically a great example. Anonymous inner classes declared inside a method are another example, but there again they can be refactored into named
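
The same pitfall exists in Scala code; a common workaround (a sketch, with illustrative names) is to copy what the closure needs into a local val so that only the value is captured, not the enclosing instance:

    class Outer(val threshold: Int) {                // Outer itself is not Serializable
      def filterAbove(rdd: org.apache.spark.rdd.RDD[Int]) = {
        val t = threshold                            // local copy
        rdd.filter(x => x > t)                       // closure captures only t, not `this`
      }
    }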

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Oleg Proudnikov
It is possible if you use a cartesian product to produce all possible pairs for each IP address and 2 stages of map-reduce: - first by pairs of points to find the total of each pair and - second by IP address to find the pair for each IP address with the maximum count. Oleg On 4 June 2014
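
A rough Scala sketch of those two stages (toy data; the real field layout and pairing semantics are assumptions):

    import org.apache.spark.SparkContext._            // pair-RDD implicits
    // (ip, point) records standing in for the real dataset
    val points = sc.parallelize(Seq(
      ("ip1", (1.0, 2.0)), ("ip1", (3.0, 4.0)), ("ip1", (1.0, 2.0)),
      ("ip2", (5.0, 6.0)), ("ip2", (7.0, 8.0))))
    // stage 1: count how often each pair of points occurs for the same IP
    // (a self-join yields every ordered pair; each unordered pair shows up in both orders,
    //  which does not change which pair ends up with the maximum count)
    val pairCounts = points.join(points)
      .filter { case (_, (p1, p2)) => p1 != p2 }
      .map { case (ip, pair) => ((ip, pair), 1) }
      .reduceByKey(_ + _)
    // stage 2: for each IP keep the pair with the highest count
    val topPairPerIp = pairCounts
      .map { case ((ip, pair), n) => (ip, (pair, n)) }
      .reduceByKey { (a, b) => if (a._2 >= b._2) a else b }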

Join : Giving incorrect result

2014-06-04 Thread Ajay Srivastava
Hi, I am doing a join of two RDDs which is giving different results (counting the number of records) each time I run this code on the same input. The input files are large enough to be divided in two splits. When the program runs on two workers with a single core assigned to these, the output is consistent

Re: Facing MetricsSystem error on Running Spark applications

2014-06-04 Thread Sean Owen
You've got a conflict in the version of Jackson that is being used: Caused by: java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.module.SimpleSerializers.init(Ljava/util/List;)V Looks like you are using Jackson 2.x somewhere, but AFAIK all of the Hadoop/Spark libs are still on 1.x.

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Sean Owen
Those aren't the names of the artifacts: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22 The name is spark-streaming-twitter_2.10 On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Man, this has been hard going. Six days, and I
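
For example, a build.sbt dependency along these lines (shown for Spark 1.0.0, built against Scala 2.10) resolves to the spark-streaming-twitter_2.10 artifact:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"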

Re: Re: IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread bluejoe2008
Thank you, 孟祥瑞! With your help I solved the problem. I constructed SparseVectors in a wrong way: I mistook the first parameter of the constructor SparseVector(int size, int[] indices, double[] values) for the size of values. 2014-06-04 bluejoe2008 From: Xiangrui Meng Date: 2014-06-04
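
In Scala terms the same construction looks like this (a small sketch; the first argument is the full vector dimension, not the number of non-zero entries):

    import org.apache.spark.mllib.linalg.Vectors
    // a 5-dimensional vector with non-zeros at indices 1 and 3
    val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))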

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Nick Pentreath
@Sean, the %% syntax in SBT should automatically add the Scala major version qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct syntax for the build. I seemed to run into this issue with some missing Jackson deps, and solved it by including the jar explicitly on the driver

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Sean Owen
Ah sorry, this may be the thing I learned for the day. The issue is that classes from that particular artifact are missing though. Worth interrogating the resulting .jar file with jar tf to see if it made it in? On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote:

is there any easier way to define a custom RDD in Java

2014-06-04 Thread bluejoe2008
Hi folks, is there any easier way to define a custom RDD in Java? I am wondering if I have to define a new Java class which extends RDD from scratch. It is really a hard job for developers! 2014-06-04 bluejoe2008

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-04 Thread Jeremy Lee
On Wed, Jun 4, 2014 at 12:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Ah, sorry to hear you had more problems. Some thoughts on them: There will always be more problems, 'tis the nature of coding. :-) I try not to bother the list until I've smacked my head against them for a few hours,

Re: spark on yarn fail with IOException

2014-06-04 Thread sam
I get a very similar stack trace and have no idea what could be causing it (see below). I've created a SO: http://stackoverflow.com/questions/24038908/spark-fails-on-big-jobs-with-java-io-ioexception-filesystem-closed 14/06/02 20:44:04 INFO client.AppClient$ClientActor: Executor updated:

Spark Usecase

2014-06-04 Thread Shahab Yunus
Hello All. I have a newbie question. We have a use case where huge amount of data will be coming in streams or micro-batches of streams and we want to process these streams according to some business logic. We don't have to provide extremely low latency guarantees but batch M/R will still be

Re: Join : Giving incorrect result

2014-06-04 Thread Cheng Lian
Hi Ajay, would you mind synthesising a minimal code snippet that can reproduce this issue and pasting it here? On Wed, Jun 4, 2014 at 8:32 PM, Ajay Srivastava a_k_srivast...@yahoo.com wrote: Hi, I am doing a join of two RDDs which is giving different results (counting the number of records) each

Java IO Stream Corrupted - Invalid Type AC?

2014-06-04 Thread Matt Kielo
Hi, I'm trying to run some Spark code on a cluster but I keep running into a java.io.StreamCorruptedException: invalid type code: AC error. My task involves analyzing ~50GB of data (some operations involve sorting) and then writing it out to a JSON file. I'm running the analysis on each of the data's ~10

Re: SocketException when reading from S3 (s3n format)

2014-06-04 Thread Nicholas Chammas
I think by default a thread can die up to 4 times before Spark considers it a failure. Are you seeing that happen? I believe that is a configurable thing, but don't know off the top of my head how to change it. I've seen this error before when reading data from a large amount of files on S3, and
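
As far as I know, the retry count being referred to is spark.task.maxFailures (default 4); a sketch of raising it:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.task.maxFailures", "8")   // allow more task retries before the job is failed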

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-04 Thread Nicholas Chammas
On Wed, Jun 4, 2014 at 9:35 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Oh, I went back to m1.large while those issues get sorted out. Random side note, Amazon is deprecating the m1 instances in favor of m3 instances, which have SSDs and more ECUs than their m1 counterparts.

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Andrew Ash
Just curious, what do you want your custom RDD to do that the normal ones don't? On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote: hi, folks, is there any easier way to define a custom RDD in Java? I am wondering if I have to define a new java class which

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-04 Thread Sean Owen
On Wed, Jun 4, 2014 at 3:33 PM, Matt Kielo mki...@oculusinfo.com wrote: I'm trying to run some Spark code on a cluster but I keep running into a java.io.StreamCorruptedException: invalid type code: AC error. My task involves analyzing ~50GB of data (some operations involve sorting) and then writing

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Andrew Ash
nilmish, To confirm your code is using kryo, go to the web ui of your application (defaults to :4040) and look at the environment tab. If your serializer settings are there then things should be working properly. I'm not sure how to confirm that it works against typos in the setting, but you

Re: How to change default storage levels

2014-06-04 Thread Andrew Ash
You can change storage level on an individual RDD with .persist(StorageLevel.MEMORY_AND_DISK), but I don't think you can change what the default persistency level is for RDDs. Andrew On Wed, Jun 4, 2014 at 1:52 AM, Salih Kardan karda...@gmail.com wrote: Hi I'm using Spark 0.9.1 and Shark
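
A minimal sketch of that per-RDD persistence call (the input path is illustrative):

    import org.apache.spark.storage.StorageLevel
    val data = sc.textFile("hdfs:///some/path")
    data.persist(StorageLevel.MEMORY_AND_DISK)   // partitions that don't fit in memory spill to disk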

Re: RDD with a Map

2014-06-04 Thread Amit
Thanks folks. I was trying to get the RDD[multimap] so the collectAsMap is what I needed. Best, Amit On Jun 4, 2014, at 6:53, Cheng Lian lian.cs@gmail.com wrote: On Wed, Jun 4, 2014 at 5:56 AM, Amit Kumar kumarami...@gmail.com wrote: Hi Folks, I am new to spark -and this is probably
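
For reference, a small sketch of that usage (keys and values are illustrative):

    import org.apache.spark.SparkContext._        // pair-RDD implicits
    val pairs = sc.parallelize(Seq(("a", List(1, 2)), ("b", List(3))))
    val asMap = pairs.collectAsMap()              // scala.collection.Map[String, List[Int]] on the driver
    // collectAsMap() pulls the whole RDD to the driver, so it only suits small results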

Re: RDD with a Map

2014-06-04 Thread Amit
Yes, an RDD as a map of String keys and Lists of Strings as values. Amit On Jun 4, 2014, at 2:46, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Just a thought... Are you trying to use the RDD as a Map? On 3 June 2014 23:14, Doris Xin doris.s@gmail.com wrote: Hey Amit, You

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Andrew Ash
When you group by IP address in step 1 to this: (ip1,(lat1,lon1),(lat2,lon2)) (ip2,(lat3,lon3),(lat4,lat5)) How many lat/lon locations do you expect for each IP address? avg and max are interesting. Andrew On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov

pyspark join crash

2014-06-04 Thread Brad Miller
Hi All, I have experienced some crashing behavior with join in pyspark. When I attempt a join with 2000 partitions in the result, the join succeeds, but when I use only 200 partitions in the result, the join fails with the message Job aborted due to stage failure: Master removed our application:

Re: Spark not working with mesos

2014-06-04 Thread ajatix
Since $HADOOP_HOME is deprecated, try adding it to the Mesos configuration file. Add `export MESOS_HADOOP_HOME=$HADOOP_HOME` to ~/.bashrc and that should solve your error -- View this message in context:

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-04 Thread Daniel Darabos
On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: Hi All, I've been experiencing a very strange error after upgrading from Spark 0.9 to 1.0 - it seems that the saveAsTextFile function is throwing a java.lang.UnsupportedOperationException that I have never seen before.

Re: Better line number hints for logging?

2014-06-04 Thread Daniel Darabos
Oh, this would be super useful for us too! Actually wouldn't it be best if you could see the whole call stack on the UI, rather than just one line? (Of course you would have to click to expand it.) On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier jsalvat...@gmail.com wrote: Ok, I will probably

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread ajatix
I am also getting the exact error, with the exact logs when I run Spark 1.0.0 in coarse-grained mode. Coarse grained mode works perfectly with earlier versions that I tested - 0.9.1 and 0.9.0, but seems to have undergone some modification in spark 1.0.0 -- View this message in context:

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Mark Hamstra
Are you using spark-submit to run your application? On Wed, Jun 4, 2014 at 8:49 AM, ajatix a...@sigmoidanalytics.com wrote: I am also getting the exact error, with the exact logs when I run Spark 1.0.0 in coarse-grained mode. Coarse grained mode works perfectly with earlier versions that I

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread ajatix
I'm running a manually built cluster on EC2. I have mesos (0.18.2) and hdfs (2.0.0-cdh4.5.0) installed on all slaves (3) and masters (3). I have spark-1.0.0 on one master and the executor file is on hdfs for the slaves. Whenever I try to launch a spark application on the cluster, it starts a task

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-04 Thread Mark Hamstra
Actually, what the stack trace is showing is the result of an exception being thrown by the DAGScheduler's event processing actor. What happens is that the Supervisor tries to shut down Spark when an exception is thrown by that actor. As part of the shutdown procedure, the DAGScheduler tries to

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Marek Wiewiorka
Exactly the same story - it used to work with 0.9.1 and does not work anymore with 1.0.0. I ran tests using spark-shell as well as my application(so tested turning on coarse mode via env variable and SparkContext properties explicitly) M. 2014-06-04 18:12 GMT+02:00 ajatix

Re: using Log4j to log INFO level messages on workers

2014-06-04 Thread Shivani Rao
Hello Alex, Thanks for the link. Yes, creating a singleton object for logging outside the code that gets executed on the workers definitely works. The problem that I am facing, though, is related to configuration of the logger. I don't see any log messages in the worker logs of the application. a)

Re: Using mongo with PySpark

2014-06-04 Thread Samarth Mailinglist
Thanks a lot, sorry for the really late reply! (Didn't have my laptop) This is working, but it's dreadfully slow and seems to not run in parallel? On Mon, May 19, 2014 at 2:54 PM, Nick Pentreath nick.pentre...@gmail.com wrote: You need to use mapPartitions (or foreachPartition) to instantiate

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Patrick Wendell
Hey, thanks a lot for reporting this. Do you mind making a JIRA with the details so we can track it? - Patrick On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: Exactly the same story - it used to work with 0.9.1 and does not work anymore with 1.0.0. I ran tests

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Patrick Wendell
Hey There, This is only possible in Scala right now. However, this is almost never needed since the core API is fairly flexible. I have the same question as Andrew... what are you trying to do with your RDD? - Patrick On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash and...@andrewash.com wrote: Just

Re: error with cdh 5 spark installation

2014-06-04 Thread Patrick Wendell
Hey Chirag, Those init scripts are part of the Cloudera Spark package (they are not in the Spark project itself) so you might try e-mailing their support lists directly. - Patrick On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani chirag.lakh...@gmail.com wrote: I recently spun up an AWS cluster

Re: error with cdh 5 spark installation

2014-06-04 Thread Sean Owen
Spark is already part of the distribution, and the core CDH5 parcel. You shouldn't need extra steps unless you're doing something special. It may be that this is the very cause of the error when trying to install over the existing services. On Wed, Jun 4, 2014 at 3:19 PM, chirag lakhani

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Patrick Wendell
Hey Jeremy, The issue is that you are using one of the external libraries and these aren't actually packaged with Spark on the cluster, so you need to create an uber jar that includes them. You can look at the example here (I recently did this for a kafka project and the idea is the same):

Re: Invalid Class Exception

2014-06-04 Thread Suman Somasundar
I am building Spark by myself and I am using Java 7 to both build and run. I will try with Java 6. Thanks, Suman. On 6/3/2014 7:18 PM, Matei Zaharia wrote: What Java version do you have, and how did you get Spark (did you build it yourself by any chance or download a pre-built one)? If you

Re: Invalid Class Exception

2014-06-04 Thread Suman Somasundar
I tried building with Java 6 and also tried the pre-built packages. I am still getting the same error. It works fine when I run it on a machine with Solaris OS and x86 architecture. But it does not work with Solaris OS and SPARC architecture. Any ideas why this would happen? Thanks,

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Sam Taylor Steyer
Thank you! The regions advice solved the problem for my friend who was getting the key pair does not exist problem. I am still getting the error: ERROR:boto:400 Bad Request ERROR:boto:<?xml version=1.0 encoding=UTF-8?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'null'

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Krishna Sankar
chmod 600 path/FinalKey.pem Cheers k/ On Wed, Jun 4, 2014 at 12:49 PM, Sam Taylor Steyer sste...@stanford.edu wrote: Also, once my friend logged in to his cluster he received the error Permissions 0644 for 'FinalKey.pem' are too open. This sounds like the other problem described. How do we

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Sam Taylor Steyer
Awesome, that worked. Thank you! - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, June 4, 2014 12:52:00 PM Subject: Re: Trouble launching EC2 Cluster with Spark chmod 600 path/FinalKey.pem Cheers k/ On Wed, Jun 4, 2014 at 12:49

Re: Join : Giving incorrect result

2014-06-04 Thread Xu (Simon) Chen
Maybe your two workers have different assembly jar files? I just ran into a similar problem that my spark-shell is using a different jar file than my workers - got really confusing results. On Jun 4, 2014 8:33 AM, Ajay Srivastava a_k_srivast...@yahoo.com wrote: Hi, I am doing join of two RDDs

Re: access hdfs file name in map()

2014-06-04 Thread Xu (Simon) Chen
N/M.. I wrote a HadoopRDD subclass and appended one env field of the HadoopPartition to the value in the compute function. It worked pretty well. Thanks! On Jun 4, 2014 12:22 AM, Xu (Simon) Chen xche...@gmail.com wrote: I don't quite get it.. mapPartitionWithIndex takes a function that maps an

Re: Join : Giving incorrect result

2014-06-04 Thread Matei Zaharia
If this isn’t the problem, it would be great if you can post the code for the program. Matei On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote: Maybe your two workers have different assembly jar files? I just ran into a similar problem that my spark-shell is using a

reuse hadoop code in Spark

2014-06-04 Thread Wei Tan
Hello, I am trying to use spark in such a scenario: I have code written in Hadoop and now I try to migrate to Spark. The mappers and reducers are fairly complex. So I wonder if I can reuse the map() functions I already wrote in Hadoop (Java), and use Spark to chain them, mixing the Java

Re: reuse hadoop code in Spark

2014-06-04 Thread Matei Zaharia
Yes, you can write some glue in Spark to call these. Some functions to look at: - SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf configured by Hadoop (including InputFormat, paths, etc) - RDD.mapPartitions lets you operate in all the values on one partition (block)
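
A rough sketch of that glue (the input path and the per-record logic are placeholders for the existing Hadoop mapper code):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input")
    val records = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
                               classOf[LongWritable], classOf[Text])
    // reuse existing per-record logic inside mapPartitions, one call per partition (block)
    val mapped = records.mapPartitions { iter =>
      iter.map { case (_, value) => value.toString.toUpperCase }   // stand-in for the Hadoop map() body
    }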

Re: Better line number hints for logging?

2014-06-04 Thread Matei Zaharia
That’s a good idea too, maybe we can change CallSiteInfo to do that. Matei On Jun 4, 2014, at 8:44 AM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Oh, this would be super useful for us too! Actually wouldn't it be best if you could see the whole call stack on the UI, rather

Re: pyspark join crash

2014-06-04 Thread Matei Zaharia
In PySpark, the data processed by each reduce task needs to fit in memory within the Python process, so you should use more tasks to process this dataset. Data is spilled to disk across tasks. I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this — it’s something we’ve

Re: How can I dispose an Accumulator?

2014-06-04 Thread Daniel Siegmann
Will the broadcast variables be disposed automatically if the context is stopped, or do I still need to unpersist()? On Sat, May 31, 2014 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote: Hey There, You can remove an accumulator by just letting it go out of scope and it will be garbage

Re: SQLContext and HiveContext Query Performance

2014-06-04 Thread Zongheng Yang
Hi, Just wondering if you can try this: val obj = sql("select manufacturer, count(*) as examcount from pft group by manufacturer order by examcount desc") obj.collect() obj.queryExecution.executedPlan.executeCollect() and time the third line alone. It could be that Spark SQL is taking some time to

Re: How can I dispose an Accumulator?

2014-06-04 Thread Matei Zaharia
All of these are disposed of automatically if you stop the context or exit the program. Matei On Jun 4, 2014, at 2:22 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: Will the broadcast variables be disposed automatically if the context is stopped, or do I still need to unpersist()?

Re: pyspark join crash

2014-06-04 Thread Brad Miller
Hi Matei, Thanks for the reply and creating the JIRA. I hear what you're saying, although to be clear I want to still state that it seems like each reduce task is loading significantly more data than just the records needed for that task. The workers seem to load all data from each block

Re: SQLContext and HiveContext Query Performance

2014-06-04 Thread ssb61
I timed the third line and here are the stage timings: collect at SparkPlan.scala:52 - 0.5 s, mapPartitions at Exchange.scala:58 - 0.7 s, RangePartitioner at Exchange.scala:62 - 0.7 s, RangePartitioner at Exchange.scala:62 - 0.5 s

Cassandra examples don't work for me

2014-06-04 Thread Tim Kellogg
Hi, I’m following the directions to run the cassandra example “org.apache.spark.examples.CassandraTest” and I get this error Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected at

Re: Running a spark-submit compatible app in spark-shell

2014-06-04 Thread Roger Hoover
It took me a little while to get back to this but it works now!! I'm invoking the shell like this: spark-shell --jars target/scala-2.10/spark-etl_2.10-1.0.jar Once inside, I can invoke a method in my package to run the job. val reseult = etl.IP2IncomeJob.job(sc) On Tue, May 27, 2014 at 8:42

Re: error loading large files in PySpark 0.9.0

2014-06-04 Thread Jeremy Freeman
Hey Matei, Wanted to let you know this issue appears to be fixed in 1.0.0. Great work! -- Jeremy -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049p6985.html Sent from the Apache Spark User List mailing list

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Jeremy Lee
Thanks Patrick! Uberjars. Cool. I'd actually heard of them. And thanks for the link to the example! I shall work through that today. I'm still learning sbt and its many options... the last new framework I learned was node.js, and I think I've been rather spoiled by npm. At least it's not

Re: pyspark join crash

2014-06-04 Thread Matei Zaharia
I think the problem is that once unpacked in Python, the objects take considerably more space, as they are stored as Python objects in a Python dictionary. Take a look at python/pyspark/join.py and combineByKey in python/pyspark/rdd.py. We should probably try to store these in serialized form.

Spark assembly error.

2014-06-04 Thread Sung Hwan Chung
When I run sbt/sbt assembly, I get the following exception. Is anyone else experiencing a similar problem? .. [info] Resolving org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016 ... [info] Updating {file:/Users/Sung/Projects/spark_06_04_14/}assembly... [info] Resolving

Re: custom receiver in java

2014-06-04 Thread Tathagata Das
Yes, thanks for updating this old thread! We heard our community's demands and added support for Java receivers! TD On Wed, Jun 4, 2014 at 12:15 PM, lbustelo g...@bustelos.com wrote: Note that what TD was referring to above is already in 1.0.0

Re: Spark assembly error.

2014-06-04 Thread Sung Hwan Chung
Nevermind, it turns out that this is a problem for the Pivotal Hadoop that we are trying to compile against. On Wed, Jun 4, 2014 at 4:16 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: When I run sbt/sbt assembly, I get the following exception. Is anyone else experiencing a similar

Re: Why Scala?

2014-06-04 Thread John Omernik
So Python is used in many of the Spark Ecosystem products, but not Streaming at this point. Is there a roadmap to include Python APIs in Spark Streaming? Anytime frame on this? Thanks! John On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Quite a few people ask

Re: Why Scala?

2014-06-04 Thread Matei Zaharia
We are definitely investigating a Python API for Streaming, but no announced deadline at this point. Matei On Jun 4, 2014, at 5:02 PM, John Omernik j...@omernik.com wrote: So Python is used in many of the Spark Ecosystem products, but not Streaming at this point. Is there a roadmap to

Re: Why Scala?

2014-06-04 Thread John Omernik
Thank you for the response. If it helps at all: I demoed the Spark platform for our data science team today. The idea of moving code from batch testing, to Machine Learning systems, GraphX, and then to near-real time models with streaming was cheered by the team as an efficiency they would love.

Re: compile spark 1.0.0 error

2014-06-04 Thread ch huang
If I compile Spark with CDH4.6 and enable YARN support, can it run on CDH4.4? On Wed, Jun 4, 2014 at 5:59 PM, Sean Owen so...@cloudera.com wrote: I am not sure if it is exposed in the SBT build, but you may need the equivalent of the 'yarn-alpha' profile from the Maven build. This older

Logistic Regression MLLib Slow

2014-06-04 Thread Srikrishna S
Hi All, I am new to Spark and I am trying to run LogisticRegression (with SGD) using MLlib on a beefy single machine with about 128GB RAM. The dataset has about 80M rows with only 4 features, so it barely occupies 2GB on disk. I am running the code using all 8 cores with 20G memory using

Re: Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread bluejoe2008
I want to use Spark to handle data from non-SQL databases (an RDF triple store, for example). However, I am not familiar with Scala, so I want to know how to create an RdfTriplesRDD rapidly. 2014-06-05 bluejoe2008 From: Patrick Wendell Date: 2014-06-05 01:25 To: user Subject: Re: is there any
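
For what it's worth, the Scala skeleton is fairly small; an illustrative (untested) sketch of what an RdfTriplesRDD could look like, with the triple-store access stubbed out:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class TriplePartition(val index: Int) extends Partition

    class RdfTriplesRDD(sc: SparkContext, numParts: Int)
      extends RDD[(String, String, String)](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        (0 until numParts).map(i => new TriplePartition(i): Partition).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[(String, String, String)] = {
        // open a connection to the triple store for this partition and stream triples back
        Iterator(("subject", "predicate", "object"))   // placeholder
      }
    }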

Using log4j.xml

2014-06-04 Thread Michael Chang
Has anyone tried to use a log4j.xml instead of a log4j.properties with spark 0.9.1? I'm trying to run spark streaming on yarn and i've set the environment variable SPARK_LOG4J_CONF to a log4j.xml file instead of a log4j.properties file, but spark seems to be using the default log4j.properties

Re: Logistic Regression MLLib Slow

2014-06-04 Thread Matei Zaharia
Are you using the logistic_regression.py in examples/src/main/python or examples/src/main/python/mllib? The first one is an example of writing logistic regression by hand and won’t be as efficient as the MLlib one. I suggest trying the MLlib one. You may also want to check how many iterations
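
In Scala-API terms the MLlib route looks roughly like this (the input path, label column and iteration count are assumptions):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val data = sc.textFile("hdfs:///path/to/data.csv").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))   // first column assumed to be the 0/1 label
    }.cache()
    val model = LogisticRegressionWithSGD.train(data, 20)  // 20 iterations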

Re: Logistic Regression MLLib Slow

2014-06-04 Thread Matei Zaharia
Ah, is the file gzipped by any chance? We can’t decompress gzipped files in parallel so they get processed by a single task. It may also be worth looking at the application UI (http://localhost:4040) to see 1) whether all the data fits in memory in the Storage tab (maybe it somehow becomes

Re: Why Scala?

2014-06-04 Thread Jeremy Lee
I'm still a Spark newbie, but I have a heavy background in languages and compilers... so take this with a barrel of salt... Scala, to me, is the heart and soul of Spark. Couldn't work without it. Procedural languages like Python, Java, and all the rest are lovely when you have a couple of

mismatched hdfs protocol

2014-06-04 Thread bluejoe2008
Hi all, when my Spark program accessed HDFS files an error happened: Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4. It seems the client was trying to connect to hadoop2 via an old hadoop protocol, so my question is:

Re: Logistic Regression MLLib Slow

2014-06-04 Thread Xiangrui Meng
80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't take that long, even on a single executor. Besides what Matei suggested, could you also verify the executor memory in http://localhost:4040 in the Executors tab. It is very likely the executors do not have enough memory. In that

Re: Logistic Regression MLLib Slow

2014-06-04 Thread Xiangrui Meng
Hi Krishna, Specifying executor memory in local mode has no effect, because all of the threads run inside the same JVM. You can either try --driver-memory 60g or start a standalone server. Best, Xiangrui On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng men...@gmail.com wrote: 80M by 4 should be

Re: Spark Usecase

2014-06-04 Thread Krishna Sankar
Shahab, Interesting question. Couple of points (based on the information from your e-mail): 1. One can support the use case in Spark as a set of transformations on a WIP RDD over a span of time, with the final transformation outputting to a processed RDD - Spark Streaming would be

Re: Logistic Regression MLLib Slow

2014-06-04 Thread Srikrishna S
I will try both and get back to you soon! Thanks for all your help! Regards, Krishna On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng men...@gmail.com wrote: Hi Krishna, Specifying executor memory in local mode has no effect, because all of the threads run inside the same JVM. You can either

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Patrick Wendell
Hey Sam, You mentioned two problems here; did your VPC error message get fixed, or only the key permissions problem? I noticed we had someone report a similar issue with the VPC stuff a long time back (but there is no real resolution here): https://spark-project.atlassian.net/browse/SPARK-1166 If
