Re: distinct in data frame in spark

2014-03-25 Thread Andrew Ash
My thought would be to key by the first item in each array, then take just one array for each key. Something like the below: val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5))); val col = 0; val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu
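
A slightly fuller, runnable sketch of the same approach for the Scala shell; the sample rows are illustrative:

    // Key each row by the chosen column, keep the first row seen per key,
    // then drop the keys again.
    val v = sc.parallelize(Seq(Seq(1, 2, 3, 4), Seq(1, 5, 2, 3), Seq(2, 3, 4, 5)))
    val col = 0
    val deduped = v.keyBy(_(col)).reduceByKey((a, b) => a).values
    deduped.collect().foreach(row => println(row.mkString(",")))  // one row per distinct value in column 0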

Re: Shark does not give any results with SELECT count(*) command

2014-03-25 Thread qingyang li
Reopening this thread because I encountered this problem again. Here is my env: Scala 2.10.3, Spark 0.9.0 standalone mode, Shark 0.9.0 (downloaded the source code and built it myself), Hive hive-shark-0.11. I have copied hive-site.xml from my Hadoop cluster; its Hive version is 0.12. After copying, I

Re: Java API - Serialization Issue

2014-03-25 Thread santhoma
This worked great. Thanks a lot -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Java-API-Serialization-Issue-tp1460p3178.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

tracking resource usage for spark-shell commands

2014-03-25 Thread Bharath Bhushan
Is there a way to see the resource usage of each spark-shell command — say time taken and memory used? I checked the WebUI of spark-shell and of the master and I don’t see any such breakdown. I see the time taken in the INFO logs but nothing about memory usage. It would also be nice to track

Re: Shark does not give any results with SELECT count(*) command

2014-03-25 Thread Praveen R
Hi Qingyang Li, Shark-0.9.0 uses a patched version of hive-0.11, and using the configuration/metastore of hive-0.12 could be incompatible. May I know the reason you are using hive-site.xml from the previous hive version (to use an existing metastore?). You might just leave hive-site.xml blank otherwise.

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Andrew Ash
Possibly one of your executors is in the middle of a large stop-the-world GC and doesn't respond to network traffic during that period? If you shared some information about how each node in your cluster is set up (heap size, memory, CPU, etc) that might help with debugging. Andrew On Mon, Mar

Re: Pig on Spark

2014-03-25 Thread lalit1303
Hi, I have been following Aniket's spork github repository: https://github.com/aniket486/pig. I have made all the changes mentioned in the recently modified pig-spark file. I am using: Hadoop 2.0.5-alpha, spark-0.8.1-incubating, Mesos 0.16.0. ##PIG variables export

Change print() in JavaNetworkWordCount

2014-03-25 Thread Eduardo Costa Alfaia
Hi guys, I think I have already asked this question, but I don't remember whether anyone answered me. I would like to change, in the print() function, the number of words and frequencies that are sent to the driver's screen. The default value is 10. Could anyone help me with this? Best

Re: Change print() in JavaNetworkWordCount

2014-03-25 Thread Sourav Chandra
You can extend DStream and override the print() method, then create your own DStream extending from this one. On Tue, Mar 25, 2014 at 6:07 PM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi Guys, I think that I already did this question, but I don't remember if anyone has answered
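
A minimal alternative sketch that avoids subclassing DStream, assuming Scala Streaming; it just takes the first n elements of each batch on the driver (the helper name and n are illustrative):

    import org.apache.spark.streaming.dstream.DStream

    // Print the first n elements of every batch instead of print()'s fixed 10.
    def printFirst[T](stream: DStream[T], n: Int): Unit = {
      stream.foreachRDD { rdd =>
        println("-------------------------------------------")
        rdd.take(n).foreach(println)
      }
    }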

tuple as keys in pyspark show up reversed

2014-03-25 Thread Friso van Vollenhoven
Hi, I have an example where I use a tuple of (int, int) in Python as the key for an RDD. When I do a reduceByKey(...), sometimes the tuples turn up with the two ints reversed in order (which is problematic, as the ordering is part of the key). Here is an IPython notebook that has some code and

Re: tuple as keys in pyspark show up reversed

2014-03-25 Thread Friso van Vollenhoven
OK, forget about this question. It was a nasty one-character typo in my own code (sorting by rating instead of item at one point). Best, Friso On Tue, Mar 25, 2014 at 1:53 PM, Friso van Vollenhoven f.van.vollenho...@gmail.com wrote: Hi, I have an example where I use a tuple of (int,int) in

K-means faster on Mahout than on Spark

2014-03-25 Thread Egor Pahomov
Hi, I'm running a benchmark that compares Mahout and Spark MLlib. For now I have the following results for k-means: number of iterations = 10, number of elements = 1000, Mahout time = 602, Spark time = 138; number of iterations = 40, number of elements = 1000, Mahout time = 1917, Spark time = 330; number of

Re: K-means faster on Mahout than on Spark

2014-03-25 Thread Guillaume Pitel (eXenSa)
Maybe with MEMORY_ONLY, Spark has to recompute the RDDs several times because they don't fit in memory, which makes things run slower. As a general safe rule, use MEMORY_AND_DISK_SER. Guillaume Pitel - President of eXenSa. Prashant Sharma scrapco...@gmail.com wrote: I think Mahout uses
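
A minimal sketch of setting that storage level on the input RDD before iterating; the input path and parsing are illustrative:

    import org.apache.spark.storage.StorageLevel

    // Serialized in memory, spilling to disk instead of recomputing the
    // partitions that don't fit on every k-means iteration.
    val points = sc.textFile("hdfs:///data/kmeans-points")  // illustrative path
      .map(_.split(' ').map(_.toDouble))
    points.persist(StorageLevel.MEMORY_AND_DISK_SER)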

Re: K-means faster on Mahout than on Spark

2014-03-25 Thread Suneel Marthi
Mahout does have a k-means which can be executed in MapReduce and iterative modes. Sent from my iPhone On Mar 25, 2014, at 9:25 AM, Prashant Sharma scrapco...@gmail.com wrote: I think Mahout uses FuzzyKmeans, which is different algorithm and it is not iterative. Prashant Sharma On

Re: K-means faster on Mahout than on Spark

2014-03-25 Thread Egor Pahomov
Mahout used MR and ran one MR job per iteration. It worked as predicted. My question is more about why Spark was so slow. I would try MEMORY_AND_DISK_SER. 2014-03-25 17:58 GMT+04:00 Suneel Marthi suneel_mar...@yahoo.com: Mahout does have a kmeans which can be executed in mapreduce and iterative

Re: K-means faster on Mahout than on Spark

2014-03-25 Thread Prashant Sharma
I think Mahout uses FuzzyKmeans, which is a different algorithm and is not iterative. Prashant Sharma On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov pahomov.e...@gmail.com wrote: Hi, I'm running benchmark, which compares Mahout and SparkML. For now I have next results for k-means: Number of

RE: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Adrian Mocanu
Let me rephrase that: do you think it is possible to use an accumulator to skip the first few incomplete RDDs? -Original Message- From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-25-14 9:57 AM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: RE:

Running a task once on each executor

2014-03-25 Thread deenar.toraskar
Hi, Is there a way in Spark to run a function on each executor just once? I have a couple of use cases. a) I use an external library that is a singleton. It keeps some global state and provides some functions to manipulate it (e.g. reclaim memory, etc.). I want to check the global state of this

Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, when you say just once, have you defined across multiple what (e.g., across multiple threads in the same JVM on the same machine)? In principle you can have multiple executors on the same machine. In any case, assuming it's the same JVM, have you considered using a singleton that

ClassCastException when using saveAsTextFile

2014-03-25 Thread Niko Stahl
Hi, I'm trying to save an RDD to HDFS with the saveAsTextFile method on my EC2 cluster and am encountering the following exception (the app is called GraphTest): Exception failure: java.lang.ClassCastException: cannot assign instance of GraphTest$$anonfun$3 to field

Re: Akka error with largish job (works fine for smaller versions)

2014-03-25 Thread Nathan Kronenfeld
After digging deeper, I realized all the workers ran out of memory, giving an hs_error.log file in /tmp/jvm-PID with the header: # Native memory allocation (malloc) failed to allocate 2097152 bytes for committing reserved memory. # Possible reasons: # The system is out of physical RAM or swap

Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, the singleton pattern I'm suggesting would look something like this: public class TaskNonce { private transient boolean mIsAlreadyDone; private static transient TaskNonce mSingleton = new TaskNonce(); private transient Object mSyncObject = new Object(); public TaskNonce
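
For the Scala side, a minimal sketch of the same idea, assuming an illustrative setup function; a JVM-wide object plus a double-checked flag runs the work at most once per executor JVM:

    // Runs the given setup at most once per executor JVM, however many tasks call ensure().
    object TaskNonce {
      @volatile private var alreadyDone = false

      def ensure(doSetup: () => Unit): Unit = {
        if (!alreadyDone) {
          synchronized {
            if (!alreadyDone) {
              doSetup()
              alreadyDone = true
            }
          }
        }
      }
    }

    // Illustrative usage from inside a job (externalLibrary is hypothetical):
    // rdd.foreachPartition { _ => TaskNonce.ensure(() => externalLibrary.reclaimMemory()) }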

Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Andrew Lee
Hi All, I'm getting the following error when I execute start-master.sh, which also invokes spark-class at the end: Failed to find Spark assembly in /root/spark/assembly/target/scala-2.10/ You need to build Spark with 'sbt/sbt assembly' before running this program. After digging into the

Re: spark executor/driver log files management

2014-03-25 Thread Tathagata Das
The logs from the executor are redirected to stdout only because there is a default log4j.properties that is configured to do so. If you put your own log4j.properties with a rolling file appender on the classpath (refer to the Spark docs for that), all the logs will get redirected to separate files that
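
A minimal log4j.properties sketch with a rolling file appender; the file path, size limit, and layout are illustrative, not Spark defaults:

    # Illustrative rolling-file configuration; adjust the path and sizes to your setup.
    log4j.rootLogger=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.File=/var/log/spark/executor.log
    log4j.appender.rolling.MaxFileSize=50MB
    log4j.appender.rolling.MaxBackupIndex=10
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n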

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Patrick Wendell
Starting with Spark 0.9 the protobuf dependency we use is shaded and cannot interfere with other protobuf libraries, including those in Hadoop. Not sure what's going on in this case. Would someone who is having this problem post exactly how they are building Spark? - Patrick On Fri, Mar 21, 2014

RE: Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?

2014-03-25 Thread Andrew Lee
Hi Paul, I got it sorted out. The problem is that the JARs are built into the assembly JAR when you run sbt/sbt clean assembly. What I did is: sbt/sbt clean package. This will only give you the small JARs. The next step is to update the CLASSPATH in the bin/compute-classpath.sh script manually,

Re: Spark 0.9.0-incubation + Apache Hadoop 2.2.0 + YARN encounter Compression codec com.hadoop.compression.lzo.LzoCodec not found

2014-03-25 Thread alee526
You can try adding the following to your shell: in bin/compute-classpath.sh, append the LZO JAR from MapReduce: CLASSPATH=$CLASSPATH:$HADOOP_HOME/share/hadoop/mapreduce/lib/hadoop-lzo.jar export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native/ export

[BLOG] Shark on Cassandra

2014-03-25 Thread Brian O'Neill
As promised, here is that follow-up post for those looking to get started with Shark against Cassandra: -- Brian ONeill CTO, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42

[BLOG] : Shark on Cassandra

2014-03-25 Thread Brian O'Neill
As promised, here is that followup post for those looking to get started with Shark against Cassandra: http://brianoneill.blogspot.com/2014/03/shark-on-cassandra-w-cash-interrogating.html Again -- thanks to Rohit and the team at TupleJump. Great work. -brian -- Brian ONeill CTO, Health Market

Re: Spark Streaming ZeroMQ Java Example

2014-03-25 Thread Tathagata Das
Unfortunately there isn't one right now. But it is probably not too hard to start with the JavaNetworkWordCount (https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java) and use the ZeroMQUtils in the same way as the

Re: Implementation problem with Streaming

2014-03-25 Thread Mayur Rustagi
Two good benefits of Streaming: 1. it maintains windows as you move across time, removing/adding monads as you move through the window; 2. it connects with streaming systems like Kafka to import data as it comes and process it. You don't seem to need either of these features; you would be better off using Spark

Re: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Tathagata Das
You can probably do it in a simpler but sort of hacky way! If your window size is W and sliding interval S, you can do some math to figure out how many of the first windows are actually partial windows. It's probably math.ceil(W/S). So in a windowDStream.foreachRDD() you can increment a global
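
A minimal sketch of that idea, assuming Scala and illustrative window/slide durations; stream is an existing DStream and process() stands in for the real work:

    import org.apache.spark.streaming.Seconds

    // Illustrative durations; substitute your own.
    val windowDuration = Seconds(30)
    val slideDuration  = Seconds(10)
    val windowed = stream.window(windowDuration, slideDuration)

    // Number of leading windows that are only partially filled.
    val partialWindows =
      math.ceil(windowDuration.milliseconds.toDouble / slideDuration.milliseconds).toInt
    var seen = 0  // foreachRDD's body runs on the driver, so a plain counter is enough

    windowed.foreachRDD { rdd =>
      seen += 1
      if (seen > partialWindows) {
        process(rdd)  // hypothetical handler that only sees complete windows
      }
    }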

Re: Writing RDDs to HDFS

2014-03-25 Thread Ognen Duzlevski
Well, my long-running app has 512M per executor on a 16-node cluster where each machine has 16G of RAM. I could not run a second application until I restricted spark.cores.max. As soon as I restricted the cores, I was able to run a second job at the same time. Ognen On 3/24/14, 7:46 PM,
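
A minimal sketch of capping an application's cores on the standalone scheduler so a second job can get executors; the app name and the value 8 are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Limit this application to 8 cores cluster-wide, leaving the rest for other apps.
    val conf = new SparkConf()
      .setAppName("SecondJob")        // illustrative name
      .set("spark.cores.max", "8")    // illustrative cap
    val sc = new SparkContext(conf)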

Re: rdd.saveAsTextFile problem

2014-03-25 Thread gaganbm
Hi folks, Is this issue resolved? If yes, could you please throw some light on how to fix it? I am facing the same problem when writing to text files. When I do stream.foreachRDD(rdd => { rdd.saveAsTextFile("some path") }) this works
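
A minimal sketch of one common variant, assuming the issue is successive batches writing to the same directory; the base path is illustrative:

    // Give each batch its own output directory, keyed by batch time, so
    // saveAsTextFile never targets a path that already exists.
    stream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile("hdfs:///output/words-" + time.milliseconds)  // illustrative base path
    }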