Parquet Hive table becomes very slow on 1.3?

2015-03-30 Thread Zheng, Xudong
Hi all, We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But we find that even a simple COUNT(*) query is much slower (100x) than on Spark 1.2. Most of the time is spent on the driver getting HDFS blocks, and a large number of the logs below get printed: 15/03/30 23:03:43 DEBUG Proto

Re: How to configure SparkUI to use internal ec2 ip

2015-03-30 Thread Akhil Das
You can add an internal-IP-to-public-hostname mapping in your /etc/hosts file; if your forwarding is set up properly, it shouldn't be a problem thereafter. Thanks Best Regards On Tue, Mar 31, 2015 at 9:18 AM, anny9699 wrote: > Hi, > > For security reasons, we added a server between my aws Spark Cl
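For illustration, such an /etc/hosts mapping might look like the following (the IP address and hostname are placeholders, not values from this thread):

    # /etc/hosts on the machine running the browser / SOCKS proxy
    # internal IP        hostname that the SparkUI links resolve to
    10.0.12.34           ip-10-0-12-34.ec2.internal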

Re: Spark 1.3 build with hive support fails

2015-03-30 Thread Bojan Kostic
Try building with scala 2.10. Best Bojan On Mar 31, 2015 01:51, "nightwolf [via Apache Spark User List]" < ml-node+s1001560n22309...@n3.nabble.com> wrote: > I am having the same problems. Did you find a fix? > > -- > If you reply to this email, your message will be ad

Re: Actor not found

2015-03-30 Thread Shixiong Zhu
Could you paste the whole stack trace here? Best Regards, Shixiong Zhu 2015-03-31 2:26 GMT+08:00 sparkdi : > I have the same problem, i.e. exception with the same call stack when I > start > either pyspark or spark-shell. I use spark-1.3.0-bin-hadoop2.4 on ubuntu > 14.10. > bin/pyspark > > bunch

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread ๏̯͡๏
I have raised a JIRA - https://issues.apache.org/jira/browse/SPARK-6622 - to track this issue and see whether it requires a fix from Spark. On Tue, Mar 31, 2015 at 9:31 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Hello Lian, > This blog talks about how to install the Hive metastore. I think that i took > fr

Re: Build fails on 1.3 Branch

2015-03-30 Thread Marty Bower
Confirmed (on 1.3 branch) - thanks. On Sun, Mar 29, 2015 at 12:08 PM Reynold Xin wrote: > I pushed a hotfix to the branch. Should work now. > > > On Sun, Mar 29, 2015 at 9:23 AM, Marty Bower wrote: > >> Yes, that worked - thank you very much. >> >> >> >> On Sun, Mar 29, 2015 at 9:05 AM Ted Yu

How to configure SparkUI to use internal ec2 ip

2015-03-30 Thread anny9699
Hi, For security reasons, we added a server between my aws Spark Cluster and local, so I couldn't connect to the cluster directly. To see the SparkUI and its related work's stdout and stderr, I used dynamic forwarding and configured the SOCKS proxy. Now I could see the SparkUI using the internal

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-30 Thread Denny Lee
Hi Vincent, This may be a case that you're missing a semi-colon after your CREATE TEMPORARY TABLE statement. I ran your original statement (missing the semi-colon) and got the same error as you did. As soon as I added it in, I was good to go again: CREATE TEMPORARY TABLE jsonTable USING org.apa
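For reference, a minimal form of the statement with the terminating semi-colon, following the Spark 1.3 data sources documentation (the path below is only illustrative):

    CREATE TEMPORARY TABLE jsonTable
    USING org.apache.spark.sql.json
    OPTIONS (path "examples/src/main/resources/people.json");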

Re: When will 1.3.1 release?

2015-03-30 Thread Michael Armbrust
I'm hoping to cut an RC this week. We are just waiting for a few other critical fixes. On Mon, Mar 30, 2015 at 12:54 PM, Kelly, Jonathan wrote: > Are you referring to SPARK-6330? > > If you are able to build Spark from source yourself, I b

Re: data frame API, change groupBy result column name

2015-03-30 Thread Michael Armbrust
You'll need to use the longer form for aggregation: tb2.groupBy("city", "state").agg(avg("price").as("newName")).show depending on the language you'll need to import: scala: import org.apache.spark.sql.functions._ python: from pyspark.sql.functions import * On Mon, Mar 30, 2015 at 5:49 PM, Neal
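Spelled out as a small Scala sketch, assuming a DataFrame tb2 with city, state and price columns as in this thread:

    import org.apache.spark.sql.functions._

    // group, aggregate and rename the aggregate column in one pass
    tb2.groupBy("city", "state")
      .agg(avg("price").as("newName"))
      .show()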

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
The stack trace for the first scenario and your suggested improvement is similar, the only difference being the first line (sorry for not including this): "Log directory /home/hduser/spark/spark-events does not exist." To verify your premises, I cd'ed into the directory by copy-pasting the path list

Re: WordCount example

2015-03-30 Thread Mohit Anchlia
I tried to file a bug in git repo however I don't see a link to "open issues" On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia wrote: > I checked the ports using netstat and don't see any connections > established on that port. Logs show only this: > > 15/03/27 13:50:48 INFO Master: Registering a

data frame API, change groupBy result column name

2015-03-30 Thread Neal Yin
I ran a line like the following: tb2.groupBy("city", "state").avg("price").show I got this result: city state AVG(price) Charlestown New South Wales 1200.0 Newton ... MA 1200.0 Coral Gables ... FL 1200.0 Castricum Noord-H

Task size is large when CombineTextInputFormat is used

2015-03-30 Thread Taeyun Kim
Hi, I used CombineTextInputFormat to read many small files. The Java code is as follows (I've written it as a utility function): public static JavaRDD combineTextFile(JavaSparkContext sc, String path, long maxSplitSize, boolean recursive) { Configuration conf = new Confi
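The Java code is cut off above; a rough Scala sketch of such a helper (not the poster's actual code, and assuming the Hadoop mapreduce.* property names) could look like this:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Read many small files as a few large splits by capping the combined split size.
    def combineTextFile(sc: SparkContext, path: String,
                        maxSplitSize: Long, recursive: Boolean): RDD[String] = {
      val conf = new Configuration(sc.hadoopConfiguration)
      conf.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplitSize)
      conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", recursive)
      sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
        classOf[LongWritable], classOf[Text], conf)
        .map(_._2.toString) // keep only the line contents
    }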

Spark Streaming on YARN with loss of application master

2015-03-30 Thread Matt Narrell
I’m looking at various HA scenarios with Spark streaming. We’re currently running a Spark streaming job that is intended to be long-lived, 24/7. We see that if we kill node managers that are hosting Spark workers, new node managers assume execution of the jobs that were running on the stopped

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Marcelo Vanzin
So, the error below is still showing the invalid configuration. You mentioned in the other e-mails that you also changed the configuration, and that the directory really, really exists. Given the exception below, the only ways you'd get the error with a valid configuration would be if (i) the dire

Re: Spark Streaming - Subroutine not being executed more than once

2015-03-30 Thread jhakku
hey all, I am trying to figure out if I can use Spark for building loosely coupled distributed data pipelines. This is part of a pitch that I am trying to come up with. I'd really appreciate it if someone can comment on whether this is possible or not. Many Thanks -- View this message in context: http://apache-spark

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
I run Spark in local mode. Command line (added some debug info): hduser@hadoop7:~/spark-terasort$ ./bin/run-example SparkPi 10 Jar: /home/hduser/spark-terasort/examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop2.4.0.jar /home/hduser/spark-terasort/bin/spark-submit --master local[*] --

Re: MLlib Spam example gets stuck in Stage X

2015-03-30 Thread Su She
Thank you for updating the files Holden! I actually was using that same text in my files located on HDFS. Could the files being located on HDFS be the reason why the example gets stuck? I c/p the code provided on github, the only things I changed were: a) file paths to: val spam = sc.textFile("hdf

Re: Spark 1.3 build with hive support fails

2015-03-30 Thread nightwolf
I am having the same problems. Did you find a fix? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-build-with-hive-support-fails-tp22215p22309.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
We test large feature dimension but not very large k (https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525). Again, please create a JIRA and post your test code and a link to your test dataset, we can work on it. It is hard to track the issue with multiple threads in

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Marcelo Vanzin
Are you running Spark in cluster mode by any chance? (It always helps to show the command line you're actually running, and if there's an exception, the first few frames of the stack trace.) On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen wrote: > Updated spark-defaults and spark-env: > "Log dir

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Tom Hubregtsen
Updated spark-defaults and spark-env: "Log directory /home/hduser/spark/spark-events does not exist." (Also, in the default /tmp/spark-events it also did not work) On 30 March 2015 at 18:03, Marcelo Vanzin wrote: > Are those config values in spark-defaults.conf? I don't think you can > use "~" t

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Marcelo Vanzin
Are those config values in spark-defaults.conf? I don't think you can use "~" there - IIRC it does not do any kind of variable expansion. On Mon, Mar 30, 2015 at 3:50 PM, Tom wrote: > I have set > spark.eventLog.enabled true > as I try to preserve log files. When I run, I get > "Log directory /tm
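In other words, spark-defaults.conf needs literal absolute paths; a hedged example using the paths mentioned later in this thread:

    # spark-defaults.conf -- no "~" or shell variable expansion is performed here
    spark.eventLog.enabled  true
    spark.eventLog.dir      /home/hduser/spark/spark-events
    spark.local.dir         /home/hduser/spark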

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xi Shen
For the same amount of data, if I set the k=500, the job finished in about 3 hrs. I wonder if I set k=5000, the job could finish in 30 hrs...the longest time I waited was 12 hrs... If I use kmeans-random, same amount of data, k=5000, the job finished in less than 2 hrs. I think current kmeans|| i

"Spark-events does not exist" error, while it does with all the req. rights

2015-03-30 Thread Tom
I have set spark.eventLog.enabled true as I try to preserve log files. When I run, I get "Log directory /tmp/spark-events does not exist." I set spark.local.dir ~/spark spark.eventLog.dir ~/spark/spark-events and SPARK_LOCAL_DIRS=~/spark Now I get: "Log directory ~/spark/spark-events does not ex

Re: Why is a Spark job faster through Eclipse than Standalone Cluster

2015-03-30 Thread rival95
I re-ran my application through Eclipse on Ubuntu and received slower than expected results of 6.1 minutes. So the question is now, why would there be such a difference of run times between Windows 7 and Ubuntu 14.04? -- View this message in context: http://apache-spark-user-list.1001560.n3.nab

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple of TeraBytes is not that challenging (depending on the algorithm) these days, whereas 5 years ago it was a big challenge. We have a bit over a PetaByte (not using Spark) and using a distributed system is the only viable way

Re: Spark 1.3.0 Build Failure

2015-03-30 Thread Marcelo Vanzin
This sounds like SPARK-6532. On Mon, Mar 30, 2015 at 1:34 PM, ARose wrote: > So, I am trying to build Spark 1.3.0 (standalone mode) on Windows 7 using > Maven, but I'm getting a build failure. > > java -version > java version "1.8.0_31" > Java(TM) SE Runtime Environment (build 1.8.0_31-b13) > Jav

Java and Kryo Serialization, Java.io.OptionalDataException

2015-03-30 Thread zia_kayani
I have set Kryo Serializer as default serializer in SparkConf and Spark UI confirms it too, but in the Spark logs I'm getting this exception, java.io.OptionalDataException at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1370) at java.io.ObjectInputStream.readObject(Obj

Re: Understanding Spark Memory distribution

2015-03-30 Thread Ankur Srivastava
Hi Wisely, I am running spark 1.2.1 and I have checked the process heap and it is running with all the heap that I am assigning and as I mentioned earlier I get OOM on workers not the driver or master. Thanks Ankur On Mon, Mar 30, 2015 at 9:24 AM, giive chen wrote: > Hi Ankur > > If you using

Re: kmeans|| in Spark is not real paralleled?

2015-03-30 Thread Xiangrui Meng
This PR updated the k-means|| initialization: https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d, which was included in 1.3.0. It should fix kmeans|| initialization with large k. Please create a JIRA for this issue and send me the code and the dataset to produce this pro

Re: k-means can only run on one executor with one thread?

2015-03-30 Thread Xiangrui Meng
Hey Xi, Have you tried Spark 1.3.0? The initialization happens on the driver node and we fixed an issue with the initialization in 1.3.0. Again, please start with a smaller k and increase it gradually. Let us know at what k the problem happens. Best, Xiangrui On Sat, Mar 28, 2015 at 3:11 AM, Xi

Why is a Spark job faster through Eclipse than Standalone Cluster

2015-03-30 Thread rival95
When I run my code in Eclipse with the following parameters, VM Args: -Xmx4g OS: Windows Time: 4.4 minutes It is faster than submitting to a cluster with these parameters: SPARK_EXECUTOR_MEMORY=4G OS: Ubuntu Time: 5.2 minutes They are equivalent options are they not? Both environments run on th

Re: Setting a custom loss function for GradientDescent

2015-03-30 Thread Xiangrui Meng
You can extend Gradient, e.g., https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala#L266, and use it in GradientDescent: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescen
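As a rough sketch of that approach (a hypothetical squared-error loss, not code from MLlib or this thread), a custom Gradient could look like the following; it can then be passed to GradientDescent.runMiniBatchSGD together with an Updater:

    import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
    import org.apache.spark.mllib.optimization.Gradient

    // Hypothetical custom loss: plain squared error, written as a Gradient subclass.
    class SquaredLossGradient extends Gradient {

      override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
        // prediction = w . x, residual = prediction - label
        val prediction = data.toArray.zip(weights.toArray).map { case (x, w) => x * w }.sum
        val diff = prediction - label
        (Vectors.dense(data.toArray.map(_ * diff)), 0.5 * diff * diff)
      }

      override def compute(data: Vector, label: Double, weights: Vector,
                           cumGradient: Vector): Double = {
        val (gradient, loss) = compute(data, label, weights)
        // GradientDescent hands in a dense cumGradient; add into its backing array in place.
        val cum = cumGradient.asInstanceOf[DenseVector].values
        var i = 0
        while (i < cum.length) { cum(i) += gradient(i); i += 1 }
        loss
      }
    }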

Registering classes with KryoSerializer

2015-03-30 Thread Arun Lists
I am trying to register classes with KryoSerializer. I get the following error message: How do I find out what class is being referred to by OpenHashMap$mcI$sp? com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered: com.comp.common.base.OpenHash
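For illustration, compiler-specialized classes like this are usually registered by name; the class names below are guesses based on the truncated error above, so treat them as placeholders:

    import org.apache.spark.SparkConf

    // Specialized variants carry the "$mcI$sp" suffix and have to be looked up by name.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(
        Class.forName("com.comp.common.base.OpenHashMap"),
        Class.forName("com.comp.common.base.OpenHashMap$mcI$sp")))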

Re: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread nsalian
Try running it like this: sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10 Caveats: 1) Make sure the permissions of /user/nick is 775 or 777. 2) No need for hostnam

Re: Spark-submit not working when application jar is in hdfs

2015-03-30 Thread nsalian
Client mode would not support HDFS jar extraction. I tried this: sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10 And it worked. -- View this message in context:

Spark 1.3.0 Build Failure

2015-03-30 Thread ARose
So, I am trying to build Spark 1.3.0 (standalone mode) on Windows 7 using Maven, but I'm getting a build failure. java -version java version "1.8.0_31" Java(TM) SE Runtime Environment (build 1.8.0_31-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode) Here is the command I am usi

log4j.properties in jar

2015-03-30 Thread Udit Mehta
Hi, Is it possible to put the log4j.properties in the application jar such that the driver and the executors use this log4j file. Do I need to specify anything while submitting my app so that this file is used? Thanks, Udit

Re: Spark and OpenJDK - jar: No such file or directory

2015-03-30 Thread Kelly, Jonathan
Ah, never mind, I found the jar command in the java-1.7.0-openjdk-devel package. I only had java-1.7.0-openjdk installed. Looks like I just need to install java-1.7.0-openjdk-devel then set JAVA_HOME to /usr/lib/jvm/java instead of /usr/lib/jvm/jre. ~ Jonathan Kelly From: , Jonathan Kelly ma

Spark and OpenJDK - jar: No such file or directory

2015-03-30 Thread Kelly, Jonathan
I'm trying to use OpenJDK 7 with Spark 1.3.0 and noticed that the compute-classpath.sh script is not adding the datanucleus jars to the classpath because compute-classpath.sh is assuming to find the jar command in $JAVA_HOME/bin/jar, which does not exist for OpenJDK. Is this an issue anybody e

Re: When will 1.3.1 release?

2015-03-30 Thread Kelly, Jonathan
Are you referring to SPARK-6330? If you are able to build Spark from source yourself, I believe you should just need to cherry-pick the following commits in order to backport the fix: 67fa6d1f830dee37244b5a30684d797093c7c134 [SPARK-6330] Fix fil

When will 1.3.1 release?

2015-03-30 Thread Shuai Zheng
Hi All, I am waiting for Spark 1.3.1 to fix the bug in working with the S3 file system. Does anyone know the release date for 1.3.1? I can't downgrade to 1.2.1 because there is a jar compatibility issue with the AWS SDK. Regards, Shuai

Re: Actor not found

2015-03-30 Thread sparkdi
I have the same problem, i.e. exception with the same call stack when I start either pyspark or spark-shell. I use spark-1.3.0-bin-hadoop2.4 on ubuntu 14.10. bin/pyspark bunch of INFO messages, then ActorInitializationException exception. Shell starts, I can do this: >>> rd = sc.parallelize([1,2]

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Steve Loughran
On 30 Mar 2015, at 13:27, jay vyas wrote: Just the same as spark was disrupting the hadoop ecosystem by changing the assumption that "you can't rely on memory in distributed analytics"...now maybe we are challenging the assumption that "big data analytics

Re: Cannot run spark-shell "command not found".

2015-03-30 Thread Manas Kar
If you are only interested in getting a hands on with Spark and not with building it with specific version of Hadoop use one of the bundle provider like cloudera. It will give you a very easy way to install and monitor your services.( I find installing via cloudera manager http://www.cloudera.com/

Re: Cannot run spark-shell "command not found".

2015-03-30 Thread roni
I think you must have downloaded the spark source code gz file. It is a little confusing. You have to select the hadoop version as well, and the actual tgz file will have the spark version and hadoop version in it. -R On Mon, Mar 30, 2015 at 10:34 AM, vance46 wrote: > Hi all, > > I'm a newbee try to se

Cannot run spark-shell "command not found".

2015-03-30 Thread vance46
Hi all, I'm a newbie trying to set up Spark for my research project on a RedHat system. I've downloaded spark-1.3.0.tgz and untarred it, and installed python, java and scala. I've set JAVA_HOME and SCALA_HOME and then tried to use "sudo sbt/sbt assembly" according to https://docs.sigmoidanalytics.com/ind

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-30 Thread Zhan Zhang
Hi Folks, Just to summarize it to run SPARK on HDP distribution. 1. The spark version has to be 1.3.0 and above if you are using upstream distribution. This configuration is mainly for HDP rolling upgrade purpose, and the patch only went into spark upstream from 1.3.0. 2. In $SPARK_HOME/conf/

RE: How to get rdd count() without double evaluation of the RDD?

2015-03-30 Thread Wang, Ningjun (LNG-NPV)
Sean, Yes I know that I can use persist() to persist to disk, but persisting a huge RDD to disk is still a big extra cost. I hope that I can do one pass to get the count as well as rdd.saveAsObjectFile(file2), but I don’t know how. Maybe use an accumulator to count the total? Ningjun From: M
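A minimal sketch of the accumulator idea, assuming rdd and file2 as in the thread (note that accumulator updates inside transformations can be re-applied on task retries, so the count is only reliable when no tasks fail):

    // Count records in the same pass that writes them out.
    val counter = sc.accumulator(0L)
    val counted = rdd.map { x => counter += 1L; x }
    counted.saveAsObjectFile(file2)                 // one action, one pass over the data
    println(s"records written: ${counter.value}")   // read the count on the driver afterwards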

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-30 Thread Ted Yu
Nicolas: See if there was occurrence of the following exception in the log: errs => throw new SparkException( s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " + errs.mkString("\n")), Cheers On Mon, Mar 30, 2015 at 9:40 AM, Cody Koeninge

Re: Why k-means cluster hang for a long time?

2015-03-30 Thread Xiangrui Meng
Hi Xi, Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k? Best, Xiangrui On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen wrote: > Hi Burak, > > After I added .repartition(sc.defaultParallelism), I can see from the log > the partition number is set to 32. But in the S

Re: java.lang.IncompatibleClassChangeError when using PrunedFilteredScan

2015-03-30 Thread Gaspar Muñoz
Hello, Thank you for your contribution. We have tried to reproduce your error but we need more information: - Which Spark version are you using? Stratio Spark-Mongodb connector supports 1.2.x SparkSQL version. - What jars are you adding while launching the Spark-shell? Best regards, 2015-03-0

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-30 Thread Xiangrui Meng
Okay, I didn't realize that I changed the behavior of lambda in 1.3 to make it "scale-invariant", but it is worth discussing whether this is a good change. In 1.2, we multiply lambda by the number of ratings in each sub-problem. This makes it "scale-invariant" for explicit feedback. However, in impli

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Shivaram Venkataraman
One workaround could be to convert a DataFrame into a RDD inside the transform function and then use mapPartitions/broadcast to work with the JNI calls and then convert back to RDD. Thanks Shivaram On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa wrote: > Dear all, > > I'm still struggling to
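A very rough sketch of that workaround, where loadModelBytes and NativeModel are hypothetical stand-ins for the JNI-backed model:

    // Broadcast whatever is serializable (e.g. the raw model bytes), then build the
    // non-serializable JNI wrapper once per partition instead of once per record.
    val modelBytes = sc.broadcast(loadModelBytes(modelPath))   // hypothetical helper

    val scored = df.rdd.mapPartitions { rows =>
      val model = NativeModel.fromBytes(modelBytes.value)      // hypothetical JNI wrapper
      rows.map(row => model.predict(row))
    }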

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-30 Thread Cody Koeninger
This line at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close( KafkaRDD.scala:158) is the attempt to close the underlying kafka simple consumer. We can add a null pointer check, but the underlying issue of the consumer being null probably indicates a problem earlier. Do you see

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread Cheng Lian
Ah, sorry, my bad... http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html On 3/30/15 10:24 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: Hello Lian Can you share the URL ? On Mon, Mar 30, 2015 at 6:12 PM, Cheng Lian > wro

Re: Understanding Spark Memory distribution

2015-03-30 Thread giive chen
Hi Ankur, If you are using standalone mode, your config is wrong. You should use "export SPARK_DAEMON_MEMORY=xxx" in config/spark-env.sh. At least it works on my spark 1.3.0 standalone mode machine. BTW, SPARK_DRIVER_MEMORY is used in Yarn mode and it looks like standalone mode doesn't use this c

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-30 Thread Akhil Das
Did you try this example? https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala I think you need to create a topic set with # partitions to consume. Thanks Best Regards On Mon, Mar 30, 2015 at 9:35 PM, Nicolas Phu

Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-30 Thread Nicolas Phung
Hello, I'm using spark-streaming-kafka 1.3.0 with the new consumer "Approach 2: Direct Approach (No Receivers)" ( http://spark.apache.org/docs/latest/streaming-kafka-integration.html). I'm using the following code snippets : // Create direct kafka stream with brokers and topics val messages = Kaf
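The snippet is cut off above; the documented direct-stream setup for 1.3.0 looks roughly like this (broker list and topic name are placeholders, and ssc is the StreamingContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("myTopic")   // every partition of each listed topic is consumed

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)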

Re: Job Opportunity in London

2015-03-30 Thread Akhil Das
Maybe you should mail him directly on j.bo...@ucl.ac.uk Thanks Best Regards On Mon, Mar 30, 2015 at 8:47 PM, Chitturi Padma < learnings.chitt...@gmail.com> wrote: > Hi, > > I am interested in this opportunity. I am working as Research Engineer in > Impetus Technologies, Bangalore, India. In fact

actorStream woes

2015-03-30 Thread Marius Soutier
Hi there, I'm using Spark Streaming 1.2.1 with actorStreams. Initially, all goes well. 15/03/30 15:37:00 INFO spark.storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.2 KB, free 1589.8 MB) 15/03/30 15:37:00 INFO spark.storage.BlockManagerInfo: Added broadcast_1_p

Re: How to avoid being killed by YARN node manager ?

2015-03-30 Thread Y. Sakamoto
Thank you for your reply. I'm sorry my confirmation is slow. I'll try tuning 'spark.yarn.executor.memoryOverhead'. Thanks, Yuichiro Sakamoto On 2015/03/25 0:56, Sandy Ryza wrote: Hi Yuichiro, The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the executors have enou
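For reference, the overhead is specified in megabytes; a hedged example (the value is only an illustrative starting point, not one from this thread):

    # spark-defaults.conf
    spark.yarn.executor.memoryOverhead  1024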

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-30 Thread Jaonary Rabarisoa
Dear all, I'm still struggling to make a pre-trained caffe model transformer for DataFrames work. The main problem is that creating a caffe model inside the UDF is very slow and consumes memory. Some of you suggested broadcasting the model. The problem with broadcasting is that I use a JNI interf

Re: Job Opportunity in London

2015-03-30 Thread Chitturi Padma
Hi, I am interested in this opportunity. I am working as Research Engineer in Impetus Technologies, Bangalore, India. In fact we implemented Distributed Deep Learning on Spark. Will share my CV if you are interested. Please visit the below link: http://www.signalprocessingsociety.org/newsletter/2

RE: java.io.FileNotFoundException when using HDFS in cluster mode

2015-03-30 Thread java8964
I think the jar file has to be local; jars in HDFS are not supported yet in Spark. See this answer: http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs > Date: Sun, 29 Mar 2015 22:34:46 -0700 > From: n.e.trav...@gmail.com > To: user@spark.apache.org > Sub

Online Realtime Recommendation System

2015-03-30 Thread dvpe
Hi, I'd like to have an online realtime recommendation system. I have an ALS model but I want to add new data in realtime. Is it possible? Any guidelines? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Online-Realtime-Recommendation-System-tp22297.html

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-03-30 Thread dvpe
Hi, do you have any updates -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-do-incremental-training-using-ALSModel-MLlib-tp20942p22296.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Spark Streaming/Flume display all events

2015-03-30 Thread Nathan Marin
Hi, DStream.print() only prints the first 10 elements contained in the Stream. You can call DStream.print(x) to print the first x elements but if you don’t know the exact count you can call DStream.foreachRDD and apply a function to display the content of every RDD. For example: stream.foreach
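A minimal sketch of the foreachRDD variant (collect() pulls every batch back to the driver, so this only makes sense for small streams):

    stream.foreachRDD { rdd =>
      rdd.collect().foreach(println)   // print every event in the batch, not just the first 10
    }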

Spark Streaming/Flume display all events

2015-03-30 Thread Chong Zhang
Hi, I am new to Spark/Streaming, and tried to run modified FlumeEventCount.scala example to display all events by adding the call: stream.map(e => "Event:header:" + e.event.get(0).toString + "body: " + new String(e.event.getBody.array)).print() The spark-submit runs fine with --master local

Re: Too many open files

2015-03-30 Thread Masf
I'm executing my application in local mode (with --master local[*]). I'm using ubuntu and I've put "session required pam_limits.so" into /etc/pam.d/common-session but it doesn't work On Mon, Mar 30, 2015 at 4:08 PM, Ted Yu wrote: > bq. In /etc/secucity/limits.conf set the next values: > > Have

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread ๏̯͡๏
Hello Lian Can you share the URL ? On Mon, Mar 30, 2015 at 6:12 PM, Cheng Lian wrote: > The "mysql" command line doesn't use JDBC to talk to MySQL server, so > this doesn't verify anything. > > I think this Hive metastore installation guide from Cloudera may be > helpful. Although this document

Re: python : Out of memory: Kill process

2015-03-30 Thread Eduardo Cusa
Hi, I changed my process flow. Now I am processing a file per hour, instead of processing at the end of the day. This decreased the memory consumption. Regards Eduardo On Thu, Mar 26, 2015 at 3:16 PM, Davies Liu wrote: > Could you narrow down to a step which causes the OOM, something like:

Re: Too many open files

2015-03-30 Thread Ted Yu
bq. In /etc/security/limits.conf set the next values: Have you done the above modification on all the machines in your Spark cluster? If you use Ubuntu, be sure that the /etc/pam.d/common-session file contains the following line: session required pam_limits.so On Mon, Mar 30, 2015 at 5:08 AM

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-30 Thread Doug Balog
The “best” solution to spark-shell’s problem is creating a file $SPARK_HOME/conf/java-opts with “-Dhdp.version=2.2.0.0-2014” Cheers, Doug > On Mar 28, 2015, at 1:25 PM, Michael Stone wrote: > > I've also been having trouble running 1.3.0 on HDP. The > spark.yarn.am.extraJavaOptions -Dhdp.ve

Re: Streaming anomaly detection using ARIMA

2015-03-30 Thread Corey Nolet
Taking out the complexity of the ARIMA models to simplify things- I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched style of the DStreams API. On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet wrote: > I wan
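For what it's worth, a plain moving average can be expressed over a sliding window; a sketch assuming a DStream[Double] named values (window and slide durations are only illustrative):

    import org.apache.spark.streaming.Seconds

    val avgs = values
      .map(v => (v, 1L))
      .reduceByWindow(
        (a, b) => (a._1 + b._1, a._2 + b._2),  // running (sum, count) over the window
        Seconds(300),                          // window length
        Seconds(60))                           // slide interval
      .map { case (sum, count) => sum / count }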

Re: DataFrame and non-lazy RDD operation

2015-03-30 Thread Wail
One more thing. How can I map it so that I don't get a list of objects of type "Any"? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-and-non-lazy-RDD-operation-tp22293p22294.html Sent from the Apache Spark User List mailing list archive a

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-30 Thread Cheng Lian
Try this in Spark shell: import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.sql.hive.HiveContext; val jsc = new JavaSparkContext(sc); val hc = new HiveContext(jsc.sc) (I never mentioned that JavaSparkContext extends SparkContext…) Cheng On 3/30/15 8:28 PM, Vi

Re: Problem with groupBy and OOM when just writing the group in a file

2015-03-30 Thread Mario Pastorelli
It worked, thank you. On 30.03.2015 11:58, Sean Owen wrote: The behavior is the same. I am not sure it's a problem as much as a design decision. It does not require everything to stay in memory, but the values for one key at a time. Have a look at how the preceding shuffle works. Consider repartit

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread Cheng Lian
The "mysql" command line doesn't use JDBC to talk to MySQL server, so this doesn't verify anything. I think this Hive metastore installation guide from Cloudera may be helpful. Although this document is for CDH4, the general steps are the same, and should help you to figure out the relationshi

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread jay vyas
Just the same as spark was disrupting the hadoop ecosystem by changing the assumption that "you can't rely on memory in distributed analytics"...now maybe we are challenging the assumption that "big data analytics need to distributed"? I've been asking the same question lately and seen similarly t

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-30 Thread Vincent He
thanks. That is what I have tried. JavaSparkContext does not extend SparkContext, it can not be used here. Anyone else know whether we can use HiveContext with JavaSparkContext, from API documents, seems this is not supported. thanks. On Sun, Mar 29, 2015 at 9:24 AM, Cheng Lian wrote: > I mean

Re: Too many open files

2015-03-30 Thread Masf
Hi. I've relogged in; in fact, I ran 'ulimit -n' and it returns 100, but it crashes. I'm doing reduceByKey and SparkSQL mixed over 17 files (250MB-500MB/file) Regards. Miguel Angel. On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das wrote: > Mostly, you will have to restart the machines to get the ul

Re: Too many open files

2015-03-30 Thread Akhil Das
Mostly, you will have to restart the machines to get the ulimit effect (or relogin). What operation are you doing? Are you doing too many repartitions? Thanks Best Regards On Mon, Mar 30, 2015 at 4:52 PM, Masf wrote: > Hi > > I have a problem with temp data in Spark. I have fixed > spark.shuffl

Re: SparkSQL Timestamp query failure

2015-03-30 Thread anu
Hi Alessandro, Could you specify which query you were able to run successfully? 1. sqlContext.sql("SELECT * FROM Logs as l where l.timestamp = '2012-10-08 16:10:36' ").collect OR 2. sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestamp as string) = '2012-10-08 16:10:36.0'").collect I

Too many open files

2015-03-30 Thread Masf
Hi, I have a problem with temp data in Spark. I have fixed spark.shuffle.manager to "SORT". In /etc/security/limits.conf I set the following values: * soft nofile 100 * hard nofile 100 In spark-env.sh I set ulimit -n 100 I've restarted the spark service and it

Re: RDD collect hangs on large input data

2015-03-30 Thread Zsolt Tóth
Thanks for your answer! I don't call .collect because I want to trigger the execution. I call it because I need the rdd on the driver. This is not a huge RDD and it's not larger than the one returned with 50GB input data. The end of the stack trace: The two IP's are the two worker nodes, I think

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
Thanks Saisai. I will try your solution, but I still don't understand why the filesystem should be used when there is plenty of memory available! On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao wrote: > Shuffle write will finally spill the data into file system as a bunch of > files. If you want

Re: Spark caching

2015-03-30 Thread Renato Marroquín Mogrovejo
Thanks Sean! Do you know if there is a way (even manually) to delete these intermediate shuffle results? I just want to test the "expected" behaviour. I know that re-caching might be a positive action most of the time, but I want to try it without it. Renato M. 2015-03-30 12:15 GMT+02:00 Sea

Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-30 Thread Nathan Marin
Hi, thanks for your quick answers. I looked at what was being written on disk and a folder called blockmgr-d0236c76-7f7c-4a60-a6ae-ffc622b2db84 was enlarging every second. This folder contained shuffle data and was not being cleaned (after 30minutes of my application running it contained the shuff

Re: Spark caching

2015-03-30 Thread Sean Owen
I think that you get a sort of "silent" caching after shuffles, in some cases, since the shuffle files are not immediately removed and can be reused. (This is the flip side to the frequent question/complaint that the shuffle files aren't removed straight away.) On Mon, Mar 30, 2015 at 9:43 AM, Re

Re: Problem with groupBy and OOM when just writing the group in a file

2015-03-30 Thread Sean Owen
The behavior is the same. I am not sure it's a problem as much as a design decision. It does not require everything to stay in memory, but the values for one key at a time. Have a look at how the preceding shuffle works. Consider repartitionAndSortWithinPartitions to *partition* by hour and then sort
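A sketch of that suggestion, assuming an RDD of (timestamp, record) pairs; hourOf is a hypothetical helper that maps a timestamp to its hour bucket:

    import org.apache.spark.Partitioner

    // Route each record to the partition for its hour, and sort by (hour, timestamp) within it,
    // so each partition can be streamed straight to its hourly file without grouping in memory.
    class HourPartitioner(hours: Int) extends Partitioner {
      def numPartitions: Int = hours
      def getPartition(key: Any): Int = key match {
        case (hour: Int, _) => hour % hours
        case _ => 0
      }
    }

    val sorted = records                                      // RDD[(Long, Record)]
      .map { case (ts, rec) => ((hourOf(ts), ts), rec) }
      .repartitionAndSortWithinPartitions(new HourPartitioner(24))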

Receive on driver program (without serializing)

2015-03-30 Thread MartijnD
We are building a wrapper that makes it possible to use reactive streams (i.e. Observable, see reactivex.io) as input to Spark Streaming. We therefore tried to create a custom receiver for Spark. However, the Observable lives at the driver program and is generally not serializable. Is it possible

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Steve Loughran
Note that even the Facebook "four degrees of separation" paper went down to a single machine running WebGraph (http://webgraph.di.unimi.it/) for the final steps, after running jobs in their Hadoop cluster to build the dataset for that final operation. "The computations were performed on a 24-c

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-30 Thread Saisai Shao
Shuffle write will finally spill the data into file system as a bunch of files. If you want to avoid disk write, you can mount a ramdisk and configure "spark.local.dir" to this ram disk. So shuffle output will write to memory based FS, and will not introduce disk IO. Thanks Jerry 2015-03-30 17:15

Re: Re: Re: Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread luohui20001
Got it. Thank you. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Saisai Shao To: 罗辉 Cc: user Subject: Re: Re: Re: Re: How SparkStreaming output messages to Kafka? Date: 2015-03-30 17:05 This warning is not related to "--from-beginning". It means there's

why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
Hi, I was looking at SparkUI, Executors, and I noticed that I have 597 MB for "Shuffle Write" while I am using a cached temp-table and Spark had 2 GB of free memory (the number under Memory Used is 597 MB / 2.6 GB)?!!! Shouldn't Shuffle Write be zero and all (map/reduce) tasks be done in memor

Problem with groupBy and OOM when just writing the group in a file

2015-03-30 Thread Mario Pastorelli
we are experiencing some problems with the groupBy operations when used to group together data that will be written in the same file. The operation that we want to do is the following: given some data with a timestamp, we want to sort it by timestamp, group it by hour and write one file per hou

Re: Re: Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread Saisai Shao
This warning is not related to "--from-beginning". It means there's no new data for current partition in current batch duration, it is acceptable. If you pushing the data into Kafka again, this warning log will be disappeared. Thanks Saisai 2015-03-30 16:58 GMT+08:00 : > BTW, what's the matter a
