Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Nicholas Chammas
That sounds good to me. Shall I open a JIRA / PR about updating the site community page? On Fri, Jan 23, 2015 at 4:37 AM Patrick Wendell patr...@databricks.com wrote: Hey Nick, So I think what we can do is encourage people to participate on the Stack Overflow topic, and this I think we can do

Installing Spark Standalone to a Cluster

2015-01-23 Thread riginos
Do I need to manually install and configure Hadoop before doing anything with Spark standalone? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21339.html Sent from the Apache Spark User List mailing list archive

Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-23 Thread Yana Kadiyska
If you're running the test via sbt, you can examine the classpath that sbt uses for the test (show runtime:full-classpath or last run) -- I find this helps once too many includes and excludes interact. On Thu, Jan 22, 2015 at 3:50 PM, Adrian Mocanu amoc...@verticalscope.com wrote: I use spark

Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Gerard Maas
+1 On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That sounds good to me. Shall I open a JIRA / PR about updating the site community page? On Fri, Jan 23, 2015 at 4:37 AM Patrick Wendell patr...@databricks.com wrote: Hey Nick, So I think what we can

Installing Spark Standalone to a Cluster

2015-01-23 Thread riginos
I need someone to send me a snapshot of his /conf/spark-env.sh file, because I don't understand how to set some vars like SPARK_MASTER, etc. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-Standalone-to-a-Cluster-tp21341.html Sent from the

Large number of pyspark.daemon processes

2015-01-23 Thread Sven Krasser
Hey all, I am running into a problem where YARN kills containers for being over their memory allocation (which is about 8G for executors plus 6G for overhead), and I noticed that in those containers there are tons of pyspark.daemon processes hogging memory. Here's a snippet from a container with

Re: Results never return to driver | Spark Custom Reader

2015-01-23 Thread Yana Kadiyska
It looks to me like your executor actually crashed and didn't just finish properly. Can you check the executor log? It is available in the UI, or on the worker machine, under $SPARK_HOME/work/ app-20150123155114-/6/stderr (unless you manually changed the work directory location but in that

Re: Installing Spark Standalone to a Cluster

2015-01-23 Thread HARIPRIYA AYYALASOMAYAJULA
Not needed. You can directly follow the installation. However, you might need sbt to package your files into a jar. On Fri, Jan 23, 2015 at 11:54 AM, riginos samarasrigi...@gmail.com wrote: Do I need to manually install and configure Hadoop before doing anything with Spark standalone? -- View

Re: How to make spark partition sticky, i.e. stay with node?

2015-01-23 Thread Tathagata Das
Hello Mingyu, That is a reasonable way of doing this. Spark Streaming does not natively support stickiness because Spark launches tasks based on data locality. If there is no locality (for example, reduce tasks can run anywhere), the location is randomly assigned. So the cogroup or join introduces a locality

RE: spark 1.1.0 save data to hdfs failed

2015-01-23 Thread ey-chih chow
Thanks. I looked at the dependency tree. I did not see any dependent jar of hadoop-core from hadoop2. However, the jar built from Maven has the class org/apache/hadoop/mapreduce/task/TaskAttemptContextImpl.class. Do you know why? Date: Fri, 23 Jan 2015 17:01:48 + Subject: RE: spark

spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType

2015-01-23 Thread Manoj Samel
Using Spark 1.2. Read a CSV file, apply a schema to convert to a SchemaRDD, and then schemaRdd.saveAsParquetFile. If the schema includes TimestampType, it gives the following trace when doing the save: Exception in thread "main" java.lang.RuntimeException: Unsupported datatype TimestampType at

Starting a spark streaming app in init.d

2015-01-23 Thread Ashic Mahtab
Hello, I'm trying to kick off a Spark Streaming job to a standalone master using spark-submit inside of init.d. This is what I have: DAEMON="spark-submit --class Streamer --executor-memory 500M --total-executor-cores 4 /path/to/assembly.jar" start() { $DAEMON -p

Spark Streaming action not triggered with Kafka inputs

2015-01-23 Thread Chen Song
I am running into some problems with Spark Streaming when reading from Kafka. I used Spark 1.2.0 built on CDH5. The example is based on: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala * It works with default

Problems saving a large RDD (1 TB) to S3 as a sequence file

2015-01-23 Thread Darin McBeath
I've tried various ideas, but I'm really just shooting in the dark. I have an 8 node cluster of r3.8xlarge machines. The RDD (with 1024 partitions) I'm trying to save off to S3 is approximately 1TB in size (with the partitions pretty evenly distributed in size). I just tried a test to dial

Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-5390 On Fri Jan 23 2015 at 12:05:00 PM Gerard Maas gerard.m...@gmail.com wrote: +1 On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That sounds good to me. Shall I open a JIRA / PR about updating the site

Re: Problems saving a large RDD (1 TB) to S3 as a sequence file

2015-01-23 Thread Sven Krasser
Hey Darin, Are you running this over EMR or as a standalone cluster? I've had occasional success in similar cases by digging through all executor logs and trying to find exceptions that are not caused by the application shutdown (but the logs remain my main pain point with Spark). That aside,

Apache Spark standalone mode: number of cores

2015-01-23 Thread olegshirokikh
I'm trying to understand the basics of Spark internals, and the Spark documentation for submitting applications in local mode says for the spark-submit --master setting: local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). local[*] Run Spark locally

Re: spark-shell has syntax error on windows.

2015-01-23 Thread Josh Rosen
Do you mind filing a JIRA issue for this which includes the actual error message string that you saw? https://issues.apache.org/jira/browse/SPARK On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I am not sure if you get the same exception as I do --

Re: Apache Spark standalone mode: number of cores

2015-01-23 Thread Boromir Widas
The local mode still parallelizes calculations and it is useful for debugging as it goes through the steps of serialization/deserialization as a cluster would. On Fri, Jan 23, 2015 at 5:44 PM, olegshirokikh o...@solver.com wrote: I'm trying to understand the basics of Spark internals and Spark
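For reference, a minimal sketch of what running in local mode with a fixed thread count looks like (the app name and numbers are illustrative, not from the thread):

  import org.apache.spark.{SparkConf, SparkContext}

  // local[4] runs the driver and 4 worker threads in one JVM; local[*] uses all cores.
  val conf = new SparkConf().setMaster("local[4]").setAppName("LocalDebug") // hypothetical app name
  val sc = new SparkContext(conf)

  // Even locally the work is split across threads and partitions,
  // which is what makes it useful for debugging.
  val counts = sc.parallelize(1 to 1000, 8).map(_ % 10).countByValue()
  sc.stop()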

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-23 Thread Venkat, Ankam
Spark Committers: Please advise the way forward for this issue. Thanks for your support. Regards, Venkat From: Venkat, Ankam Sent: Thursday, January 22, 2015 9:34 AM To: 'Frank Austin Nothaft'; 'user@spark.apache.org' Cc: 'Nick Allen' Subject: RE: How to 'Pipe' Binary Data in Apache Spark How

Re: spark-shell has syntax error on windows.

2015-01-23 Thread Yana Kadiyska
https://issues.apache.org/jira/browse/SPARK-5389 I marked it as minor since I also just discovered that I can run it under PowerShell just fine. Vladimir, feel free to change the bug if you're getting a different message or a more serious issue. On Fri, Jan 23, 2015 at 4:44 PM, Josh Rosen

Re: Problems saving a large RDD (1 TB) to S3 as a sequence file

2015-01-23 Thread Darin McBeath
Thanks for the ideas, Sven. I'm using a standalone cluster (Spark 1.2). FWIW, I was able to get this running (just now). This is the first time it's worked in probably my last 10 attempts. In addition to limiting the executors to only 50% of the cluster, in the settings below I additionally

RE: spark 1.1.0 save data to hdfs failed

2015-01-23 Thread ey-chih chow
Thanks. But I think I already marked all the Spark and Hadoop deps as provided. Why is the cluster's version not used? Anyway, as I mentioned in the previous message, after changing hadoop-client to version 1.2.1 in my Maven deps, I got past that exception and on to another one as

Re: spark 1.1.0 save data to hdfs failed

2015-01-23 Thread Sean Owen
These are all definitely symptoms of mixing incompatible versions of libraries. I'm not suggesting you haven't excluded Spark / Hadoop, but, this is not the only way Hadoop deps get into your app. See my suggestion about investigating the dependency tree. On Fri, Jan 23, 2015 at 1:53 PM, ey-chih

RE: spark 1.1.0 save data to hdfs failed

2015-01-23 Thread ey-chih chow
Sorry, I still did not quite get your resolution. In my jar, there are the following three related classes: org/apache/hadoop/mapreduce/task/TaskAttemptContextImpl.class, org/apache/hadoop/mapreduce/task/TaskAttemptContextImpl$DummyReporter.class, and org/apache/hadoop/mapreduce/TaskAttemptContext.class. I

Re: Large number of pyspark.daemon processes

2015-01-23 Thread Sandy Ryza
Hi Sven, What version of Spark are you running? Recent versions have a change that allows PySpark to share a pool of processes instead of starting a new one for each task. -Sandy On Fri, Jan 23, 2015 at 9:36 AM, Sven Krasser kras...@gmail.com wrote: Hey all, I am running into a problem

Re: How to replay consuming messages from kafka using spark streaming?

2015-01-23 Thread mykidong
Hi, I have written a Spark Streaming Kafka receiver using the Kafka simple consumer API: https://github.com/mykidong/spark-kafka-simple-consumer-receiver This Kafka receiver can be used as an alternative to the current Spark Streaming Kafka receiver, which is written with the high-level Kafka consumer API.

Re: Streaming: getting data from Cassandra based on input stream values

2015-01-23 Thread madhu phatak
Hi, In that case, you can try the following: val joinRDD = kafkaStream.transform( streamRDD => { val ids = streamRDD.map(_._2).collect(); ids.map(userId => ctable.select("user_name").where("userid = ?", userId).toArray(0).get[String](0)) // better create a query which checks for all those ids at

Re: While Loop

2015-01-23 Thread Ted Yu
Can you tell us the problem you're facing ? Please see this thread: http://search-hadoop.com/m/JW1q5SsB5m Cheers On Fri, Jan 23, 2015 at 9:02 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there a better programming construct than while loop in Spark? Thank You

While Loop

2015-01-23 Thread Deep Pradhan
Hi, Is there a better programming construct than while loop in Spark? Thank You

Re: Large number of pyspark.daemon processes

2015-01-23 Thread Adam Diaz
YARN only has the ability to kill, not to checkpoint or suspend. If you use too much memory, it will simply kill tasks based upon the YARN config. https://issues.apache.org/jira/browse/YARN-2172 On Friday, January 23, 2015, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Sven, What version of
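For context, a minimal sketch of the memory knobs involved in this thread, with illustrative values only (the thread describes roughly 8G per executor plus 6G of overhead):

  import org.apache.spark.SparkConf

  // YARN kills the container when heap + off-heap usage (including pyspark.daemon
  // workers) exceeds executor memory plus the configured overhead.
  val conf = new SparkConf()
    .set("spark.executor.memory", "8g")
    .set("spark.yarn.executor.memoryOverhead", "6144") // MB; Spark 1.2-era property name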

Re: Aggregations based on sort order

2015-01-23 Thread Imran Rashid
I'm not sure about this, but I suspect the answer is: Spark doesn't guarantee a stable sort, nor does it plan to in the future, so the implementation has more flexibility. But you might be interested in the work being done on secondary sort, which could give you the guarantees you want:

Re: Starting a spark streaming app in init.d

2015-01-23 Thread Akhil Das
I'd do the same, but put an extra condition to check whether the job has successfully started or not by checking the application UI (checking that port 4040 is up would do; if you want a more complex check, write a parser for the same) after putting the main script to sleep for some time (say 2 minutes).

Re: Large number of pyspark.daemon processes

2015-01-23 Thread Sven Krasser
Hey Adam, I'm not sure I understand just yet what you have in mind. My takeaway from the logs is that the container actually was above its allotment of about 14G. Since 6G of that is for overhead, I assumed there would be plenty of space for Python workers, but there seem to be more of those than

Re: Large number of pyspark.daemon processes

2015-01-23 Thread Davies Liu
It should be a bug; the Python worker did not exit normally. Could you file a JIRA for this? Also, could you show how to reproduce this behavior? On Fri, Jan 23, 2015 at 11:45 PM, Sven Krasser kras...@gmail.com wrote: Hey Adam, I'm not sure I understand just yet what you have in mind. My

Re: Problems saving a large RDD (1 TB) to S3 as a sequence file

2015-01-23 Thread Akhil Das
Can you also try increasing the Akka frame size? .set("spark.akka.frameSize", "50") // Set it to a higher number Thanks Best Regards On Sat, Jan 24, 2015 at 3:58 AM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: Thanks for the ideas, Sven. I'm using a standalone cluster (Spark 1.2). FWIW, I
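A minimal sketch of where that setting goes, assuming it is applied on the application's SparkConf (the value 50 is just the number suggested above; the property is in MB):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("S3SequenceFileSave") // hypothetical app name
    .set("spark.akka.frameSize", "50") // MB
  val sc = new SparkContext(conf)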

Re: Large number of pyspark.daemon processes

2015-01-23 Thread Sven Krasser
Hey Sandy, I'm using Spark 1.2.0. I assume you're referring to worker reuse? In this case I've already set spark.python.worker.reuse to false (but I also saw this behavior when keeping it enabled). Best, -Sven On Fri, Jan 23, 2015 at 4:51 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi

Re: Spark Streaming action not triggered with Kafka inputs

2015-01-23 Thread Akhil Das
ssc.union will return a DStream; you should do something like: val lines = ssc.union(parallelInputs) lines.print() Thanks Best Regards On Sat, Jan 24, 2015 at 12:55 AM, Chen Song chen.song...@gmail.com wrote: I am running into some problems with Spark Streaming when reading from Kafka. I
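A minimal sketch of the suggested fix, assuming several parallel Kafka receivers as in the KafkaWordCount example linked above (the ZooKeeper quorum, group, and topic map are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val ssc = new StreamingContext(new SparkConf().setAppName("KafkaUnionSketch"), Seconds(2))
  val (zkQuorum, group) = ("localhost:2181", "test-group") // placeholders
  val topicMap = Map("events" -> 1)                        // placeholder topic

  // Several parallel receivers, as in the thread above.
  val parallelInputs = (1 to 3).map(_ => KafkaUtils.createStream(ssc, zkQuorum, group, topicMap))

  // ssc.union returns a new DStream; the output action must be called on it.
  val lines = ssc.union(parallelInputs)
  lines.map(_._2).print()

  ssc.start()
  ssc.awaitTermination()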

Re: Installing Spark Standalone to a Cluster

2015-01-23 Thread Akhil Das
Which variable is it that you don't understand? Here's a minimalistic spark-env.sh of mine: export SPARK_MASTER_IP=192.168.10.28 export HADOOP_CONF_DIR=/home/akhld/sigmoid/localcluster/hadoop/conf export HADOOP_HOME=/home/akhld/sigmoid/localcluster/hadoop/ Thanks Best Regards On Fri, Jan 23,

RE: spark 1.1.0 save data to hdfs failed

2015-01-23 Thread ey-chih chow
After I changed the dependency to the following: <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>1.2.1</version> <exclusions>

Re: processing large dataset

2015-01-23 Thread Sean Owen
This is kind of a how-long-is-a-piece-of-string question. There is no one tuning for 'terabytes of data'. You can easily run a Spark job that processes hundreds of terabytes with no problem with the defaults -- something trivial like counting. You can create Spark jobs that will never complete -- trying

Re: save a histogram to a file

2015-01-23 Thread madhu phatak
Hi, The histogram method returns normal Scala types, not an RDD, so you will not have saveAsTextFile. You can use the makeRDD method to make an RDD out of the data and saveAsObjectFile: val hist = a.histogram(10) val histRDD = sc.makeRDD(hist) histRDD.saveAsObjectFile(path) On Fri, Jan 23, 2015 at 5:37 AM, SK

Re: spark 1.1.0 save data to hdfs failed

2015-01-23 Thread Sean Owen
So, you should not depend on Hadoop artifacts unless you use them directly. You should mark Hadoop and Spark deps as provided. Then the cluster's version is used at runtime with spark-submit. That's the usual way to do it, which works. If you need to embed Spark in your app and are running it
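A minimal build.sbt sketch of that setup (artifact versions are illustrative for the 1.1/1.2 era discussed here):

  // Spark and Hadoop are supplied by the cluster at runtime via spark-submit,
  // so they are marked "provided" and stay out of the assembly jar.
  libraryDependencies ++= Seq(
    "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided",
    "org.apache.hadoop"  % "hadoop-client" % "2.4.0" % "provided"
  )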

Re: Got java.lang.SecurityException: class javax.servlet.FilterRegistration's when running job from intellij Idea

2015-01-23 Thread Marco
Hi, I have exactly the same issue. I've tried to mark the libraries as 'provided', but then the IntelliJ IDE seems to have deleted the libraries locally, that is, I am not able to build/run the stuff in the IDE. Is the issue resolved? I'm not very experienced in SBT. I've tried to exclude the

Re: Got java.lang.SecurityException: class javax.servlet.FilterRegistration's when running job from intellij Idea

2015-01-23 Thread Sean Owen
Use mvn dependency:tree or sbt dependency-tree to print all of the dependencies. You are probably bringing in more servlet API libs from some other source? On Fri, Jan 23, 2015 at 10:57 AM, Marco marco@gmail.com wrote: Hi, I've exactly the same issue. I've tried to mark the libraries as

java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext

2015-01-23 Thread Nishant Patel
Below is code I have written. I am getting a NotSerializableException. How can I handle this scenario? kafkaStream.foreachRDD(rdd => { println() rdd.foreachPartition(partitionOfRecords => { partitionOfRecords.foreach( record => { //Write for CSV.

Re: save a histogram to a file

2015-01-23 Thread Sean Owen
As you can see, the result of histogram() is a pair of arrays, since of course it's small. It's not necessary and in fact is huge overkill to make it back into an RDD so you can save it across a bunch of partitions. This isn't a job for Spark, but simple Scala code. Off the top of my head (maybe
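A minimal sketch of that plain-Scala approach, assuming a is the RDD[Double] from the original question (the output path and tab-separated format are illustrative):

  import java.io.PrintWriter

  // histogram(10) returns (bucketBoundaries, counts): 11 boundaries and 10 counts,
  // all small driver-side arrays, so ordinary file I/O is enough.
  val (buckets, counts) = a.histogram(10)
  val out = new PrintWriter("/tmp/histogram.txt") // illustrative local path
  try {
    // Pairs each bucket's lower boundary with its count (the final upper boundary is dropped).
    buckets.zip(counts).foreach { case (lower, count) => out.println(s"$lower\t$count") }
  } finally {
    out.close()
  }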

Re: reducing number of output files

2015-01-23 Thread Sean Owen
It does not necessarily shuffle, yes. I believe it will not if you are strictly reducing the number of partitions, and do not force a shuffle. So I think the answer is 'yes'. If you have a huge number of small files, you can also consider wholeTextFiles, which gives you entire files of content in
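A minimal sketch of both suggestions, assuming sc is the SparkContext and the paths are illustrative:

  // Strictly reducing the partition count with shuffle = false merges
  // partitions without a shuffle, yielding fewer output files.
  val data = sc.textFile("hdfs:///input/path")
  data.coalesce(16, shuffle = false).saveAsTextFile("hdfs:///output/fewer-files")

  // For many small input files: read whole files as (fileName, content) pairs.
  val files = sc.wholeTextFiles("hdfs:///input/small-files")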

Re: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext

2015-01-23 Thread Sean Owen
Heh, this question keeps coming up. You can't use a context or RDD inside a distributed operation, only from the driver. Here you're trying to call textFile from within foreachPartition. On Fri, Jan 23, 2015 at 10:59 AM, Nishant Patel nishant.k.pa...@gmail.com wrote: Below is code I have
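A minimal sketch of the pattern being described, assuming kafkaStream is the DStream from the code above: keep SparkContext/StreamingContext calls (like textFile) on the driver, and create only serializable or per-partition resources inside foreachPartition (the local CSV sink is purely illustrative):

  import java.io.PrintWriter
  import java.util.UUID

  kafkaStream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      // Created on the executor, once per partition; never reference ssc, sc,
      // or other RDDs from inside this block.
      val writer = new PrintWriter(s"/tmp/out-${UUID.randomUUID}.csv") // illustrative sink
      try {
        partitionOfRecords.foreach(record => writer.println(record.toString))
      } finally {
        writer.close()
      }
    }
  }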

RE: spark 1.1.0 save data to hdfs failed

2015-01-23 Thread ey-chih chow
I looked into the source code of SparkHadoopMapReduceUtil.scala. I think it is broken in the following code: def newTaskAttemptContext(conf: Configuration, attemptId: TaskAttemptID): TaskAttemptContext = { val klass = firstAvailableClass(

Can't access nested types with sql

2015-01-23 Thread matthes
I am trying to work with nested Parquet data. Reading and writing the Parquet file actually works now, but when I try to query a nested field with SQLContext I get an exception: RuntimeException: Can't access nested field in type ArrayType(StructType(List(StructField(... I generate the Parquet file

GroupBy multiple attributes

2015-01-23 Thread Boromir Widas
Hello, I am trying to do a groupBy on 5 attributes to get results in a form like a pivot table in Microsoft Excel. The keys are the attribute tuples and the values are double arrays (maybe very large). Based on the code below, I am getting back correct results, but would like to optimize it further (I
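Without the original code, here is a minimal sketch of one common shape for this, assuming the five attributes form a tuple key and the values are equal-length Array[Double]; reduceByKey combines values map-side, which is usually cheaper than a plain groupBy when the arrays are large:

  // Hypothetical rows: ((attr1, ..., attr5), Array[Double])
  val rows = sc.parallelize(Seq(
    (("a", "b", "c", "d", "e"), Array(1.0, 2.0)),
    (("a", "b", "c", "d", "e"), Array(3.0, 4.0))
  ))

  // Element-wise sum per composite key, like summing a measure in a pivot table.
  val pivot = rows.reduceByKey { (x, y) =>
    x.zip(y).map { case (l, r) => l + r }
  }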

RE: spark 1.1.0 save data to hdfs failed

2015-01-23 Thread ey-chih chow
I also think the code is not robust enough. First, Spark works with hadoop1, so why does the code try hadoop2 first? Also, the following code only handles ClassNotFoundException; it should handle all exceptions. private def firstAvailableClass(first: String, second: String): Class[_] = { try

Re: How to make spark partition sticky, i.e. stay with node?

2015-01-23 Thread mingyu
I found a workaround. I can make my auxiliary data an RDD, partition it, and cache it. Later, I can cogroup it with other RDDs, and Spark will try to keep the cached RDD partitions where they are and not shuffle them. -- View this message in context:
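A minimal sketch of that workaround, with hypothetical data and partition counts: the auxiliary RDD is hash-partitioned and cached once, and later cogroups reuse its partitioner, so Spark prefers to schedule tasks where the cached partitions already live:

  import org.apache.spark.HashPartitioner

  // Hypothetical auxiliary lookup data, partitioned once and cached.
  val aux = sc.parallelize(Seq((1, "meta-a"), (2, "meta-b")))
    .partitionBy(new HashPartitioner(32))
    .cache()
  aux.count() // materialize the cached partitions

  // A later RDD keyed the same way; cogroup picks up aux's partitioner,
  // so the cached aux partitions are not shuffled.
  val batch = sc.parallelize(Seq((1, 10.0), (2, 20.0)))
  val joined = aux.cogroup(batch)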