Re: Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
Yes, I need to call the external service for every event and the order does not matter. There's no time limit within which each event should be processed. I can't tell the producer to slow down or drop events. Of course I could put a message broker in between, like an AMQP or JMS broker, but I was thin

Re: Contribution to Spark MLLib

2014-06-18 Thread Xiangrui Meng
Denis, I think it is fine to have PLSA in MLlib. But I'm not familiar with the modification you mentioned since the paper is new. We may need to spend more time to learn the trade-offs. Feel free to create a JIRA for PLSA and we can move our discussion there. It would be great if you can share your

Re: Contribution to Spark MLLib

2014-06-18 Thread Jayati
Hello Xiangrui, I am looking at the Spark Issues, but just wanted to know if it is mandatory for me to work on existing JIRAs before I can contribute to MLLib. Regards, Jayati -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Contribution-to-Spark-MLLib-t

python worker crash in spark 1.0

2014-06-18 Thread Schein, Sagi
Hi, I am trying to upgrade from Spark v0.9.1 to v1.0.0 and running into some weird behavior. When, in pyspark, I invoke sc.textFile("hdfs://hadoop-ha01:/user/x/events_2.1").take(1) the call crashes with the below stack trace. The file resides in Hadoop 2.2, it is a large event data,

Re: Best practices for removing lineage of a RDD or Graph object?

2014-06-18 Thread dash
Hi Roy, Thanks for your help. I wrote a small code snippet that reproduces the problem. Could you help me read through it and see if I did anything wrong? Thanks! def main(args: Array[String]) { val conf = new SparkConf().setAppName("TEST") .setMaster("local[4]") .set("s

Re: Best practices for removing lineage of a RDD or Graph object?

2014-06-18 Thread Andrea Esposito
Not sure if it helps, but: checkpoint cuts the lineage. The checkpoint method only sets a flag. In order to actually perform the checkpoint you must NOT materialise the RDD before it has been flagged, otherwise the flag is just ignored. rdd2 = rdd1.map(..) rdd2.checkpoint() rdd2.count rdd2.isCheckpoi
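
For reference, a minimal sketch of the ordering Andrea describes (flag the RDD with checkpoint() before the first action materialises it); the checkpoint directory, app name and data are illustrative, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo").setMaster("local[2]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints")   // any local/HDFS directory works

    val rdd1 = sc.parallelize(1 to 1000)
    val rdd2 = rdd1.map(_ * 2)

    rdd2.checkpoint()              // flag BEFORE any action touches rdd2
    rdd2.count()                   // the action triggers the actual checkpoint
    println(rdd2.isCheckpointed)   // true; the lineage back to rdd1 is now cut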

Re: Un-serializable 3rd-party classes (Spark, Java)

2014-06-18 Thread Daedalus
Kryo did the job. Thanks! On Wed, Jun 18, 2014 at 10:44 AM, Matei Zaharia [via Apache Spark User List] wrote: > There are a few options: > > - Kryo might be able to serialize these objects out of the box, depending > what’s inside them. Try turning it on as described at > http://spark.apache.or
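
For anyone landing here later, a minimal sketch of turning Kryo on as in the docs Matei pointed to; ThirdPartyThing is a made-up stand-in for whatever non-serializable classes you have:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Stand-in for a third-party class that is not java.io.Serializable
    class ThirdPartyThing(val payload: String)

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[ThirdPartyThing])   // registration is optional but shrinks output
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)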

options set in spark-env.sh is not reflecting on actual execution

2014-06-18 Thread MEETHU MATHEW
Hi all, I have a doubt regarding the options in spark-env.sh. I set the following values in the file on the master and 2 workers: SPARK_WORKER_MEMORY=7g SPARK_EXECUTOR_MEMORY=6g SPARK_DAEMON_JAVA_OPTS+="-Dspark.akka.timeout=30 -Dspark.akka.frameSize=1 -Dspark.blockManagerHeartBeatMs=80

Fwd: BSP realization on Spark

2014-06-18 Thread Ghousia
-- Forwarded message -- From: Ghousia Date: Wed, Jun 18, 2014 at 5:41 PM Subject: BSP realization on Spark To: user@spark.apache.org Hi, We are trying to implement a BSP model in Spark with the help of GraphX. One thing I encountered is the Pregel operator in the Graph class. But what

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-18 Thread Bharath Ravi Kumar
Thanks. I'll await the fix to re-run my test. On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng wrote: > Hi Bharath, > > This is related to SPARK-1112, which we already found the root cause. > I will let you know when this is fixed. > > Best, > Xiangrui > > On Tue, Jun 17, 2014 at 7:37 PM, Bharath

Re: Execution stalls in LogisticRegressionWithSGD

2014-06-18 Thread Xiangrui Meng
Hi Bharath, This is related to SPARK-1112, which we already found the root cause. I will let you know when this is fixed. Best, Xiangrui On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar wrote: > Couple more points: > 1)The inexplicable stalling of execution with large feature sets appears >

Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
Flavio - I'm new to Spark as well, but I've done stream processing using other frameworks. My comments below are not Spark Streaming specific. Maybe someone who knows more can provide better insights. I read your post on my phone and I believe my answer doesn't completely address the issue you have

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
This is exciting! Here is the relevant "alpha" doc for this feature, for others reading this. I'm going to try this out. Will this be released with 1.1.0? On Wed, Jun 18, 2014 at 8:31 PM, Zongheng Yang wrote: > If your inpu

Re: Trailing Tasks Saving to HDFS

2014-06-18 Thread Surendranauth Hiraman
Looks like eventually there was some type of reset or timeout and the tasks have been reassigned. I'm guessing they'll keep failing until max failure count. The machine it disconnected from was a remote machine, though I've seen such failures from connections to itself with other problems. The log

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Zongheng Yang
If your input data is JSON, you can also try out the recently merged in initial JSON support: https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916 On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas wrote: > That’s pretty neat! So I guess if you start with an RDD of objec
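
For the curious, a rough sketch of what that initial JSON support looks like, assuming the jsonRDD entry point added by that commit (the API was still alpha at the time, so names may shift):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // One JSON object per element, mirroring Nick's sample records
    val json = sc.parallelize(Seq(
      """{"country": "USA", "name": "Franklin", "age": 24, "hits": 224}""",
      """{"country": "USA", "name": "Bob", "age": 55, "hits": 108}"""))

    val people = sqlContext.jsonRDD(json)   // schema is inferred from the JSON records
    people.registerAsTable("people")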

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
That’s pretty neat! So I guess if you start with an RDD of objects, you’d first do something like RDD.map(lambda x: Record(x['field_1'], x['field_2'], ...)) in order to register it as a table, and from there run your aggregates. Very nice. On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks wrote:

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Matei Zaharia
I was going to suggest the same thing :). On Jun 18, 2014, at 4:56 PM, Evan R. Sparks wrote: > This looks like a job for SparkSQL! > > > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext._ > case class MyRecord(country: String, name: String, age: Int, hits: Long) > v

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Evan R. Sparks
This looks like a job for SparkSQL! val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class MyRecord(country: String, name: String, age: Int, hits: Long) val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234), MyRecord("USA", "Bob", 55, 108), MyRecord
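
Evan's snippet is cut off by the archive; a hedged sketch of how the rest might go (the GROUP BY query and the third record are mine, not from the original message), showing several aggregates computed in a single pass:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    case class MyRecord(country: String, name: String, age: Int, hits: Long)
    val data = sc.parallelize(Array(
      MyRecord("USA", "Franklin", 24, 234),
      MyRecord("USA", "Bob", 55, 108),
      MyRecord("France", "Julien", 42, 89)))   // extra record added for illustration

    data.registerAsTable("records")            // via the createSchemaRDD implicit
    sql("SELECT country, COUNT(*), AVG(age), SUM(hits) FROM records GROUP BY country").collect()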

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nicholas Chammas
Ah, this looks like exactly what I need! It looks like this was recently added into PySpark (and Spark Core), but it's not in the 1.0.0 release. Thank you. Nick On Wed, Jun 18, 2014 at 7:42 PM, Doris Xin wrote: > Hi Nick, > > Instead of

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Doris Xin
Hi Nick, Instead of using reduceByKey(), you might want to look into using aggregateByKey(), which allows you to return a different value type U instead of the input value type V for each input tuple (K, V). You can define U to be a datatype that holds both the average and total and have seqOp upd
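
A small sketch of the idea Doris describes, assuming you want the total and the average of hits per country; here U is a (sum, count) pair and the input pairs are made up:

    // (K, V) = (country, hits); U = (runningSum, runningCount)
    val hitsByCountry = sc.parallelize(Seq(("USA", 224L), ("USA", 108L), ("France", 89L)))

    val sumAndCount = hitsByCountry.aggregateByKey((0L, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1),        // seqOp: fold one value into U
      (a, b)   => (a._1 + b._1, a._2 + b._2))      // combOp: merge two partial Us

    val totalAndAvg = sumAndCount.mapValues { case (sum, count) => (sum, sum.toDouble / count) }
    totalAndAvg.collect()   // e.g. ("USA", (332, 166.0)), ("France", (89, 89.0))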

Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nick Chammas
The following is a simplified example of what I am trying to accomplish. Say I have an RDD of objects like this: { "country": "USA", "name": "Franklin", "age": 24, "hits": 224} { "country": "USA", "name": "Bob", "age": 55, "hits": 108} { "country": "France",

Trailing Tasks Saving to HDFS

2014-06-18 Thread Surendranauth Hiraman
I have a flow that ends with saveAsTextFile() to HDFS. It seems all the expected files per partition have been written out, based on the number of part files and the file sizes. But the driver logs show 2 tasks still not completed, with no activity, and the worker logs show no activity for those

create SparkContext dynamically

2014-06-18 Thread jamborta
Hi all, I am setting up a system where spark contexts would be created by a web server that would handle the computation and return the results. I have the following code (in python) os.environ['SPARK_HOME'] = "/home/spark/spark-1.0.0-bin-hadoop2/" sc = SparkContext(master="spark://ip-xx-xx-xx-xx

Re: Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
Thanks for the quick reply, Soumya. Unfortunately I'm a newbie with Spark... what do you mean? Is there any reference on how to do that? On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta wrote: > > You can add a back-pressure-enabled component in front that feeds data > into Spark. This component c

Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
You can add a back-pressure-enabled component in front that feeds data into Spark. This component can control the input rate into Spark. > On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier wrote: > > Hi to all, > in my use case I'd like to receive events and call an external service as > they pa
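
One way to sketch the bounded hand-off Soumya is suggesting (this is an illustration, not code from the thread): a custom Receiver that pulls from a bounded queue, so a producer calling put() blocks whenever Spark falls behind. The class and queue names are mine:

    import java.util.concurrent.{ArrayBlockingQueue, BlockingQueue}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class BoundedQueueReceiver(queue: BlockingQueue[String])
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK) {

      def onStart() {
        new Thread("bounded-queue-receiver") {
          override def run() {
            // Pull events off the bounded queue and hand them to Spark Streaming.
            while (!isStopped) {
              store(queue.take())
            }
          }
        }.start()
      }

      def onStop() { /* nothing to clean up in this sketch */ }
    }

    // Hypothetical usage: the producer calls queue.put(event) and blocks once the queue is full.
    // val queue  = new ArrayBlockingQueue[String](1000)
    // val events = ssc.receiverStream(new BoundedQueueReceiver(queue))

Note this only works when the producer lives in the same JVM as the receiver (e.g. local mode, or when the receiver itself polls the source); otherwise a broker in between, as Flavio mentions, is the usual answer.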

Re: Issue while trying to aggregate with a sliding window

2014-06-18 Thread Hatch M
OK, that patch does fix the key lookup exception. However, I'm curious about the time validity check, isValidTime ( https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala#L264 ). Why does (time - zeroTime) have to be a multiple of the slide dura

Spark streaming and rate limit

2014-06-18 Thread Flavio Pompermaier
Hi to all, in my use case I'd like to receive events and call an external service as they pass through. Is it possible to limit the number of concurrent calls to that service (to avoid DoS) using Spark Streaming? If so, limiting the rate implies possible buffer growth... how can I control the

java.lang.OutOfMemoryError with saveAsTextFile

2014-06-18 Thread Muttineni, Vinay
Hi, I have a 5 million record, 300 column data set. I am running a spark job in yarn-cluster mode, with the following args --driver-memory 11G --executor-memory 11G --executor-cores 16 --num-executors 500 The spark job replaces all categorical variables with some integers. I am getting the below

Re: Spark is now available via Homebrew

2014-06-18 Thread Nicholas Chammas
From the user perspective, I don't think it's a big deal either way. It looks like contributors to Homebrew are pretty on top of keeping things updated. For the users, Apache managing this would mostly mean a) a shorter period of time from release to brew availability, and b) the brew installatio

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Andrew Ash
Wait, so the file only has four lines and the job is running out of heap space? Can you share the code you're running that does the processing? I'd guess that you're doing some intense processing on every line, but just writing parsed case classes back to disk sounds very lightweight. I On Wed, Ju

Re: Spark is now available via Homebrew

2014-06-18 Thread Andrew Ash
What's the advantage of Apache maintaining the brew installer vs users? Apache handling it means more work on this dev team, but probably a better experience for brew users. Just wanted to weigh pros/cons before committing to support this installation method. Andrew On Wed, Jun 18, 2014 at 5:2

Re: Spark is now available via Homebrew

2014-06-18 Thread Nicholas Chammas
Matei, You might want to comment on that issue Sheryl linked to, or perhaps this one, to ask about how Apache can manage this going forward. I know that mikemcquaid is very active on the Homebrew repo and is one of

Re: Spark is now available via Homebrew

2014-06-18 Thread Nicholas Chammas
Agreed, it would be better if Apache controlled or managed this directly. I think making such a change is just a matter of opening a new issue on the Homebrew issue tracker. I believe that's how Spark made it in there in the first place--it was jus

Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Shivani Rao
I am trying to process a file that contains 4 log lines (not very long) and then write my parsed out case classes to a destination folder, and I get the following error: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java

Re: Spark is now available via Homebrew

2014-06-18 Thread Sheryl John
Cool. I looked at the pull requests; the upgrade to 1.0.0 was just merged yesterday. https://github.com/Homebrew/homebrew/pull/30231 https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb On Wed, Jun 18, 2014 at 1:57 PM, Matei Zaharia wrote: > Interesting, does anyone k

Re: No Intercept for Python

2014-06-18 Thread Naftali Harris
Thanks Reza! :-D Naftali On Wed, Jun 18, 2014 at 1:47 PM, Reza Zadeh wrote: > Hi Naftali, > > Yes you're right. For now please add a column of ones. We are working on > adding a weighted regularization term, and exposing the scala intercept > option in the python binding. > > Best, > Reza > >

Re: Spark is now available via Homebrew

2014-06-18 Thread Matei Zaharia
Interesting, does anyone know the people over there who set it up? It would be good if Apache itself could publish packages there, though I’m not sure what’s involved. Since Spark just depends on Java and Python it should be easy for us to update. Matei On Jun 18, 2014, at 1:37 PM, Nick Chamma

Re: No Intercept for Python

2014-06-18 Thread Reza Zadeh
Hi Naftali, Yes you're right. For now please add a column of ones. We are working on adding a weighted regularization term, and exposing the scala intercept option in the python binding. Best, Reza On Mon, Jun 16, 2014 at 12:19 PM, Naftali Harris wrote: > Hi everyone, > > The Python LogisticR
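
A quick sketch of that workaround in Scala terms (the PySpark version is analogous): prepend a constant 1.0 feature so its learned weight plays the role of the intercept. The training data here is invented:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val raw = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(2.0, 3.0)),
      LabeledPoint(0.0, Vectors.dense(0.5, 1.5))))

    // Add the column of ones: each feature vector gains a leading constant 1.0
    val withIntercept = raw.map { p =>
      LabeledPoint(p.label, Vectors.dense(Array(1.0) ++ p.features.toArray))
    }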

Spark is now available via Homebrew

2014-06-18 Thread Nick Chammas
OS X / Homebrew users, It looks like you can now download Spark simply by doing: brew install apache-spark I’m new to Homebrew, so I’m not too sure how people are intended to use this. I’m guessing this would just be a convenient way to get the latest release onto your workstation, and from ther

Re: Unit test failure: Address already in use

2014-06-18 Thread Philip Ogren
In my unit tests I have a base class that all my tests extend; it has setup and teardown methods that they inherit. They look something like this: var spark: SparkContext = _ @Before def setUp() { Thread.sleep(100L) //this seems to give spark more time to reset from the

RE: Unit test failure: Address already in use

2014-06-18 Thread Lisonbee, Todd
Disabling parallelExecution has worked for me. Other alternatives I’ve tried that also work include: 1. Using a lock – this will let tests execute in parallel except for those using a SparkContext. If you have a large number of tests that could execute in parallel, this can shave off some tim
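
For reference, the sbt setting being discussed looks roughly like this (in build.sbt, or the equivalent key in a Build.scala):

    // Run test suites sequentially so only one local SparkContext exists at a time
    parallelExecution in Test := false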

Re: question about setting SPARK_CLASSPATH IN spark_env.sh

2014-06-18 Thread santhoma
by the way, any idea how to sync the spark config dir with other nodes in the cluster? ~santhosh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/question-about-setting-SPARK-CLASSPATH-IN-spark-env-sh-tp7809p7853.html Sent from the Apache Spark User List mai

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Forgot to mention that I am using spark-submit to submit jobs, and a verbose-mode printout looks like this with the SparkPi example. The .sparkStaging won't be deleted. My thinking is that this should be part of the staging and should be cleaned up as well when sc gets terminated. [tes

HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Hi All, Has anyone run into the same problem? By looking at the source code in the official release (rc11), this property setting is false by default; however, I'm seeing the .sparkStaging folder remain on HDFS, causing it to fill up the disk pretty fast since SparkContext deploys th

Re: Wildcard support in input path

2014-06-18 Thread Nicholas Chammas
I wonder if that’s the problem. Is there an equivalent hadoop fs -ls command you can run that returns the same files you want but doesn’t have that month= string? On Wed, Jun 18, 2014 at 12:25 PM, Jianshi Huang wrote: > Hi Nicholas, > > month= is for Hive to auto discover the partitions. It's

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
Hi Nicholas, month= is for Hive to auto discover the partitions. It's part of the url of my files. Jianshi On Wed, Jun 18, 2014 at 11:52 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Is that month= syntax something special, or do your files actually have > that string as part of

Re: Wildcard support in input path

2014-06-18 Thread Nicholas Chammas
Is that month= syntax something special, or do your files actually have that string as part of their name? On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang wrote: > Hi all, > > Thanks for the reply. I'm using parquetFile as input, is that a problem? > In hadoop fs -ls, the path (hdfs://domain/u

Re: rdd.cache() is not faster?

2014-06-18 Thread Gaurav Jain
" if I do have big data (40GB, cached size is 60GB) and even big memory (192 GB), I cannot benefit from RDD cache, and should persist on disk and leverage filesystem cache?" The answer to the question of whether to persist (spill-over) data on disk is not always immediately clear, because generall

Re: rdd.cache() is not faster?

2014-06-18 Thread Wei Tan
Hi Gaurav, thanks for your pointer. The observation in the link is (at least qualitatively) similar to mine. Now the question is: if I do have big data (40GB, cached size is 60GB) and even big memory (192 GB), is it the case that I cannot benefit from the RDD cache, and should instead persist on disk and leverage the filesystem c
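
For concreteness, the two variants being compared look like this; whether DISK_ONLY (plus the OS filesystem cache) beats in-memory caching depends on the data and the job, as Gaurav notes. The input path is a placeholder:

    import org.apache.spark.storage.StorageLevel

    val cached = sc.textFile("hdfs:///path/to/input").cache()                      // MEMORY_ONLY
    val onDisk = sc.textFile("hdfs:///path/to/input").persist(StorageLevel.DISK_ONLY)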

Re: Spark 1.0.0 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Sguj
I got rid of most of my heap errors by increasing the number of partitions of my RDDs by 8-16x. I found in the tuning page that heap space errors can be caused by a hash table that's generated during the shuffle functions, so by splitting up how
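
As an illustration of that fix (the numbers and path are arbitrary), either repartition up front or pass a larger partition count to the shuffle operation itself:

    val input = sc.textFile("hdfs:///path/to/input")

    // Option 1: repartition before the heavy shuffle stage
    val wider = input.repartition(input.partitions.size * 8)

    // Option 2: give the shuffle operation more reduce-side partitions directly
    val counts = input.map(line => (line, 1)).reduceByKey(_ + _, 512)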

Re: Contribution to Spark MLLib

2014-06-18 Thread Denis Turdakov
Hello everybody, Xiangrui, thanks for the link to the roadmap. I saw it is planned to implement LDA in MLlib 1.1. What do you think about PLSA? I understand that LDA is more popular now, but recent research shows that modifications of PLSA sometimes perform better [1]. Furthermore, the most rece

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-18 Thread Surendranauth Hiraman
Patrick, My team is using shuffle consolidation but not speculation. We are also using persist(DISK_ONLY) for caching. Here are some config changes that are in our work-in-progress. We've been trying for 2 weeks to get our production flow (maybe around 50-70 stages, a few forks and joins with up

BSP realization on Spark

2014-06-18 Thread Ghousia
Hi, We are trying to implement a BSP model in Spark with the help of GraphX. One thing I encountered is the Pregel operator in the Graph class. But what I fail to understand is how the Master and Worker need to be assigned (BSP), and how barrier synchronization would happen. The Pregel operator provide
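
For what it's worth, a minimal sketch of driving the Pregel operator (single-source shortest paths, following the standard GraphX pattern; the edge list is invented). There is no explicit master/worker assignment in user code, and the barrier between supersteps is implicit: all messages from one iteration are delivered before the next iteration's vertex programs run.

    import org.apache.spark.graphx._

    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 4.0)))
    val graph: Graph[Double, Double] =
      Graph.fromEdges(edges, defaultValue = Double.PositiveInfinity)

    val sourceId: VertexId = 1L
    val initialGraph = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = initialGraph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),                 // vertex program
      triplet =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))   // send message along edge
        else Iterator.empty,
      (a, b) => math.min(a, b))                                       // merge incoming messages

    println(sssp.vertices.collect().mkString("\n"))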

Cannot print a derived DStream after reduceByKey

2014-06-18 Thread haopu
In the test application, I create a DStream by connecting to a socket. Then I want to count the RDDs in the DStream that match another reference RDD. Below is the Java code for my application. == public class TestSparkStreaming { public static void main(String[] args) {

Re: Cannot print a derived DStream after reduceByKey

2014-06-18 Thread haopu
I guess this is a basic question about the usage of reduce. Please shed some light, thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-print-a-derived-DStream-after-reduceByKey-tp7834p7836.html Sent from the Apache Spark User List mailing lis

Re: rdd.cache() is not faster?

2014-06-18 Thread Gaurav Jain
You cannot assume that caching would always reduce the execution time, especially if the data-set is large. It appears that if too much memory is used for caching, then less memory is left for the actual computation itself. There has to be a balance between the two. Page 33 of this thesis from KT

Re: Enormous EC2 price jump makes "r3.large" patch more important

2014-06-18 Thread Jeremy Lee
Hmm.. kinda working... I'm getting a broken Apache/Ganglia at the last step, although spark-shell does run. Starting GANGLIA gmetad: [ OK ] Stopping httpd: [FAILED] Starting httpd: httpd: Syntax error on line 153 of /

Shark Tasks in parallel execution

2014-06-18 Thread majian
Hi all, I'm confused: why does a Shark window that is not executing queries still occupy resources? And while that Shark window is not closed, how can I execute multiple queries in parallel? For example: Spark config: total Nodes: 3 Spark total cores: 9 Shark config: SPARK_JAVA_OPTS+="-Dspark.scheduler.a

Re: get schema from SchemaRDD

2014-06-18 Thread Michael Armbrust
We just merged a feature into master that lets you print the schema or view it as a string (printSchema() and schemaTreeString on SchemaRDD). There is also this JIRA targeting 1.1 for presenting a nice programmatic API for this information: https://issues.apache.org/jira/browse/SPARK-2179 On Wed,
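
In the meantime, a tiny sketch of the printSchema() call Michael mentions, assuming a SchemaRDD built from a case class as in Kevin's question below (names are illustrative):

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    val people = createSchemaRDD(sc.parallelize(Seq(Person("Kevin", 30), Person("Ana", 25))))
    people.printSchema()   // pretty-prints the inferred schema (current master / 1.1)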

get schema from SchemaRDD

2014-06-18 Thread Kevin Jung
Can I get schema information from SchemaRDD? For example, *case class Person(name:String, Age:Int, Gender:String, Birth:String) val peopleRDD = sc.textFile("/sample/sample.csv").map(_.split(",")).map(p => Person(p(0).toString, p(1).toInt, p(2).toString, p(3).toString)) peopleRDD.saveAsParquetFile(

Re: Unit test failure: Address already in use

2014-06-18 Thread Anselme Vignon
Hi, Could your problem come from the fact that you run your tests in parallel? If you are running Spark in local mode, you cannot have concurrent Spark instances running. This means that your tests instantiating a SparkContext cannot be run in parallel. The easiest fix is to tell sbt to not run parallel t

Re: join operation is taking too much time

2014-06-18 Thread MEETHU MATHEW
Hi, Thanks Andrew and Daniel for the response. Setting spark.shuffle.spill to false didn't make any difference. 5 days completed in 6 min and 10 days was stuck after around 1 hr. Daniel, in my current use case I can't read all the files into a single RDD. But I have another use case where I did it i

Re: Spark Streaming Example with CDH5

2014-06-18 Thread Sean Owen
There is nothing special about CDH5 Spark in this regard. CDH 5.0.x has Spark 0.9.0, and the imminent next release will have 1.0.0 + upstream patches. You're simply accessing a class that was not present in 0.9.0, but is present after that: https://github.com/apache/spark/commits/master/core/src/