Re: Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Pramod Biligiri
Thanks for the info. I agree, it makes sense the way it is designed. Pramod On Sat, May 2, 2015 at 10:37 PM, Mridul Muralidharan wrote: > I agree, this is better handled by the filesystem cache - not to > mention, being able to do zero copy writes. > > Regards, > Mridul > > On Sat, May 2, 2015

Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-02 Thread Reynold Xin
Part of the reason is that it is really easy to just call toDF on Scala, and we already have a lot of createDataFrame functions. (You might find some of the cross-language differences confusing, but I'd argue most real users just stick to one language, and developers or trainers are the only ones
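For readers comparing the two APIs, here is a minimal sketch of the Scala idiom Reynold refers to, assuming Spark 1.3+ with the DataFrame API; the object name, app name and master URL are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ToDFSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("toDF-sketch").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val l = Seq(("Alice", 1))

        // Default column names, mirroring Python's Row(_1=u'Alice', _2=1):
        l.toDF().show()

        // Explicit names -- the Scala counterpart of
        // sqlContext.createDataFrame(l, ['name', 'age']) in Python:
        l.toDF("name", "age").show()

        sc.stop()
      }
    }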

Re: Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Mridul Muralidharan
I agree, this is better handled by the filesystem cache - not to mention, being able to do zero copy writes. Regards, Mridul On Sat, May 2, 2015 at 10:26 PM, Reynold Xin wrote: > I've personally prototyped completely in-memory shuffle for Spark 3 times. > However, it is unclear how big of a gain

Submit & Kill Spark Application program programmatically from another application

2015-05-02 Thread Yijie Shen
Hi, I’ve posted this problem in user@spark but got no reply, so I moved it to dev@spark; sorry for the duplication. I am wondering if it is possible to submit, monitor & kill Spark applications from another service. I have written a service like this: parse user commands, translate them into understan
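One possible building block for this, not taken from the thread: the SparkLauncher API added in Spark 1.3 launches spark-submit as a child process that can be waited on or destroyed. A rough sketch, with every path, the master URL and the main class as placeholders:

    import org.apache.spark.launcher.SparkLauncher

    object SubmitAndKillSketch {
      def main(args: Array[String]): Unit = {
        // All paths, the master URL and the class name below are placeholders.
        val sparkSubmit = new SparkLauncher()
          .setSparkHome("/opt/spark")
          .setAppResource("/path/to/my-app.jar")
          .setMainClass("com.example.MyApp")
          .setMaster("yarn-cluster")
          .setConf("spark.executor.memory", "2g")
          .addAppArgs("--input", "/data/in")
          .launch()                          // returns a java.lang.Process

        // "Monitoring" here means watching the child spark-submit process;
        // destroy() would kill that process (cluster-side cleanup depends
        // on the deploy mode and cluster manager).
        val exitCode = sparkSubmit.waitFor()
        println(s"spark-submit exited with code $exitCode")
      }
    }

Killing an already-running application outright typically still goes through the cluster manager (for example `yarn application -kill <appId>` on YARN); SparkLauncher only manages the launcher process itself.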

Re: Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Reynold Xin
I've personally prototyped completely in-memory shuffle for Spark 3 times. However, it is unclear how big of a gain it would be to put all of this in memory under newer file systems (ext4, xfs). If the shuffle data is small, it is still in the file system buffer cache anyway. Note that network
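One crude way to experiment with this trade-off (my suggestion, not from the thread) is to point spark.local.dir at a RAM-backed tmpfs mount such as /dev/shm, so shuffle files never reach a physical disk even before the buffer cache is considered; the paths below are assumptions that depend on the OS:

    import org.apache.spark.{SparkConf, SparkContext}

    object TmpfsShuffleSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("shuffle-on-tmpfs")
          .setMaster("local[4]")
          // /dev/shm is a RAM-backed tmpfs on most Linux distributions;
          // shuffle files written under it never hit a physical disk,
          // approximating an in-memory shuffle without touching
          // SortShuffleWriter itself.
          .set("spark.local.dir", "/dev/shm/spark-local")

        val sc = new SparkContext(conf)
        val groups = sc.parallelize(1 to 1000000)
          .map(i => (i % 100, 1))
          .reduceByKey(_ + _)   // forces a shuffle write
          .count()
        println(s"distinct keys: $groups")
        sc.stop()
      }
    }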

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Mridul Muralidharan
Hi Shane, Since we are still maintaining support for jdk6, jenkins should be using jdk6 [1] to ensure we do not inadvertently use jdk7 or higher api, which breaks source-level compat. -source and -target are insufficient to ensure api usage is conformant with the minimum jdk version we are support
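To make the point concrete (an illustration of mine, not from the thread): the snippet below uses java.nio.file, which only exists in the JDK 7+ class library. Compiling it with -source/-target 1.6 on a JDK 7 build machine succeeds, but running it on a Java 6 JRE fails with NoClassDefFoundError, which is exactly what building or testing on the minimum JDK would catch. The file path is illustrative.

    import java.nio.file.{Files, Paths}

    object NeedsJdk7ClassLibrary {
      def main(args: Array[String]): Unit = {
        // java.nio.file.* was introduced in JDK 7. -source/-target 1.6
        // only restricts language features and bytecode version, not
        // which class-library APIs are referenced, so this compiles
        // cleanly against a JDK 7 classpath.
        val bytes = Files.readAllBytes(Paths.get("/etc/hostname"))
        println(new String(bytes, "UTF-8"))
      }
    }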

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Koert Kuipers
i think i might be misunderstanding, but shouldn't java 6 currently be used in jenkins? On Sat, May 2, 2015 at 11:53 PM, shane knapp wrote: > that's kinda what we're doing right now, java 7 is the default/standard on > our jenkins. > > or, i vote we buy a butler's outfit for thomas and have a sec

Re: [discuss] ending support for Java 6?

2015-05-02 Thread shane knapp
that's kinda what we're doing right now, java 7 is the default/standard on our jenkins. or, i vote we buy a butler's outfit for thomas and have a second jenkins instance... ;) On Sat, May 2, 2015 at 1:09 PM, Mridul Muralidharan wrote: > We could build on minimum jdk we support for testing pr's

Re: What is the location in the source code of the computation of the elements in a map transformation?

2015-05-02 Thread Patrick Wendell
Maybe I can help a bit. What happens when you call .map(my func) is that you create a MapPartitionsRDD that has a reference to that closure in its compute() function. When a job is run (jobs are run as the result of RDD actions): https://github.com/apache/spark/blob/master/core/src/main/scala/org
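A deliberately simplified toy model of the mechanism Patrick describes, not the real Spark code (the real classes live under core/src/main/scala/org/apache/spark/rdd): map() only records the closure inside a new RDD, and the closure runs when a task finally asks that RDD to compute a partition.

    // Toy model only; names echo RDD/MapPartitionsRDD but none of this
    // is the actual Spark implementation.
    abstract class ToyRDD[T] {
      def compute(partition: Int): Iterator[T]

      // map() just captures the closure in a new ToyRDD; nothing runs yet.
      def map[U](f: T => U): ToyRDD[U] = {
        val parent = this
        new ToyRDD[U] {
          // The closure is applied lazily, element by element, when a
          // task pulls this partition's iterator.
          def compute(partition: Int): Iterator[U] =
            parent.compute(partition).map(f)
        }
      }
    }

    class ToySourceRDD(data: Seq[Seq[Int]]) extends ToyRDD[Int] {
      def compute(partition: Int): Iterator[Int] = data(partition).iterator
    }

    object ToyRDDDemo extends App {
      val source = new ToySourceRDD(Seq(Seq(1, 2, 3), Seq(4, 5)))
      val mapped = source.map(_ * 10).map(_ + 1) // closures captured, nothing computed
      println(mapped.compute(0).toList)          // List(11, 21, 31) -- work happens here
    }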

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Ted Yu
+1 On Sat, May 2, 2015 at 1:09 PM, Mridul Muralidharan wrote: > We could build on minimum jdk we support for testing pr's - which will > automatically cause build failures in case code uses newer api ? > > Regards, > Mridul > > On Fri, May 1, 2015 at 2:46 PM, Reynold Xin wrote: > > It's really

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Mridul Muralidharan
We could build on the minimum jdk we support for testing pr's - which would automatically cause build failures in case code uses a newer api? Regards, Mridul On Fri, May 1, 2015 at 2:46 PM, Reynold Xin wrote: > It's really hard to inspect API calls since none of us have the Java > standard library in

Re: Pandas' Shift in Dataframe

2015-05-02 Thread Olivier Girardot
To close this thread: rxin created a broader JIRA to handle window functions in DataFrames: https://issues.apache.org/jira/browse/SPARK-7322 Thanks everyone. On Wed, Apr 29, 2015 at 22:51, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > To give you a broader idea of the current use
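For context, a sketch of the pandas-shift-style operation that SPARK-7322 enables via lag() over a window (window functions landed in Spark 1.4 and, in the 1.x line, require a HiveContext); the column names and sample data are made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    object ShiftViaLagSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shift-via-lag").setMaster("local[2]"))
        // DataFrame window functions in Spark 1.x need a HiveContext.
        val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
        import sqlContext.implicits._

        val df = Seq((1, 10.0), (2, 12.5), (3, 11.0)).toDF("time", "value")

        // lag("value", 1) over an ordering mimics pandas' shift(1):
        // each row sees the previous row's value (null on the first row).
        val w = Window.orderBy("time")
        df.withColumn("value_prev", lag("value", 1).over(w)).show()

        sc.stop()
      }
    }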

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Reynold Xin
It's really hard to inspect API calls since none of us have the Java standard library in our brain. The only way we can enforce this is to have it in Jenkins, and Tom you are currently our mini-Jenkins server :) Joking aside, looks like we should support Java 6 in 1.4, and in the release notes inc

What is the location in the source code of the computation of the elements in a map transformation?

2015-05-02 Thread Tom Hubregtsen
I am trying to understand the data and computation flow in Spark, and I believe I fairly understand the shuffle (both map and reduce side), but I do not get what happens to the computation from the map stages. I know all maps get pipelined onto the shuffle (when there is no other action in bet

createDataFrame allows column names as second param in Python not in Scala

2015-05-02 Thread Olivier Girardot
Hi everyone, SQLContext.createDataFrame behaves differently in Python and Scala. In Python:

    >>> l = [('Alice', 1)]
    >>> sqlContext.createDataFrame(l).collect()
    [Row(_1=u'Alice', _2=1)]
    >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
    [Row(name=u'Alice', age=1)]

and in Scala:

    scala> val data

Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Pramod Biligiri
Hi, I was trying to see if I can make Spark avoid hitting the disk for small jobs, but I see that the SortShuffleWriter.write() always writes to disk. I found an older thread ( http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html) saying that it doesn't call