Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Michael Armbrust
I think this is likely something that we'll want to do during the code generation phase. Though it's probably not the lowest-hanging fruit at this point. On Sun, May 31, 2015 at 5:02 AM, Reynold Xin wrote: > I think you are looking for > http://en.wikipedia.org/wiki/Common_subexpression_elimination

Re: problem with using mapPartitions

2015-05-30 Thread unioah
I solved the problem. It was caused by using the spark-core_2.11 Maven artifact. When I compiled with spark-core_2.10, the problem didn't show up again. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/problem-with-using-mapPartitions-tp12514p12523.html
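A minimal sbt sketch of the fix described above, assuming an application compiled against Scala 2.10; the version number is illustrative, not taken from the thread:

    // build.sbt sketch (hypothetical project): the Scala suffix of the Spark
    // artifact must match the Scala version used to compile the application.
    scalaVersion := "2.10.4"

    // %% appends the Scala binary suffix, yielding spark-core_2.10 here;
    // depending on spark-core_2.11 while compiling against 2.10 leads to the
    // kind of runtime mismatch described in this thread.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"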

please use SparkFunSuite instead of ScalaTest's FunSuite from now on

2015-05-30 Thread Reynold Xin
FYI, we merged a patch that improves unit test log debugging. In order for that to work, all test suites have been changed to extend SparkFunSuite instead of ScalaTest's FunSuite. We also added a rule in the Scala style checker to fail Jenkins if FunSuite is used. The patch that introduced SparkFunSuite
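For illustration, a minimal sketch of what a suite looks like after the change; the package and suite names are hypothetical, and since SparkFunSuite is private[spark] this only applies to suites inside the Spark source tree:

    package org.apache.spark.examples

    // Use Spark's own base suite instead of org.scalatest.FunSuite, so the
    // improved test-log output applies and the style checker does not fail.
    import org.apache.spark.SparkFunSuite

    class MyFeatureSuite extends SparkFunSuite {
      test("a simple check") {
        assert(Seq(1, 2, 3).sum === 6)
      }
    }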

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-05-30 Thread Krishna Sankar
+1 (non-binding, of course) 1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 17:07 min. mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests 2. Tested pyspark, MLlib - running as well as comparing results with 1.3.1 2.1. statistics (min, max, mean, Pearson, Spearman

Re: problem with using mapPartitions

2015-05-30 Thread unioah
Thank you for your reply, but the typo is not the reason for the problem. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/problem-with-using-mapPartitions-tp12514p12520.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Using UDFs in Java without registration

2015-05-30 Thread Reynold Xin
We added all the TypeTags for arguments but haven't gotten around to using them yet. I think it'd make sense to have them and do the auto cast, but we can have rules in analysis to forbid certain casts (e.g. don't auto cast double to int). On Sat, May 30, 2015 at 7:12 AM, Justin Uang wrote: > The idea

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Reynold Xin
I think you are looking for http://en.wikipedia.org/wiki/Common_subexpression_elimination in the optimizer. One thing to note is that as we do more and more optimization like this, the optimization time might increase. Do you see a case where this can bring you substantial performance gains? On

Re: Dataframe's .drop in PySpark doesn't accept Column

2015-05-30 Thread Reynold Xin
Name resolution is not that easy, I think. Wenchen can maybe give you some advice on resolution for this one. On Sat, May 30, 2015 at 9:37 AM, Yijie Shen wrote: > I think just matching the Column’s expr as an UnresolvedAttribute and using the > UnresolvedAttribute’s name to match the schema’s field name is enough

Re: problem with using mapPartitions

2015-05-30 Thread Ted Yu
bq. val result = fDB.mappartitions(testMP).collect Not sure if you pasted the above code as-is - there was a typo: the method name should be mapPartitions. Cheers On Sat, May 30, 2015 at 9:44 AM, unioah wrote: > Hi, > > I try to aggregate the values in each partition internally. > For example: > > Before

Re: ClosureCleaner slowing down Spark SQL queries

2015-05-30 Thread Nitin Goyal
Thanks Josh and Yin. Created the following JIRA for it: https://issues.apache.org/jira/browse/SPARK-7970 Thanks -Nitin -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tp12466p12515.html Sent from the

problem with using mapPartitions

2015-05-30 Thread unioah
Hi, I try to aggregate the values in each partition internally. For example:
Before:
  worker 1: 1, 2, 1
  worker 2: 2, 1, 2
After:
  worker 1: (1->2), (2->1)
  worker 2: (1->1), (2->2)
I try to use mappartitions: object MyTest {
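A minimal sketch of that kind of per-partition aggregation (hypothetical object and variable names, not the poster's MyTest code): mapPartitions hands each partition's elements to the function as an iterator, so the counts can be built locally on each worker.

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCountSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-count-sketch"))

        // Two partitions, holding (1, 2, 1) and (2, 1, 2) as in the example above.
        val data = sc.parallelize(Seq(1, 2, 1, 2, 1, 2), numSlices = 2)

        // Count occurrences of each value locally within each partition and
        // emit that partition's (value -> count) pairs.
        val perPartitionCounts = data.mapPartitions { iter =>
          val counts = scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)
          iter.foreach(v => counts(v) += 1)
          counts.iterator
        }

        perPartitionCounts.collect().foreach(println)   // e.g. (1,2), (2,1), (1,1), (2,2)
        sc.stop()
      }
    }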

Re: Dataframe's .drop in PySpark doesn't accept Column

2015-05-30 Thread Yijie Shen
I think just matching the Column’s expr as an UnresolvedAttribute and using the UnresolvedAttribute’s name to match the schema’s field name is enough. There seems to be no need to treat expr as a more general expression. :) On May 30, 2015 at 11:14:05 PM, Girardot Olivier (o.girar...@lateral-thoughts.com) wrote: Jira done : https://issues.apache.org/jira/browse/SPARK-7969
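To make the suggestion concrete, a rough sketch (not the eventual SPARK-7969 change) of a drop that pattern-matches on UnresolvedAttribute; it is written as if it lived in Spark's own org.apache.spark.sql package so that Column.expr is visible, and the helper name and error handling are assumptions:

    package org.apache.spark.sql

    import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

    // Hypothetical helper: if the Column simply wraps an unresolved attribute,
    // reuse the existing name-based drop; anything fancier is rejected here.
    object DropColumnSketch {
      def drop(df: DataFrame, col: Column): DataFrame = col.expr match {
        case u: UnresolvedAttribute => df.drop(u.name)
        case other => throw new AnalysisException(
          s"drop() expects a plain column reference, got: $other")
      }
    }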

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Justin Uang
On second thought, perhaps this can be done by writing a rule that builds the DAG of dependencies between expressions, then converts it into several layers of projections, where each new layer is allowed to depend on expression results from previous projections? Are there any pitfalls to this approach
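To make the intended plan shape concrete, a small sketch using plain DataFrame operations rather than an actual analyzer/optimizer rule; the column names follow the example elsewhere in this thread and the helper name is hypothetical. The first projection computes y once, and the second projection only refers to the output of the first:

    import org.apache.spark.sql.DataFrame

    // Hypothetical illustration of "layers of projections": df is assumed to
    // already contain a numeric column "x".
    def layeredProjections(df: DataFrame): DataFrame = {
      // Layer 1: keep x and compute y exactly once.
      val layer1 = df.select(df("x"), (df("x") * 7).as("y"))
      // Layer 2: z refers to the y produced by layer 1 instead of re-inlining
      // (x * 7), so the shared subexpression is evaluated only once.
      layer1.select(layer1("x"), layer1("y"), (layer1("y") * 3).as("z"))
    }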

Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Justin Uang
If I do the following:
    df2 = df.withColumn('y', df['x'] * 7)
    df3 = df2.withColumn('z', df2.y * 3)
    df3.explain()
Then the result is
> Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
> PhysicalRDD [date#56,id#57,timestamp#58,x

Re: Dataframe's .drop in PySpark doesn't accept Column

2015-05-30 Thread Olivier Girardot
Jira done : https://issues.apache.org/jira/browse/SPARK-7969 I've already started working on it, but it's less trivial than it seems because I don't exactly know the inner workings of the catalog, or how to get the qualified name of a column to match it against the schema/catalog. Regards, Olivier

Re: Using UDFs in Java without registration

2015-05-30 Thread Justin Uang
The idea of asking for both the argument and return class is interesting. I don't think we do that for the Scala APIs currently, right? In functions.scala, we only use the TypeTag for RT:
    def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
      UserDefinedFunction(f,

Sidebar: issues targeted for 1.4.0

2015-05-30 Thread Sean Owen
No 1.4.0 Blockers at this point, which is great. Forking this thread to discuss something else. There are 92 issues targeted for 1.4.0, 28 of which are marked Critical. Many are procedural issues like "update docs for 1.4" or "check X for 1.4". Are these resolved? They sound like things that are d

Re: StreamingContextSuite fails with NoSuchMethodError

2015-05-30 Thread Ted Yu
I downloaded the source tarball and ran a command similar to the following: clean package -DskipTests. Then I ran the following command. FYI > On May 30, 2015, at 12:42 AM, Tathagata Das wrote: > > Was it a clean compilation? > > TD > >> On Fri, May 29, 2015 at 10:48 PM, Ted Yu wrote: >

Re: Dataframe's .drop in PySpark doesn't accept Column

2015-05-30 Thread Reynold Xin
Yeah, it would be great to support a Column. Can you create a JIRA, and possibly a pull request? On Fri, May 29, 2015 at 2:45 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Actually, the Scala API too is only based on column name > > On Fri, May 29, 2015 at 11:23, Olivier Girardot < >

Re: StreamingContextSuite fails with NoSuchMethodError

2015-05-30 Thread Tathagata Das
Was it a clean compilation? TD On Fri, May 29, 2015 at 10:48 PM, Ted Yu wrote: > Hi, > I ran the following command on 1.4.0 RC3: > > mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive package > > I saw the following failure: > > StreamingContextSuite: > - from no conf co

Re: Using UDFs in Java without registration

2015-05-30 Thread Reynold Xin
I think you are right that there is no way to call a Java UDF without registration right now. Adding another 20 methods to functions would be scary. Maybe the best way is to have a companion object for UserDefinedFunction, and define the UDF there? e.g. object UserDefinedFunction { def define(f: org
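A rough sketch of that companion-object idea (not Spark's actual API): it wraps one of the existing Java UDF interfaces and takes the return DataType explicitly, since the Java function types carry no TypeTags. The object name and signature are assumptions, and it is written inside org.apache.spark.sql so that UserDefinedFunction can be constructed directly:

    package org.apache.spark.sql

    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.types.DataType

    // Hypothetical sketch only: let Java callers build a UserDefinedFunction
    // directly, without registering it by name first.
    object JavaUdfSketch {
      def define[A, R](f: UDF1[A, R], returnType: DataType): UserDefinedFunction =
        UserDefinedFunction((a: A) => f.call(a), returnType)
    }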