Re: [SQL][PYTHON] UDF improvements.

2017-01-10 Thread Maciej Szymkiewicz
Thanks for your response, Ryan. Here you are:
https://issues.apache.org/jira/browse/SPARK-19159


On 01/09/2017 07:30 PM, Ryan Blue wrote:
> Maciej, this looks great.
>
> Could you open a JIRA issue for improving the @udf decorator and
> possibly sub-tasks for the specific features from the gist? Thanks!
>
> rb
>
> On Sat, Jan 7, 2017 at 12:39 PM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com> wrote:
>
> Hi,
>
> I've been looking at the PySpark UserDefinedFunction and I have a
> couple of suggestions for how it could be improved, including:
>
>   * Full featured decorator syntax.
>   * Docstring handling improvements.
>   * Lazy initialization.
>
> I summarized all the suggestions, with links to possible solutions, in a
> gist (https://gist.github.com/zero323/88953975361dbb6afd639b35368a97b4)
> and I'll be happy to open a JIRA and submit a PR if there is any
> interest in that.
>
> -- 
> Best,
> Maciej
>
>
>
>
> -- 
> Ryan Blue
> Software Engineer
> Netflix

-- 
Best,
Maciej



Re: Tests failing with GC limit exceeded

2017-01-10 Thread shane knapp
quick update:

things are looking slightly...  better.  the number of failing builds
due to GC overhead has decreased slightly since the reboots last
week...  in fact, in the last three days the only builds to be
affected are spark-master-test-maven-hadoop-2.7 (three failures) and
spark-master-test-maven-hadoop-2.6 (five failures).

overall percentages (over two weeks) have also dropped from ~9% to
~7%, so at least the rate of failure is dropping.

so, while we're still bleeding, it's slowed down a bit.  we'll
still need to audit the java heap size allocs in the various tests,
however.

shane

On Fri, Jan 6, 2017 at 1:06 PM, shane knapp  wrote:
> (adding michael armbrust and josh rosen for visibility)
>
> ok.  roughly 9% of all spark test builds (including both PRB builds)
> are failing due to GC overhead limits.
>
> $ wc -l SPARK_TEST_BUILDS GC_FAIL
>  1350 SPARK_TEST_BUILDS
>   125 GC_FAIL
>
> here are the affected builds (over the past ~2 weeks):
> $ sort builds.raw | uniq -c
>   6 NewSparkPullRequestBuilder
>   1 spark-branch-2.0-test-sbt-hadoop-2.6
>   6 spark-branch-2.1-test-maven-hadoop-2.7
>   1 spark-master-test-maven-hadoop-2.4
>  10 spark-master-test-maven-hadoop-2.6
>  12 spark-master-test-maven-hadoop-2.7
>   5 spark-master-test-sbt-hadoop-2.2
>  15 spark-master-test-sbt-hadoop-2.3
>  11 spark-master-test-sbt-hadoop-2.4
>  16 spark-master-test-sbt-hadoop-2.6
>  22 spark-master-test-sbt-hadoop-2.7
>  20 SparkPullRequestBuilder
>
> please note i also included the spark 1.6 test builds in there just to
> check...  they last ran ~1 month ago, and had no GC overhead failures.
> this leads me to believe that this behavior is quite recent.
>
> so yeah...  looks like we (someone other than me?) need to take a
> look at the sbt and maven java opts.  :)
>
> shane




Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-10 Thread Andy Dang
Thanks. It appears that TypedImperativeAggregate won't be available till
2.2.x. I'm stuck with my RDD approach then :(

---
Regards,
Andy

On Tue, Jan 10, 2017 at 2:01 AM, Liang-Chi Hsieh  wrote:

>
> Hi Andy,
>
> Hash-based aggregate uses unsafe rows as aggregation states, so the
> aggregation buffer schema must use types that are mutable in an unsafe row.
>
> If you can implement your aggregation function with TypedImperativeAggregate,
> Spark SQL has ObjectHashAggregateExec, which supports hash-based aggregation
> using arbitrary JVM objects as aggregation states.
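
Not part of Liang-Chi's message: for contrast with the array-of-rows buffer
discussed further down, here is a minimal sketch of a UDAF whose aggregation
buffer uses only UnsafeRow-mutable types (a single LongType), which lets the
planner keep it in a hash-based aggregate. The class name and logic are
illustrative only.

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// The buffer schema holds a single LongType field, a mutable fixed-width
// type in UnsafeRow, so the planner can use HashAggregateExec. A buffer
// containing an ArrayType (as in the UDAF quoted below) fails that check
// and falls back to SortAggregate before ObjectHashAggregateExec exists.
class LongSumUDAF extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = new StructType().add("value", LongType)
  override def bufferSchema: StructType = new StructType().add("sum", LongType)
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}

Registered via spark.udf.register("longSum", new LongSumUDAF) and used in a
groupBy(...).agg(...), explain() should show HashAggregate for this buffer
schema, whereas the array-typed buffer produces the SortAggregate plan quoted
further down.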
>
>
>
> Andy Dang wrote
> > Hi Takeshi,
> >
> > Thanks for the answer. My UDAF aggregates data into an array of rows.
> >
> > Apparently this makes it ineligible for hash-based aggregation, based on
> > the logic at:
> > https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java#L74
> > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L108
> >
> > The list of supported data types is VERY limited, unfortunately.
> >
> > It doesn't make sense to me that the data types must be mutable for the
> > UDAF to use hash-based aggregation, but I could be missing something
> > here :). I could achieve hash-based aggregation by rewriting this query
> > with RDDs, but that is counter-intuitive IMO.
> >
> > ---
> > Regards,
> > Andy
> >
> > On Mon, Jan 9, 2017 at 2:05 PM, Takeshi Yamamuro <linguin.m.s@...> wrote:
> >
> >> Hi,
> >>
> >> Spark always uses hash-based aggregates if the types of the aggregated
> >> data are supported there; otherwise, it falls back to sort-based
> >> aggregates.
> >> See: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L38
> >>
> >> I'm not sure about your query, but it seems the types of the aggregated
> >> data in your query are not supported by hash-based aggregates.
> >>
> >> // maropu
> >>
> >>
> >>
> >> On Mon, Jan 9, 2017 at 10:52 PM, Andy Dang <namd88@...> wrote:
> >>
> >>> Hi all,
> >>>
> >>> It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
> >>> which is very inefficient for certain aggregations:
> >>>
> >>> The code is very simple:
> >>> - I have a UDAF
> >>> - What I want to do is: dataset.groupBy(cols).agg(udaf).count()
> >>>
> >>> The physical plan I got was:
> >>> *HashAggregate(keys=[], functions=[count(1)], output=[count#67L])
> >>> +- Exchange SinglePartition
> >>>+- *HashAggregate(keys=[], functions=[partial_count(1)],
> >>> output=[count#71L])
> >>>   +- *Project
> >>>  +- Generate explode(internal_col#31), false, false,
> >>> [internal_col#42]
> >>> +- SortAggregate(key=[key#0],
> >>> functions=[aggregatefunction(key#0,
> >>> nested#1, nestedArray#2, nestedObjectArray#3, value#4L,
> >>> com.[...]uDf@108b121f, 0, 0)], output=[internal_col#31])
> >>>+- *Sort [key#0 ASC], false, 0
> >>>   +- Exchange hashpartitioning(key#0, 200)
> >>>  +- SortAggregate(key=[key#0],
> >>> functions=[partial_aggregatefunction(key#0, nested#1, nestedArray#2,
> >>> nestedObjectArray#3, value#4L, com.[...]uDf@108b121f, 0, 0)],
> >>> output=[key#0,internal_col#37])
> >>> +- *Sort [key#0 ASC], false, 0
> >>>    +- Scan ExistingRDD[key#0,nested#1,nestedArray#2,nestedObjectArray#3,value#4L]
> >>>
> >>> How can I make Spark use HashAggregate (like the count(*) expression)
> >>> instead of SortAggregate with my UDAF?
> >>>
> >>> Is it intentional? Is there an issue tracking this?
> >>>
> >>> ---
> >>> Regards,
> >>> Andy
> >>>
> >>
> >>
> >>
> >> --
> >> ---
> >> Takeshi Yamamuro
> >>
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/


Re: [SQL][CodeGen] Is there a way to set break point and debug the generated code?

2017-01-10 Thread Reynold Xin
It's unfortunately difficult to debug -- that's one downside of codegen.
You can dump all the code via "explain codegen" though. That's typically
enough for me to debug.
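
Not part of Reynold's message: a minimal sketch of two ways to dump the
generated code from a Spark 2.x shell, assuming `spark` is the active
SparkSession; the query itself is just a placeholder.

import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Datasets

// SQL-level: EXPLAIN CODEGEN prints every generated class for the query.
spark.sql("EXPLAIN CODEGEN SELECT id, id * 2 AS doubled FROM range(1000)")
  .collect()
  .foreach(row => println(row.getString(0)))

// DataFrame-level: dump the whole-stage-codegen output for an existing plan.
val df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")
df.debugCodegen()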


On Tue, Jan 10, 2017 at 3:21 AM, dragonly  wrote:

> I have recently been hacking on Spark SQL, trying to add some new UDTs and
> functions, as well as some new Expression classes, and I ran into a problem
> with the return type of the nullSafeEval method. In one of the new Expression
> classes I want to return an array of my UDT, so my code is like `return
> new GenericArrayData(Array[udt](the array))`, and the dataType of the new
> Expression class is like `ArrayType(new MyUDT(), containsNull = false)`.
> In the end I get a Java object type conversion error.
>
> So I tried to debug the code and see where the conversion happened,
> only to find that, after some generated code executed, I stepped into the
> GenericArrayData.getAs[T](ordinal: Int) method, and the ordinal is always
> 0. So here's the problem: Spark SQL is getting the 0th element out of the
> GenericArrayData and treating it as a MyUDT, but I told it to treat the
> output of the Expression class as an ArrayType of MyUDT.
>
> It's unclear to me where this ordinal variable comes from and why it is
> always 0. Is there a way to debug into the generated code?
>
> PS: just reading the code generation part without jumping back and forth is
> really not cool :/
>
>
>


Re: Spark performance tests

2017-01-10 Thread Prasun Ratn
Thanks Adam, Kazuaki!

On Tue, Jan 10, 2017 at 3:28 PM, Adam Roberts  wrote:
> Hi, I suggest HiBench and SparkSqlPerf. HiBench features many benchmarks
> that exercise several components of Spark (great for stressing core, SQL,
> and MLlib capabilities), while SparkSqlPerf features 99 TPC-DS queries
> (stressing the DataFrame API and therefore the Catalyst optimiser). Both
> work well with Spark 2.
>
> HiBench: https://github.com/intel-hadoop/HiBench
> SparkSqlPerf: https://github.com/databricks/spark-sql-perf
>
>
>
>
> From:"Kazuaki Ishizaki" 
> To:Prasun Ratn 
> Cc:Apache Spark Dev 
> Date:10/01/2017 09:22
> Subject:Re: Spark performance tests
> 
>
>
>
> Hi,
> You may find several micro-benchmarks under
> https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark.
>
> Regards,
> Kazuaki Ishizaki
>
>
>
> From: Prasun Ratn
> To: Apache Spark Dev
> Date: 2017/01/10 12:52
> Subject: Spark performance tests
> 
>
>
>
> Hi
>
> Are there performance tests or microbenchmarks for Spark - especially
> directed towards the CPU-specific parts? I looked at spark-perf but
> that doesn't seem to have been updated recently.
>
> Thanks
> Prasun
>



[SQL][CodeGen] Is there a way to set break point and debug the generated code?

2017-01-10 Thread dragonly
I have recently been hacking on Spark SQL, trying to add some new UDTs and
functions, as well as some new Expression classes, and I ran into a problem
with the return type of the nullSafeEval method. In one of the new Expression
classes I want to return an array of my UDT, so my code is like `return
new GenericArrayData(Array[udt](the array))`, and the dataType of the new
Expression class is like `ArrayType(new MyUDT(), containsNull = false)`.
In the end I get a Java object type conversion error.

So I tried to debug the code and see where the conversion happened,
only to find that, after some generated code executed, I stepped into the
GenericArrayData.getAs[T](ordinal: Int) method, and the ordinal is always
0. So here's the problem: Spark SQL is getting the 0th element out of the
GenericArrayData and treating it as a MyUDT, but I told it to treat the
output of the Expression class as an ArrayType of MyUDT.

It's unclear to me where this ordinal variable comes from and why it is
always 0. Is there a way to debug into the generated code?

PS: just reading the code generation part without jumping back and forth is
really not cool :/






Re: Spark performance tests

2017-01-10 Thread Adam Roberts
Hi, I suggest HiBench and SparkSqlPerf. HiBench features many benchmarks
that exercise several components of Spark (great for stressing core, SQL,
and MLlib capabilities), while SparkSqlPerf features 99 TPC-DS queries
(stressing the DataFrame API and therefore the Catalyst optimiser). Both
work well with Spark 2.

HiBench: https://github.com/intel-hadoop/HiBench
SparkSqlPerf: https://github.com/databricks/spark-sql-perf




From:   "Kazuaki Ishizaki" 
To: Prasun Ratn 
Cc: Apache Spark Dev 
Date:   10/01/2017 09:22
Subject: Re: Spark performance tests



Hi,
You may find several micro-benchmarks under 
https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark
.

Regards,
Kazuaki Ishizaki



From: Prasun Ratn
To: Apache Spark Dev
Date: 2017/01/10 12:52
Subject: Spark performance tests



Hi

Are there performance tests or microbenchmarks for Spark - especially
directed towards the CPU-specific parts? I looked at spark-perf but
that doesn't seem to have been updated recently.

Thanks
Prasun



Re: Spark performance tests

2017-01-10 Thread Kazuaki Ishizaki
Hi,
You may find several micro-benchmarks under 
https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark
.

Regards,
Kazuaki Ishizaki
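
Not part of Kazuaki's message: for readers who want to time a specific code
path themselves, here is a minimal, self-contained sketch in the spirit of
those in-tree micro-benchmark suites. It uses plain wall-clock timing rather
than Spark's internal Benchmark helper, and every name in it is illustrative.

import org.apache.spark.sql.SparkSession

object SumBenchmarkSketch {
  // Time a block of code and print the elapsed wall-clock time.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label%-40s $elapsedMs%10.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sum-benchmark-sketch")
      .getOrCreate()

    val n = 50L * 1000 * 1000

    // Warm up the JVM and fill the code cache before measuring.
    spark.range(n).selectExpr("sum(id)").collect()

    time("DataFrame sum (whole-stage codegen)") {
      spark.range(n).selectExpr("sum(id)").collect()
    }
    time("RDD sum") {
      spark.sparkContext.range(0L, n).sum()
    }

    spark.stop()
  }
}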



From:   Prasun Ratn 
To: Apache Spark Dev 
Date:   2017/01/10 12:52
Subject: Spark performance tests



Hi

Are there performance tests or microbenchmarks for Spark - especially
directed towards the CPU-specific parts? I looked at spark-perf but
that doesn't seem to have been updated recently.

Thanks
Prasun
