Re: Dataset API agg question

2016-06-07 Thread Reynold Xin
Take a look at the implementation of typed sum/avg:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scalalang/typed.scala

You can implement a typed max/min.
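
For reference, a minimal sketch of what such a typed max could look like, modeled
on the linked sum/avg implementations (TypedMax and the helper object below are
illustrative names, not existing Spark API, and the sketch is restricted to Long
for brevity):

import org.apache.spark.sql.{Encoder, Encoders, TypedColumn}
import org.apache.spark.sql.expressions.Aggregator

// Typed aggregator computing the max of a Long extracted from each input record.
class TypedMax[IN](val f: IN => Long) extends Aggregator[IN, Long, Long] {
  override def zero: Long = Long.MinValue
  override def reduce(b: Long, a: IN): Long = math.max(b, f(a))
  override def merge(b1: Long, b2: Long): Long = math.max(b1, b2)
  override def finish(reduction: Long): Long = reduction
  override def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  override def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object typedMax {
  def max[IN](f: IN => Long): TypedColumn[IN, Long] = new TypedMax(f).toColumn
}

// Usage sketch: ds.groupByKey(_._1).agg(typedMax.max[(Int, Int)](_._2.toLong))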


On Tue, Jun 7, 2016 at 4:31 PM, Alexander Pivovarov 
wrote:

> Ted, it does not work like that.
>
> you have to .map(toAB).toDS
>
> On Tue, Jun 7, 2016 at 4:07 PM, Ted Yu  wrote:
>
>> Have you tried the following?
>>
>> Seq(1->2, 1->5, 3->6).toDS("a", "b")
>>
>> Then you can refer to the columns by name.
>>
>> FYI
>>
>>
>> On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov wrote:
>>
>>> I'm trying to switch from RDD API to Dataset API
>>> My question is about reduceByKey method
>>>
>>> e.g. in the following example I'm trying to rewrite
>>>
>>> sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)
>>>
>>> using DS API. That is what I have so far:
>>>
>>> Seq(1->2, 1->5, 
>>> 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)
>>>
>>> Questions:
>>>
>>> 1. Is it possible to avoid typing "as(ExpressionEncoder[Int])", or to
>>> replace it with something shorter?
>>>
>>> 2. Why do I have to use a String column name in the max function, e.g. $"_2"
>>> or col("_2")? Can I use _._2 instead?
>>>
>>>
>>> Alex
>>>
>>
>>
>


Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
Please go ahead.

On Tue, Jun 7, 2016 at 4:45 PM, franklyn 
wrote:

> Thanks for reproducing it, Ted. Should I file a JIRA issue?


Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread franklyn
Thanks for reproducing it, Ted. Should I file a JIRA issue?






Re: Dataset API agg question

2016-06-07 Thread Alexander Pivovarov
Ted, it does not work like that.

you have to .map(toAB).toDS
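
A hedged sketch of that "map to a case class, then toDS" approach, so the columns
get real names (AB, a and b are illustrative names, and a local SparkSession is
assumed):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

case class AB(a: Int, b: Int)

val spark = SparkSession.builder.master("local[*]").appName("named-columns").getOrCreate()
import spark.implicits._

// Map each tuple to the case class so the Dataset has named columns a and b.
val ds = Seq(1 -> 2, 1 -> 5, 3 -> 6).map { case (a, b) => AB(a, b) }.toDS()
ds.groupBy($"a").agg(max($"b")).show()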

On Tue, Jun 7, 2016 at 4:07 PM, Ted Yu  wrote:

> Have you tried the following?
>
> Seq(1->2, 1->5, 3->6).toDS("a", "b")
>
> Then you can refer to the columns by name.
>
> FYI
>
>
> On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov 
> wrote:
>
>> I'm trying to switch from RDD API to Dataset API
>> My question is about reduceByKey method
>>
>> e.g. in the following example I'm trying to rewrite
>>
>> sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)
>>
>> using DS API. That is what I have so far:
>>
>> Seq(1->2, 1->5, 
>> 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)
>>
>> Questions:
>>
>> 1. Is it possible to avoid typing "as(ExpressionEncoder[Int])", or to replace
>> it with something shorter?
>>
>> 2. Why do I have to use a String column name in the max function, e.g. $"_2"
>> or col("_2")? Can I use _._2 instead?
>>
>>
>> Alex
>>
>
>


Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
I built with Scala 2.10

>>> df.select(add_one(df.a).alias('incremented')).collect()

The above just hung.

On Tue, Jun 7, 2016 at 3:31 PM, franklyn 
wrote:

> Thanks, Ted!
>
> I'm using
>
> https://github.com/apache/spark/commit/8f5a04b6299e3a47aca13cbb40e72344c0114860
> and building with scala-2.10
>
> I can confirm that it works with scala-2.11
>


Re: Dataset API agg question

2016-06-07 Thread Ted Yu
Have you tried the following?

Seq(1->2, 1->5, 3->6).toDS("a", "b")

Then you can refer to the columns by name.

FYI


On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov 
wrote:

> I'm trying to switch from RDD API to Dataset API
> My question is about reduceByKey method
>
> e.g. in the following example I'm trying to rewrite
>
> sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)
>
> using DS API. That is what I have so far:
>
> Seq(1->2, 1->5, 
> 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)
>
> Questions:
>
> 1. Is it possible to avoid typing "as(ExpressionEncoder[Int])", or to replace
> it with something shorter?
>
> 2. Why do I have to use a String column name in the max function, e.g. $"_2"
> or col("_2")? Can I use _._2 instead?
>
>
> Alex
>


Dataset API agg question

2016-06-07 Thread Alexander Pivovarov
I'm trying to switch from RDD API to Dataset API
My question is about reduceByKey method

e.g. in the following example I'm trying to rewrite

sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)

using DS API. That is what I have so far:

Seq(1->2, 1->5, 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)

Questions:

1. Is it possible to avoid typing "as(ExpressionEncoder[Int])", or to replace
it with something shorter?

2. Why do I have to use a String column name in the max function, e.g. $"_2"
or col("_2")? Can I use _._2 instead?


Alex
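
A hedged sketch of one way to express the RDD-style reduceByKey(math.max) with
the typed Dataset API, assuming a local SparkSession (the names below are
illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("agg-example").getOrCreate()
import spark.implicits._

val ds = Seq(1 -> 2, 1 -> 5, 3 -> 6).toDS()

// Group by the key and keep, per group, the pair with the largest value;
// reduceGroups returns (key, pair), so map(_._2) drops the duplicated key.
val maxPerKey = ds
  .groupByKey(_._1)
  .reduceGroups((a, b) => if (a._2 >= b._2) a else b)
  .map(_._2)

maxPerKey.show()   // expected rows: (1,5) and (3,6)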


Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread franklyn
Thanks, Ted!

I'm using
https://github.com/apache/spark/commit/8f5a04b6299e3a47aca13cbb40e72344c0114860
and building with scala-2.10

I can confirm that it works with scala-2.11






Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Ted Yu
With commit 200f01c8fb15680b5630fbd122d44f9b1d096e02 using Scala 2.11:

Using Python version 2.7.9 (default, Apr 29 2016 10:48:06)
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import IntegerType, StructField, StructType
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import Row
>>> spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
>>> add_one = udf(lambda x: x + 1, IntegerType())
>>> schema = StructType([StructField('a', IntegerType(), False)])
>>> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
>>> df.select(add_one(df.a).alias('incremented')).collect()
[Row(incremented=2), Row(incremented=3)]

Let me build with Scala 2.10 and try again.

On Tue, Jun 7, 2016 at 2:47 PM, Franklyn D'souza <
franklyn.dso...@shopify.com> wrote:

> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
>>
>>
>> ./dev/change-version-to-2.10.sh
>> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5
>> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
>
>
> and then ran the following code in a pyspark shell
>
> from pyspark.sql import SparkSession
>> from pyspark.sql.types import IntegerType, StructField, StructType
>> from pyspark.sql.functions import udf
>> from pyspark.sql.types import Row
>> spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
>> add_one = udf(lambda x: x + 1, IntegerType())
>> schema = StructType([StructField('a', IntegerType(), False)])
>> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
>> df.select(add_one(df.a).alias('incremented')).collect()
>
>
> This never returns a result.
>
>
>


Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Franklyn D'souza
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
>
>
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive


and then ran the following code in a pyspark shell

from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()


This never returns a result.


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-07 Thread Shivaram Venkataraman
As far as I know the process is just to copy docs/_site from the build
to the appropriate location in the SVN repo (i.e.
site/docs/2.0.0-preview).

Thanks
Shivaram

On Tue, Jun 7, 2016 at 8:14 AM, Sean Owen  wrote:
> As a stop-gap, I can edit that page to have a small section about
> preview releases and point to the nightly docs.
>
> Not sure who has the power to push 2.0.0-preview to site/docs, but, if
> that's done then we can symlink "preview" in that dir to it and be
> done, and update this section about preview docs accordingly.
>
> On Tue, Jun 7, 2016 at 4:10 PM, Tom Graves  wrote:
>> Thanks Sean, you were right, hard refresh made it show up.
>>
>> Seems like we should at least link to the preview docs from
>> http://spark.apache.org/documentation.html.
>>
>> Tom
>>
>>
>> On Tuesday, June 7, 2016 10:04 AM, Sean Owen  wrote:
>>
>>
>> It's there (refresh maybe?). See the end of the downloads dropdown.
>>
>> For the moment you can see the docs in the nightly docs build:
>> https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/
>>
>> I don't know, what's the best way to put this into the main site?
>> under a /preview root? I am not sure how that process works.
>>
>> On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves  wrote:
>>> I just checked and I don't see the 2.0 preview release at all anymore on
>>> http://spark.apache.org/downloads.html. Is it in transition? The only
>>> place I can see it is at
>>> http://spark.apache.org/news/spark-2.0.0-preview.html
>>>
>>>
>>> I would like to see docs there too.  My opinion is it should be as easy to
>>> use/try out as any other spark release.
>>>
>>> Tom

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Standalone Cluster Mode: how does spark allocate spark.executor.cores?

2016-06-07 Thread ElfoLiNk
Hi,
I'm trying to find where and how Spark allocates cores per executor in the
source code.
Is it possible to programmatically control the allocated cores in standalone
cluster mode?

Regards,
Matteo
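
A hedged sketch of the configuration-level control available from application
code in standalone mode (the allocation itself is decided by the standalone
Master, in org.apache.spark.deploy.master.Master); the master URL and values
below are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
  .setAppName("core-allocation-example")
  .set("spark.executor.cores", "2")        // cores requested per executor
  .set("spark.cores.max", "8")             // total cores the application may take

val sc = new SparkContext(conf)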






Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-07 Thread Tom Graves
Thanks Sean, you were right, hard refresh made it show up.
Seems like we should at least link to the preview docs from 
http://spark.apache.org/documentation.html.
Tom 

On Tuesday, June 7, 2016 10:04 AM, Sean Owen  wrote:
 

 It's there (refresh maybe?). See the end of the downloads dropdown.

For the moment you can see the docs in the nightly docs build:
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/

I don't know, what's the best way to put this into the main site?
under a /preview root? I am not sure how that process works.

On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves  wrote:
> I just checked and I don't see the 2.0 preview release at all anymore on
> http://spark.apache.org/downloads.html. Is it in transition? The only
> place I can see it is at
> http://spark.apache.org/news/spark-2.0.0-preview.html
>
>
> I would like to see docs there too.  My opinion is it should be as easy to
> use/try out as any other spark release.
>
> Tom
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



  

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-07 Thread Sean Owen
As a stop-gap, I can edit that page to have a small section about
preview releases and point to the nightly docs.

Not sure who has the power to push 2.0.0-preview to site/docs, but, if
that's done then we can symlink "preview" in that dir to it and be
done, and update this section about preview docs accordingly.

On Tue, Jun 7, 2016 at 4:10 PM, Tom Graves  wrote:
> Thanks Sean, you were right, hard refresh made it show up.
>
> Seems like we should at least link to the preview docs from
> http://spark.apache.org/documentation.html.
>
> Tom
>
>
> On Tuesday, June 7, 2016 10:04 AM, Sean Owen  wrote:
>
>
> It's there (refresh maybe?). See the end of the downloads dropdown.
>
> For the moment you can see the docs in the nightly docs build:
> https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/
>
> I don't know, what's the best way to put this into the main site?
> under a /preview root? I am not sure how that process works.
>
> On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves  wrote:
>> I just checked and I don't see the 2.0 preview release at all anymore on
>> http://spark.apache.org/downloads.html. Is it in transition? The only
>> place I can see it is at
>> http://spark.apache.org/news/spark-2.0.0-preview.html
>>
>>
>> I would like to see docs there too.  My opinion is it should be as easy to
>> use/try out as any other spark release.
>>
>> Tom



Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-07 Thread Sean Owen
It's there (refresh maybe?). See the end of the downloads dropdown.

For the moment you can see the docs in the nightly docs build:
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/

I don't know, what's the best way to put this into the main site?
under a /preview root? I am not sure how that process works.

On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves  wrote:
> I just checked and I don't see the 2.0 preview release at all anymore on
> http://spark.apache.org/downloads.html. Is it in transition? The only
> place I can see it is at
> http://spark.apache.org/news/spark-2.0.0-preview.html
>
>
> I would like to see docs there too.  My opinion is it should be as easy to
> use/try out as any other spark release.
>
> Tom
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Welcoming Yanbo Liang as a committer

2016-06-07 Thread Xiangrui Meng
Congrats!!

On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali 
wrote:

> Congratulations Yanbo Liang! Well deserved.
>
>
> On Sun, Jun 5, 2016 at 7:10 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Congrats, Yanbo!
>>
>> On Sun, Jun 5, 2016 at 6:25 PM, Liwei Lin  wrote:
>>
>>> Congratulations Yanbo!
>>>
>>> On Mon, Jun 6, 2016 at 7:07 AM, Bryan Cutler  wrote:
>>>
 Congratulations Yanbo!
 On Jun 5, 2016 4:03 AM, "Kousuke Saruta" 
 wrote:

> Congratulations Yanbo!
>
>
> - Kousuke
>
> On 2016/06/04 11:48, Matei Zaharia wrote:
>
>> Hi all,
>>
>> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has
>> been a super active contributor in many areas of MLlib. Please join me in
>> welcoming Yanbo!
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>>>
>>
>


streaming JobScheduler and error handling confusing behavior

2016-06-07 Thread Krot Viacheslav
Hi,
I don't know if it is a bug or a feature, but one thing in streaming error
handling seems confusing to me: I create a streaming context, start it, and call
#awaitTermination like this:

try {
  ssc.awaitTermination();
} catch (Exception e) {
  LoggerFactory.getLogger(getClass()).error("Job failed. Stopping JVM", e);
  System.exit(-1);
}

I expect the JVM to be terminated as soon as any job fails, and no more jobs
to be started. But actually this is not true: before the exception is caught,
another job starts.
This is caused by the design of the JobScheduler event loop:

private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable => reportError("Error in job scheduler", e)
  }
}

If an error happens, it calls handleError, which wakes up a lock in ContextWaiter
and notifies my main thread. But meanwhile the next job starts, and sometimes
that is enough time for it to complete! I have several jobs in each batch and
want each of them to run if and only if the previous one completed successfully.

From an API user's point of view this behavior is confusing, and you cannot
guess how it works without looking into the source code.

What do you think about adding another Spark configuration parameter,
'stopOnError', that stops the streaming context if an error happens and does
not allow the next job to run?
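
Until something like that exists, a hedged workaround sketch (in Scala; the
helper name and the surrounding ssc/stream values are illustrative, not from
this thread): stop the context from a separate thread as soon as an output
operation fails, so that no further batches are scheduled.

import org.apache.spark.streaming.StreamingContext

def stopOnError(ssc: StreamingContext, error: Throwable): Nothing = {
  // Stop asynchronously: calling stop() directly from a running job can block.
  new Thread("streaming-stop-on-error") {
    override def run(): Unit = ssc.stop(stopSparkContext = false, stopGracefully = false)
  }.start()
  throw error   // still fail the current job so awaitTermination surfaces the error
}

// Usage inside each output operation (driver side):
// stream.foreachRDD { rdd =>
//   try { /* per-batch work */ } catch { case e: Exception => stopOnError(ssc, e) }
// }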