Re: Dataset API agg question
Take a look at the implementation of typed sum/avg:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scalalang/typed.scala

You can implement a typed max/min the same way.

On Tue, Jun 7, 2016 at 4:31 PM, Alexander Pivovarov wrote:
> Ted, it does not work like that.
>
> You have to .map(toAB).toDS
>
> On Tue, Jun 7, 2016 at 4:07 PM, Ted Yu wrote:
>
>> Have you tried the following?
>>
>> Seq(1->2, 1->5, 3->6).toDS("a", "b")
>>
>> Then you can refer to columns by name.
>>
>> FYI
>>
>> On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov wrote:
>>
>>> I'm trying to switch from the RDD API to the Dataset API.
>>> My question is about the reduceByKey method.
>>>
>>> E.g. in the following example I'm trying to rewrite
>>>
>>> sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)
>>>
>>> using the DS API. This is what I have so far:
>>>
>>> Seq(1->2, 1->5, 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)
>>>
>>> Questions:
>>>
>>> 1. Is it possible to avoid typing "as(ExpressionEncoder[Int])", or to
>>> replace it with something shorter?
>>>
>>> 2. Why do I have to use a String column name in the max function (e.g.
>>> $"_2" or col("_2"))? Can I use _._2 instead?
>>>
>>> Alex
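For reference, a minimal sketch of such a typed max, modeled on the sum/avg aggregators in typed.scala and assuming the Spark 2.0 Aggregator API (the object name TypedMax and the pair-typed input are illustrative, not part of Spark):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    // Hypothetical typed max over the second element of an (Int, Int) pair.
    // zero is the identity for max, reduce folds one row into the buffer,
    // and merge combines partial results from different partitions.
    object TypedMax extends Aggregator[(Int, Int), Int, Int] {
      def zero: Int = Int.MinValue
      def reduce(buffer: Int, row: (Int, Int)): Int = math.max(buffer, row._2)
      def merge(b1: Int, b2: Int): Int = math.max(b1, b2)
      def finish(reduction: Int): Int = reduction
      def bufferEncoder: Encoder[Int] = Encoders.scalaInt
      def outputEncoder: Encoder[Int] = Encoders.scalaInt
    }

    // Usage in a typed aggregation, e.g.:
    // Seq(1->2, 1->5, 3->6).toDS.groupByKey(_._1).agg(TypedMax.toColumn).take(10)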
Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10
Please go ahead.

On Tue, Jun 7, 2016 at 4:45 PM, franklyn wrote:
> Thanks for reproducing it Ted, should I make a JIRA issue?
Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10
Thanks for reproducing it Ted, should I make a JIRA issue?
Re: Dataset API agg question
Ted, it does not work like that.

You have to .map(toAB).toDS

On Tue, Jun 7, 2016 at 4:07 PM, Ted Yu wrote:
> Have you tried the following?
>
> Seq(1->2, 1->5, 3->6).toDS("a", "b")
>
> Then you can refer to columns by name.
>
> FYI
>
> On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov wrote:
>
>> I'm trying to switch from the RDD API to the Dataset API.
>> My question is about the reduceByKey method.
>>
>> E.g. in the following example I'm trying to rewrite
>>
>> sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)
>>
>> using the DS API. This is what I have so far:
>>
>> Seq(1->2, 1->5, 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)
>>
>> Questions:
>>
>> 1. Is it possible to avoid typing "as(ExpressionEncoder[Int])", or to
>> replace it with something shorter?
>>
>> 2. Why do I have to use a String column name in the max function (e.g.
>> $"_2" or col("_2"))? Can I use _._2 instead?
>>
>> Alex
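To make the .map(toAB).toDS route concrete, here is a sketch under the assumption that toAB converts each tuple into a case class with named fields (the case class AB and toAB below are illustrative, not from this thread; spark.implicits._ is assumed to be in scope, as in a spark-shell session):

    import org.apache.spark.sql.functions.max

    case class AB(a: Int, b: Int)
    val toAB = (p: (Int, Int)) => AB(p._1, p._2)

    // The columns now carry real names, so the aggregation can refer to them:
    val ds = Seq(1 -> 2, 1 -> 5, 3 -> 6).map(toAB).toDS
    ds.groupBy($"a").agg(max($"b")).take(10)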
Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10
I built with Scala 2.10:

>>> df.select(add_one(df.a).alias('incremented')).collect()

The above just hung.

On Tue, Jun 7, 2016 at 3:31 PM, franklyn wrote:
> Thanks Ted!
>
> I'm using
> https://github.com/apache/spark/commit/8f5a04b6299e3a47aca13cbb40e72344c0114860
> and building with scala-2.10.
>
> I can confirm that it works with scala-2.11.
Re: Dataset API agg question
Have you tried the following?

Seq(1->2, 1->5, 3->6).toDS("a", "b")

Then you can refer to columns by name.

FYI

On Tue, Jun 7, 2016 at 3:58 PM, Alexander Pivovarov wrote:
> I'm trying to switch from the RDD API to the Dataset API.
> My question is about the reduceByKey method.
>
> E.g. in the following example I'm trying to rewrite
>
> sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)
>
> using the DS API. This is what I have so far:
>
> Seq(1->2, 1->5, 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)
>
> Questions:
>
> 1. Is it possible to avoid typing "as(ExpressionEncoder[Int])", or to
> replace it with something shorter?
>
> 2. Why do I have to use a String column name in the max function (e.g.
> $"_2" or col("_2"))? Can I use _._2 instead?
>
> Alex
Dataset API agg question
I'm trying to switch from the RDD API to the Dataset API.
My question is about the reduceByKey method.

E.g. in the following example I'm trying to rewrite

sc.parallelize(Seq(1->2, 1->5, 3->6)).reduceByKey(math.max).take(10)

using the DS API. This is what I have so far:

Seq(1->2, 1->5, 3->6).toDS.groupBy(_._1).agg(max($"_2").as(ExpressionEncoder[Int])).take(10)

Questions:

1. Is it possible to avoid typing "as(ExpressionEncoder[Int])", or to replace it with something shorter?

2. Why do I have to use a String column name in the max function (e.g. $"_2" or col("_2"))? Can I use _._2 instead?

Alex
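One way to stay fully typed and avoid the encoder cast entirely is to reduce within each group rather than aggregate an untyped max column. A minimal sketch, assuming the Spark 2.0 Dataset API, where the typed grouping method is named groupByKey (the preview/1.6 API spelled it groupBy), and with spark.implicits._ in scope as in a spark-shell session:

    // Typed equivalent of reduceByKey(math.max): group by the first element,
    // then keep the pair with the larger second element in each group.
    val result = Seq(1 -> 2, 1 -> 5, 3 -> 6).toDS
      .groupByKey(_._1)
      .reduceGroups((a, b) => if (a._2 >= b._2) a else b)
      .take(10)
    // result: Array((1, (1, 5)), (3, (3, 6))), i.e. (key, winning pair)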
Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10
Thanks Ted!

I'm using
https://github.com/apache/spark/commit/8f5a04b6299e3a47aca13cbb40e72344c0114860
and building with scala-2.10.

I can confirm that it works with scala-2.11.
Re: Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10
With commit 200f01c8fb15680b5630fbd122d44f9b1d096e02 using Scala 2.11:

Using Python version 2.7.9 (default, Apr 29 2016 10:48:06)
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import IntegerType, StructField, StructType
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import Row
>>> spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
>>> add_one = udf(lambda x: x + 1, IntegerType())
>>> schema = StructType([StructField('a', IntegerType(), False)])
>>> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
>>> df.select(add_one(df.a).alias('incremented')).collect()
[Row(incremented=2), Row(incremented=3)]

Let me build with Scala 2.10 and try again.

On Tue, Jun 7, 2016 at 2:47 PM, Franklyn D'souza <franklyn.dso...@shopify.com> wrote:
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following:
>
>   ./dev/change-version-to-2.10.sh
>   ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5
>       -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive
>
> and then ran the following code in a pyspark shell:
>
>   from pyspark.sql import SparkSession
>   from pyspark.sql.types import IntegerType, StructField, StructType
>   from pyspark.sql.functions import udf
>   from pyspark.sql.types import Row
>   spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
>   add_one = udf(lambda x: x + 1, IntegerType())
>   schema = StructType([StructField('a', IntegerType(), False)])
>   df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
>   df.select(add_one(df.a).alias('incremented')).collect()
>
> This never returns with a result.
Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following:

  ./dev/change-version-to-2.10.sh
  ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5
      -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive

and then ran the following code in a pyspark shell:

  from pyspark.sql import SparkSession
  from pyspark.sql.types import IntegerType, StructField, StructType
  from pyspark.sql.functions import udf
  from pyspark.sql.types import Row
  spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
  add_one = udf(lambda x: x + 1, IntegerType())
  schema = StructType([StructField('a', IntegerType(), False)])
  df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
  df.select(add_one(df.a).alias('incremented')).collect()

This never returns with a result.
Re: Spark 2.0.0-preview artifacts still not available in Maven
As far as I know the process is just to copy docs/_site from the build to the appropriate location in the SVN repo (i.e. site/docs/2.0.0-preview).

Thanks
Shivaram

On Tue, Jun 7, 2016 at 8:14 AM, Sean Owen wrote:
> As a stop-gap, I can edit that page to have a small section about
> preview releases and point to the nightly docs.
>
> Not sure who has the power to push 2.0.0-preview to site/docs, but, if
> that's done then we can symlink "preview" in that dir to it and be
> done, and update this section about preview docs accordingly.
>
> On Tue, Jun 7, 2016 at 4:10 PM, Tom Graves wrote:
>> Thanks Sean, you were right, a hard refresh made it show up.
>>
>> Seems like we should at least link to the preview docs from
>> http://spark.apache.org/documentation.html.
>>
>> Tom
>>
>> On Tuesday, June 7, 2016 10:04 AM, Sean Owen wrote:
>>
>> It's there (refresh maybe?). See the end of the downloads dropdown.
>>
>> For the moment you can see the docs in the nightly docs build:
>> https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/
>>
>> I don't know what's the best way to put this into the main site;
>> under a /preview root? I am not sure how that process works.
>>
>> On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves wrote:
>>> I just checked and I don't see the 2.0 preview release at all anymore
>>> on http://spark.apache.org/downloads.html; is it in transition? The
>>> only place I can see it is at
>>> http://spark.apache.org/news/spark-2.0.0-preview.html
>>>
>>> I would like to see docs there too. My opinion is it should be as easy
>>> to use/try out as any other Spark release.
>>>
>>> Tom
Standalone Cluster Mode: how does spark allocate spark.executor.cores?
Hi,

I'm searching for how and where Spark allocates cores per executor in the source code. Is it possible to control the allocated cores programmatically in standalone cluster mode?

Regards,
Matteo
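Not an authoritative answer, but as a starting point: in standalone mode the per-executor core count is driven by the spark.executor.cores and spark.cores.max settings, which can be set programmatically on the SparkConf before the context is created (the allocation logic itself lives in the standalone Master's scheduling code, if I recall correctly). A minimal sketch, with illustrative values and a hypothetical master URL:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // hypothetical standalone master URL
      .setAppName("core-allocation-demo")
      .set("spark.executor.cores", "2") // cores given to each executor
      .set("spark.cores.max", "8")      // total cores the application may claim

    val sc = new SparkContext(conf)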
Re: Spark 2.0.0-preview artifacts still not available in Maven
Thanks Sean, you were right, a hard refresh made it show up.

Seems like we should at least link to the preview docs from http://spark.apache.org/documentation.html.

Tom

On Tuesday, June 7, 2016 10:04 AM, Sean Owen wrote:
> It's there (refresh maybe?). See the end of the downloads dropdown.
>
> For the moment you can see the docs in the nightly docs build:
> https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/
>
> I don't know what's the best way to put this into the main site;
> under a /preview root? I am not sure how that process works.
>
> On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves wrote:
>> I just checked and I don't see the 2.0 preview release at all anymore
>> on http://spark.apache.org/downloads.html; is it in transition? The
>> only place I can see it is at
>> http://spark.apache.org/news/spark-2.0.0-preview.html
>>
>> I would like to see docs there too. My opinion is it should be as easy
>> to use/try out as any other Spark release.
>>
>> Tom
Re: Spark 2.0.0-preview artifacts still not available in Maven
As a stop-gap, I can edit that page to have a small section about preview releases and point to the nightly docs.

Not sure who has the power to push 2.0.0-preview to site/docs, but, if that's done then we can symlink "preview" in that dir to it and be done, and update this section about preview docs accordingly.

On Tue, Jun 7, 2016 at 4:10 PM, Tom Graves wrote:
> Thanks Sean, you were right, a hard refresh made it show up.
>
> Seems like we should at least link to the preview docs from
> http://spark.apache.org/documentation.html.
>
> Tom
>
> On Tuesday, June 7, 2016 10:04 AM, Sean Owen wrote:
>
> It's there (refresh maybe?). See the end of the downloads dropdown.
>
> For the moment you can see the docs in the nightly docs build:
> https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/
>
> I don't know what's the best way to put this into the main site;
> under a /preview root? I am not sure how that process works.
>
> On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves wrote:
>> I just checked and I don't see the 2.0 preview release at all anymore
>> on http://spark.apache.org/downloads.html; is it in transition? The
>> only place I can see it is at
>> http://spark.apache.org/news/spark-2.0.0-preview.html
>>
>> I would like to see docs there too. My opinion is it should be as easy
>> to use/try out as any other Spark release.
>>
>> Tom
Re: Spark 2.0.0-preview artifacts still not available in Maven
It's there (refresh maybe?). See the end of the downloads dropdown.

For the moment you can see the docs in the nightly docs build:
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/

I don't know what's the best way to put this into the main site; under a /preview root? I am not sure how that process works.

On Tue, Jun 7, 2016 at 4:01 PM, Tom Graves wrote:
> I just checked and I don't see the 2.0 preview release at all anymore
> on http://spark.apache.org/downloads.html; is it in transition? The
> only place I can see it is at
> http://spark.apache.org/news/spark-2.0.0-preview.html
>
> I would like to see docs there too. My opinion is it should be as easy
> to use/try out as any other Spark release.
>
> Tom
Re: Welcoming Yanbo Liang as a committer
Congrats!!

On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali wrote:
> Congratulations Yanbo Liang! Well deserved.
>
> On Sun, Jun 5, 2016 at 7:10 PM, Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote:
>
>> Congrats, Yanbo!
>>
>> On Sun, Jun 5, 2016 at 6:25 PM, Liwei Lin wrote:
>>
>>> Congratulations Yanbo!
>>>
>>> On Mon, Jun 6, 2016 at 7:07 AM, Bryan Cutler wrote:
>>>
>>>> Congratulations Yanbo!
>>>>
>>>> On Jun 5, 2016 4:03 AM, "Kousuke Saruta" wrote:
>>>>
>>>>> Congratulations Yanbo!
>>>>>
>>>>> - Kousuke
>>>>>
>>>>> On 2016/06/04 11:48, Matei Zaharia wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has
>>>>>> been a super active contributor in many areas of MLlib. Please join
>>>>>> me in welcoming Yanbo!
>>>>>>
>>>>>> Matei
streaming JobScheduler and error handling confusing behavior
Hi,

I don't know if it is a bug or a feature, but one thing in streaming error handling seems confusing to me. I create a streaming context, start it, and call #awaitTermination like this:

try {
  ssc.awaitTermination();
} catch (Exception e) {
  LoggerFactory.getLogger(getClass()).error("Job failed. Stopping JVM", e);
  System.exit(-1);
}

I expect the JVM to be terminated as soon as any job fails, with no more jobs started. But actually this is not true: before the exception is caught, another job starts. This is caused by the design of the JobScheduler event loop:

private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable => reportError("Error in job scheduler", e)
  }
}

If an error happens, it calls handleError, which wakes up a lock in ContextWaiter and notifies my main thread. But meanwhile the scheduler starts the next job, and sometimes there is enough time for that job to complete! I have several jobs in each batch and want each of them to run only if the previous one completed successfully. From an API user's point of view this behavior is confusing, and you cannot guess how it works without looking into the source code.

What do you think about adding another Spark configuration parameter, 'stopOnError', that stops the streaming context if an error happens and does not allow the next job to run?
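As a rough interim workaround (a sketch, not a fix for the race itself): stop the streaming context explicitly and non-gracefully from the catch block before exiting, so the scheduler is told to shut down as soon as the error surfaces. In Scala, assuming ssc is the StreamingContext from the snippet above:

    import org.slf4j.LoggerFactory

    try {
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        LoggerFactory.getLogger(getClass).error("Job failed. Stopping JVM", e)
        // Do not wait for in-flight batches to finish (stopGracefully = false),
        // which narrows (but does not fully close) the window in which the
        // scheduler can still launch the next job.
        ssc.stop(stopSparkContext = true, stopGracefully = false)
        System.exit(-1)
    }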