Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Jagadeesan As
+1 (non-binding)

Ubuntu 14.04.2/openjdk  "1.8.0_72"
(-Pyarn -Phadoop-2.7 -Psparkr -Pkinesis-asl -Phive-thriftserver)
 
Cheers,
Jagadeesan A S



From:   Reynold Xin 
To: "dev@spark.apache.org" 
Date:   27-10-16 12:49 PM
Subject:[VOTE] Release Apache Spark 2.0.2 (RC1)



Greetings from Spark Summit Europe at Brussels.

Please vote on releasing the following candidate as Apache Spark version 
2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if 
a majority of at least 3+1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.2-rc1 
(1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)

This release candidate resolves 75 issues: 
https://s.apache.org/spark-2.0.2-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1208/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions from 2.0.1.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present 
in 2.0.1, missing features, or bugs related to new features will not 
necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from 
now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC 
(i.e. RC2) is cut, I will change the fix version of those patches to 
2.0.2.






Re: SparkR issue with array types in gapply()

2016-10-27 Thread Felix Cheung
This is native R data.frame behavior.

While arr is a character vector of length = 2,
> arr
[1] "rows= 50" "cols= 2"
> length(arr)
[1] 2


when it is put into an R data.frame, the character vector is split into 2 rows


> data.frame(key, strings = arr, stringsAsFactors = F)
  key strings
1 a rows= 50
2 a cols= 2


> b <- data.frame(key, strings = arr, stringsAsFactors = F)
> sapply(b, class)
key strings
"character" "character"
> b[1,1]
[1] "a"
> b[1,2]
[1] "rows= 50"
> b[2,2]
[1] "cols= 2"


Each element therefore ends up as a separate row in the character column. This 
causes a schema mismatch when the schema is declared as 
structField('strings', 'array'): Spark then expects an array of strings per row, 
not a single string.
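A minimal Scala sketch of the same mismatch, assuming a local SparkSession (the
column names and row values are illustrative, not taken from the thread): a
column declared as ArrayType(StringType) expects each row to supply a sequence,
not a plain string.

import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[2]").appName("array-schema").getOrCreate()

// 'strings' is declared as an array column, so each Row must carry a Seq.
val schema = StructType(Seq(
  StructField("key", StringType),
  StructField("strings", ArrayType(StringType))))

// OK: the whole vector arrives as one array value in a single row.
val ok = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("a", Seq("rows= 50", "cols= 2")))), schema)
ok.show()

// Fails at runtime ("java.lang.String is not a valid external type for schema
// of array"): each row carries a plain String where a Seq is expected.
val bad = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("a", "rows= 50"), Row("a", "cols= 2"))), schema)
bad.show()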


_
From: shirisht
Sent: Tuesday, October 25, 2016 11:51 PM
Subject: SparkR issue with array types in gapply()


Hello,

I am getting an exception from catalyst when array types are used in the
return schema of gapply() function.

Following is a (made-up) example:


iris$flag = base::sample(1:2, nrow(iris), T, prob = c(0.5,0.5))
irisdf = createDataFrame(iris)

foo = function(key, x) {
nr = nrow(x)
nc = ncol(x)
arr = c( paste("rows=", nr), paste("cols=",nc) )
data.frame(key, strings = arr, stringsAsFactors = F)
}

outSchema = structType( structField('key', 'integer'),
structField('strings', 'array') )
result = SparkR::gapply(irisdf, "flag", foo, outSchema)
d = SparkR::collect(result)


This code throws up the following error:

java.lang.RuntimeException: java.lang.String is not a valid external type
for schema of array
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Any thoughts?

Thank you,
Shirish



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-issue-with-array-types-in-gapply-tp19568.html
Sent from the Apache Spark Developers List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org





Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Luciano Resende
+1 (non-binding)

On Thu, Oct 27, 2016 at 9:18 AM, Reynold Xin  wrote:

> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1 (1c2908eeb8890fdc91413a3f5bad2b
> b3d114db6c)
>
> This release candidate resolves 75 issues: https://s.apache.org/spark-2.
> 0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Denny Lee
+1 (non-binding)



On Thu, Oct 27, 2016 at 3:36 PM Ricardo Almeida <
ricardo.alme...@actnowib.com> wrote:

> +1 (non-binding)
>
> built and tested without regressions from 2.0.1.
>
>
>
> On 27 October 2016 at 19:07, vaquar khan  wrote:
>
> +1
>
>
>
> On Thu, Oct 27, 2016 at 11:56 AM, Davies Liu 
> wrote:
>
> +1
>
> On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin  wrote:
> > Greetings from Spark Summit Europe at Brussels.
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes
> if a
> > majority of at least 3+1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.0.2
> > [ ] -1 Do not release this package because ...
> >
> >
> > The tag to be voted on is v2.0.2-rc1
> > (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
> >
> > This release candidate resolves 75 issues:
> > https://s.apache.org/spark-2.0.2-jira
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1208/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
> >
> >
> > Q: How can I help test this release?
> > A: If you are a Spark user, you can help us test this release by taking
> an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 2.0.1.
> >
> > Q: What justifies a -1 vote for this release?
> > A: This is a maintenance release in the 2.0.x series. Bugs already
> present
> > in 2.0.1, missing features, or bugs related to new features will not
> > necessarily block this release.
> >
> > Q: What fix version should I use for patches merging into branch-2.0 from
> > now on?
> > A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> > (i.e. RC2) is cut, I will change the fix version of those patches to
> 2.0.2.
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
>
> IT Architect / Lead Consultant
> Greater Chicago
>
>
>


Re: encoders for more complex types

2016-10-27 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-18147

On Thu, Oct 27, 2016 at 4:55 PM, Koert Kuipers  wrote:

> ok will do
>
> On Thu, Oct 27, 2016 at 4:51 PM, Michael Armbrust 
> wrote:
>
>> I would categorize these as bugs.  We should (but probably don't fully
>> yet) support arbitrary nesting as long as you use basic collections / case
>> classes / primitives.  Please do open JIRAs as you find problems.
>>
>> On Thu, Oct 27, 2016 at 1:05 PM, Koert Kuipers  wrote:
>>
>>> well i was using Aggregators that returned sequences of structs, or
>>> structs with sequences inside etc. and got compilation errors on the
>>> codegen.
>>>
>>> i didnt bother trying to reproduce it so far, or post it, since what i
>>> did was beyond the supposed usage anyhow.
>>>
>>>
>>> On Thu, Oct 27, 2016 at 4:02 PM, Herman van Hövell tot Westerflier <
>>> hvanhov...@databricks.com> wrote:
>>>
 What kind of difficulties are you experiencing?

 On Thu, Oct 27, 2016 at 9:57 PM, Koert Kuipers 
 wrote:

> i have been pushing my luck a bit and started using ExpressionEncoder
> for more complex types like sequences of case classes etc. (where the case
> classes only had primitives and Strings).
>
> it all seems to work but i think the wheels come off in certain cases
> in the code generation. i guess this is not unexpected, after all what i 
> am
> doing is not yet supported.
>
> is there a planned path forward to support more complex types with
> encoders? it would be nice if we can at least support all types that
> spark-sql supports in general for DataFrame.
>
> best, koert
>


>>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Ricardo Almeida
+1 (non-binding)

built and tested without regressions from 2.0.1.



On 27 October 2016 at 19:07, vaquar khan  wrote:

> +1
>
>
>
> On Thu, Oct 27, 2016 at 11:56 AM, Davies Liu 
> wrote:
>
>> +1
>>
>> On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin 
>> wrote:
>> > Greetings from Spark Summit Europe at Brussels.
>> >
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes
>> if a
>> > majority of at least 3+1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.0.2
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > The tag to be voted on is v2.0.2-rc1
>> > (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
>> >
>> > This release candidate resolves 75 issues:
>> > https://s.apache.org/spark-2.0.2-jira
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1208/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>> >
>> >
>> > Q: How can I help test this release?
>> > A: If you are a Spark user, you can help us test this release by taking
>> an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions from 2.0.1.
>> >
>> > Q: What justifies a -1 vote for this release?
>> > A: This is a maintenance release in the 2.0.x series. Bugs already
>> present
>> > in 2.0.1, missing features, or bugs related to new features will not
>> > necessarily block this release.
>> >
>> > Q: What fix version should I use for patches merging into branch-2.0
>> from
>> > now on?
>> > A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> > (i.e. RC2) is cut, I will change the fix version of those patches to
>> 2.0.2.
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
>
> IT Architect / Lead Consultant
> Greater Chicago
>


Re: [VOTE] Release Apache Spark 1.6.3 (RC1)

2016-10-27 Thread Dongjoon Hyun
Hi, All.

Last time, RC1 passed the tests with only the timezone test case failing. Now 
the fix for that has been backported, too.
I'm wondering if there are any other issues that should block releasing Apache Spark 1.6.3.

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Ofir Manor
I totally agree with Sean, just a small correction:
Java 7 and Python 2.6 are already deprecated since Spark 2.0 (after a
lengthy discussion), so there is no need to discuss whether they should
become deprecated in 2.1
  http://spark.apache.org/releases/spark-release-2-0-0.html#deprecations
The discussion is whether Scala 2.10 should also be marked as deprecated
(no one is objecting to that), and more importantly, when to actually move
from deprecation to dropping support for any combination of JDK /
Scala / Hadoop / Python.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Fri, Oct 28, 2016 at 12:13 AM, Sean Owen  wrote:

> The burden may be a little more apparent when dealing with the day to day
> merging and fixing of breaks. The upside is maybe the more compelling
> argument though. For example, lambda-fying all the Java code, supporting
> java.time, and taking advantage of some newer Hadoop/YARN APIs is a
> moderate win for users too, and there's also a cost to not doing that.
>
> I must say I don't see a risk of fragmentation as nearly the problem it's
> made out to be here. We are, after all, here discussing _beginning_ to
> remove support _in 6 months_, for long since non-current versions of
> things. An org's decision to not, say, use Java 8 is a decision to not use
> the new version of lots of things. It's not clear this is a constituency
> that is either large or one to reasonably serve indefinitely.
>
> In the end, the Scala issue may be decisive. Supporting 2.10 - 2.12
> simultaneously is a bridge too far, and if 2.12 requires Java 8, it's a
> good reason for Spark to require Java 8. And Steve suggests that means a
> minimum of Hadoop 2.6 too. (I still profess ignorance of the Python part of
> the issue.)
>
> Put another way I am not sure what the criteria is, if not the above?
>
> I support deprecating all of these things, at the least, in 2.1.0.
> Although it's a separate question, I believe it's going to be necessary to
> remove support in ~6 months in 2.2.0.
>
>
> On Thu, Oct 27, 2016 at 4:36 PM Matei Zaharia 
> wrote:
>
>> Just to comment on this, I'm generally against removing these types of
>> things unless they create a substantial burden on project contributors. It
>> doesn't sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 might,
>> but then of course we need to wait for 2.12 to be out and stable.
>>
>> In general, this type of stuff only hurts users, and doesn't have a huge
>> impact on Spark contributors' productivity (sure, it's a bit unpleasant,
>> but that's life). If we break compatibility this way too quickly, we
>> fragment the user community, and then either people have a crappy
>> experience with Spark because their corporate IT doesn't yet have an
>> environment that can run the latest version, or worse, they create more
>> maintenance burden for us because they ask for more patches to be
>> backported to old Spark versions (1.6.x, 2.0.x, etc). Python in particular
>> is pretty fundamental to many Linux distros.
>>
>> In the future, rather than just looking at when some software came out,
>> it may be good to have some criteria for when to drop support for
>> something. For example, if there are really nice libraries in Python 2.7 or
>> Java 8 that we're missing out on, that may be a good reason. The
>> maintenance burden for multiple Scala versions is definitely painful but I
>> also think we should always support the latest two Scala releases.
>>
>> Matei
>>
>> On Oct 27, 2016, at 12:15 PM, Reynold Xin  wrote:
>>
>> I created a JIRA ticket to track this: https://issues.apache.
>> org/jira/browse/SPARK-18138
>>
>>
>>
>> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran 
>> wrote:
>>
>>
>> On 27 Oct 2016, at 10:03, Sean Owen  wrote:
>>
>> Seems OK by me.
>> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like
>> to add that to a list of things that will begin to be unsupported 6 months
>> from now.
>>
>>
>> If you go to java 8 only, then hadoop 2.6+ is mandatory.
>>
>>
>> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers  wrote:
>>
>> that sounds good to me
>>
>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin  wrote:
>>
>> We can do the following concrete proposal:
>>
>> 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr
>> 2017).
>>
>> 2. In Spark 2.1.0 release, aggressively and explicitly announce the
>> deprecation of Java 7 / Scala 2.10 support.
>>
>> (a) It should appear in release notes, documentations that mention how to
>> build Spark
>>
>> (b) and a warning should be shown every time SparkContext is started
>> using Scala 2.10 or Java 7.
>>
>>
>>
>>
>>


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Dongjoon Hyun
+1 non-binding.

Built and tested CentOS 6.6 / OpenJDK 1.8.0_111.

Cheers,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Sean Owen
The burden may be a little more apparent when dealing with the day to day
merging and fixing of breaks. The upside is maybe the more compelling
argument though. For example, lambda-fying all the Java code, supporting
java.time, and taking advantage of some newer Hadoop/YARN APIs is a
moderate win for users too, and there's also a cost to not doing that.

I must say I don't see a risk of fragmentation as nearly the problem it's
made out to be here. We are, after all, here discussing _beginning_ to
remove support _in 6 months_, for long since non-current versions of
things. An org's decision to not, say, use Java 8 is a decision to not use
the new version of lots of things. It's not clear this is a constituency
that is either large or one to reasonably serve indefinitely.

In the end, the Scala issue may be decisive. Supporting 2.10 - 2.12
simultaneously is a bridge too far, and if 2.12 requires Java 8, it's a
good reason for Spark to require Java 8. And Steve suggests that means a
minimum of Hadoop 2.6 too. (I still profess ignorance of the Python part of
the issue.)

Put another way I am not sure what the criteria is, if not the above?

I support deprecating all of these things, at the least, in 2.1.0. Although
it's a separate question, I believe it's going to be necessary to remove
support in ~6 months in 2.2.0.


On Thu, Oct 27, 2016 at 4:36 PM Matei Zaharia 
wrote:

> Just to comment on this, I'm generally against removing these types of
> things unless they create a substantial burden on project contributors. It
> doesn't sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 might,
> but then of course we need to wait for 2.12 to be out and stable.
>
> In general, this type of stuff only hurts users, and doesn't have a huge
> impact on Spark contributors' productivity (sure, it's a bit unpleasant,
> but that's life). If we break compatibility this way too quickly, we
> fragment the user community, and then either people have a crappy
> experience with Spark because their corporate IT doesn't yet have an
> environment that can run the latest version, or worse, they create more
> maintenance burden for us because they ask for more patches to be
> backported to old Spark versions (1.6.x, 2.0.x, etc). Python in particular
> is pretty fundamental to many Linux distros.
>
> In the future, rather than just looking at when some software came out, it
> may be good to have some criteria for when to drop support for something.
> For example, if there are really nice libraries in Python 2.7 or Java 8
> that we're missing out on, that may be a good reason. The maintenance
> burden for multiple Scala versions is definitely painful but I also think
> we should always support the latest two Scala releases.
>
> Matei
>
> On Oct 27, 2016, at 12:15 PM, Reynold Xin  wrote:
>
> I created a JIRA ticket to track this:
> https://issues.apache.org/jira/browse/SPARK-18138
>
>
>
> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran 
> wrote:
>
>
> On 27 Oct 2016, at 10:03, Sean Owen  wrote:
>
> Seems OK by me.
> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like
> to add that to a list of things that will begin to be unsupported 6 months
> from now.
>
>
> If you go to java 8 only, then hadoop 2.6+ is mandatory.
>
>
> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers  wrote:
>
> that sounds good to me
>
> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin  wrote:
>
> We can do the following concrete proposal:
>
> 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr
> 2017).
>
> 2. In Spark 2.1.0 release, aggressively and explicitly announce the
> deprecation of Java 7 / Scala 2.10 support.
>
> (a) It should appear in release notes, documentations that mention how to
> build Spark
>
> (b) and a warning should be shown every time SparkContext is started using
> Scala 2.10 or Java 7.
>
>
>
>
>


Re: encoders for more complex types

2016-10-27 Thread Koert Kuipers
ok will do

On Thu, Oct 27, 2016 at 4:51 PM, Michael Armbrust 
wrote:

> I would categorize these as bugs.  We should (but probably don't fully
> yet) support arbitrary nesting as long as you use basic collections / case
> classes / primitives.  Please do open JIRAs as you find problems.
>
> On Thu, Oct 27, 2016 at 1:05 PM, Koert Kuipers  wrote:
>
>> well i was using Aggregators that returned sequences of structs, or
>> structs with sequences inside etc. and got compilation errors on the
>> codegen.
>>
>> i didnt bother trying to reproduce it so far, or post it, since what i
>> did was beyond the supposed usage anyhow.
>>
>>
>> On Thu, Oct 27, 2016 at 4:02 PM, Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>>> What kind of difficulties are you experiencing?
>>>
>>> On Thu, Oct 27, 2016 at 9:57 PM, Koert Kuipers 
>>> wrote:
>>>
 i have been pushing my luck a bit and started using ExpressionEncoder
 for more complex types like sequences of case classes etc. (where the case
 classes only had primitives and Strings).

 it all seems to work but i think the wheels come off in certain cases
 in the code generation. i guess this is not unexpected, after all what i am
 doing is not yet supported.

 is there a planned path forward to support more complex types with
 encoders? it would be nice if we can at least support all types that
 spark-sql supports in general for DataFrame.

 best, koert

>>>
>>>
>>
>


Re: encoders for more complex types

2016-10-27 Thread Michael Armbrust
I would categorize these as bugs.  We should (but probably don't fully yet)
support arbitrary nesting as long as you use basic collections / case
classes / primitives.  Please do open JIRAs as you find problems.
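For concreteness, a minimal sketch of the pattern being discussed, assuming a
Dataset of a simple case class: an Aggregator whose buffer and output are
sequences of case classes, with both encoders built via ExpressionEncoder (the
Item / CollectItems names are illustrative, not from this thread).

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

case class Item(name: String, count: Int)

// Buffer and output are both Seq[Item]; a case class nested inside a Seq is
// the kind of shape that exercises the codegen paths mentioned in this thread.
object CollectItems extends Aggregator[Item, Seq[Item], Seq[Item]] {
  def zero: Seq[Item] = Seq.empty
  def reduce(buf: Seq[Item], in: Item): Seq[Item] = buf :+ in
  def merge(a: Seq[Item], b: Seq[Item]): Seq[Item] = a ++ b
  def finish(buf: Seq[Item]): Seq[Item] = buf
  def bufferEncoder: Encoder[Seq[Item]] = ExpressionEncoder[Seq[Item]]()
  def outputEncoder: Encoder[Seq[Item]] = ExpressionEncoder[Seq[Item]]()
}

// Used e.g. as: ds.groupByKey(_.name).agg(CollectItems.toColumn)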

On Thu, Oct 27, 2016 at 1:05 PM, Koert Kuipers  wrote:

> well i was using Aggregators that returned sequences of structs, or
> structs with sequences inside etc. and got compilation errors on the
> codegen.
>
> i didnt bother trying to reproduce it so far, or post it, since what i did
> was beyond the supposed usage anyhow.
>
>
> On Thu, Oct 27, 2016 at 4:02 PM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> What kind of difficulties are you experiencing?
>>
>> On Thu, Oct 27, 2016 at 9:57 PM, Koert Kuipers  wrote:
>>
>>> i have been pushing my luck a bit and started using ExpressionEncoder
>>> for more complex types like sequences of case classes etc. (where the case
>>> classes only had primitives and Strings).
>>>
>>> it all seems to work but i think the wheels come off in certain cases in
>>> the code generation. i guess this is not unexpected, after all what i am
>>> doing is not yet supported.
>>>
>>> is there a planned path forward to support more complex types with
>>> encoders? it would be nice if we can at least support all types that
>>> spark-sql supports in general for DataFrame.
>>>
>>> best, koert
>>>
>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Kousuke Saruta

+1

- Kousuke


On 2016/10/28 2:07, vaquar khan wrote:

+1



On Thu, Oct 27, 2016 at 11:56 AM, Davies Liu wrote:


+1

On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin wrote:
> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark
version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and
passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1
> (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
>
> This release candidate resolves 75 issues:
> https://s.apache.org/spark-2.0.2-jira

>
> The release files, including signatures, digests, etc. can be
found at:
>
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/

>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc

>
> The staging repository for this release can be found at:
>
https://repository.apache.org/content/repositories/orgapachespark-1208/

>
> The documentation corresponding to this release can be found at:
>
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/

>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by
taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs
already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into
branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a
new RC
> (i.e. RC2) is cut, I will change the fix version of those
patches to 2.0.2.
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





--
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago




Re: encoders for more complex types

2016-10-27 Thread Koert Kuipers
well i was using Aggregators that returned sequences of structs, or structs
with sequences inside etc. and got compilation errors on the codegen.

i didnt bother trying to reproduce it so far, or post it, since what i did
was beyond the supposed usage anyhow.


On Thu, Oct 27, 2016 at 4:02 PM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> What kind of difficulties are you experiencing?
>
> On Thu, Oct 27, 2016 at 9:57 PM, Koert Kuipers  wrote:
>
>> i have been pushing my luck a bit and started using ExpressionEncoder for
>> more complex types like sequences of case classes etc. (where the case
>> classes only had primitives and Strings).
>>
>> it all seems to work but i think the wheels come off in certain cases in
>> the code generation. i guess this is not unexpected, after all what i am
>> doing is not yet supported.
>>
>> is there a planned path forward to support more complex types with
>> encoders? it would be nice if we can at least support all types that
>> spark-sql supports in general for DataFrame.
>>
>> best, koert
>>
>
>


Re: encoders for more complex types

2016-10-27 Thread Herman van Hövell tot Westerflier
What kind of difficulties are you experiencing?

On Thu, Oct 27, 2016 at 9:57 PM, Koert Kuipers  wrote:

> i have been pushing my luck a bit and started using ExpressionEncoder for
> more complex types like sequences of case classes etc. (where the case
> classes only had primitives and Strings).
>
> it all seems to work but i think the wheels come off in certain cases in
> the code generation. i guess this is not unexpected, after all what i am
> doing is not yet supported.
>
> is there a planned path forward to support more complex types with
> encoders? it would be nice if we can at least support all types that
> spark-sql supports in general for DataFrame.
>
> best, koert
>


encoders for more complex types

2016-10-27 Thread Koert Kuipers
i have been pushing my luck a bit and started using ExpressionEncoder for
more complex types like sequences of case classes etc. (where the case
classes only had primitives and Strings).

it all seems to work but i think the wheels come off in certain cases in
the code generation. i guess this is not unexpected, after all what i am
doing is not yet supported.

is there a planned path forward to support more complex types with
encoders? it would be nice if we can at least support all types that
spark-sql supports in general for DataFrame.

best, koert


[JIRA] (SPARK-575) Maintain a cache of JARs on each node to avoid unnecessary copying

2016-10-27 Thread Anonymous (JIRA)
Anonymous started work on SPARK-575

Change By: Anonymous
Status: Open → In Progress



Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread vaquar khan
+1



On Thu, Oct 27, 2016 at 11:56 AM, Davies Liu  wrote:

> +1
>
> On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin  wrote:
> > Greetings from Spark Summit Europe at Brussels.
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes
> if a
> > majority of at least 3+1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.0.2
> > [ ] -1 Do not release this package because ...
> >
> >
> > The tag to be voted on is v2.0.2-rc1
> > (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
> >
> > This release candidate resolves 75 issues:
> > https://s.apache.org/spark-2.0.2-jira
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1208/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
> >
> >
> > Q: How can I help test this release?
> > A: If you are a Spark user, you can help us test this release by taking
> an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 2.0.1.
> >
> > Q: What justifies a -1 vote for this release?
> > A: This is a maintenance release in the 2.0.x series. Bugs already
> present
> > in 2.0.1, missing features, or bugs related to new features will not
> > necessarily block this release.
> >
> > Q: What fix version should I use for patches merging into branch-2.0 from
> > now on?
> > A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> > (i.e. RC2) is cut, I will change the fix version of those patches to
> 2.0.2.
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Davies Liu
+1 for Matei's point.

On Thu, Oct 27, 2016 at 8:36 AM, Matei Zaharia  wrote:
> Just to comment on this, I'm generally against removing these types of
> things unless they create a substantial burden on project contributors. It
> doesn't sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 might,
> but then of course we need to wait for 2.12 to be out and stable.
>
> In general, this type of stuff only hurts users, and doesn't have a huge
> impact on Spark contributors' productivity (sure, it's a bit unpleasant, but
> that's life). If we break compatibility this way too quickly, we fragment
> the user community, and then either people have a crappy experience with
> Spark because their corporate IT doesn't yet have an environment that can
> run the latest version, or worse, they create more maintenance burden for us
> because they ask for more patches to be backported to old Spark versions
> (1.6.x, 2.0.x, etc). Python in particular is pretty fundamental to many
> Linux distros.
>
> In the future, rather than just looking at when some software came out, it
> may be good to have some criteria for when to drop support for something.
> For example, if there are really nice libraries in Python 2.7 or Java 8 that
> we're missing out on, that may be a good reason. The maintenance burden for
> multiple Scala versions is definitely painful but I also think we should
> always support the latest two Scala releases.
>
> Matei
>
> On Oct 27, 2016, at 12:15 PM, Reynold Xin  wrote:
>
> I created a JIRA ticket to track this:
> https://issues.apache.org/jira/browse/SPARK-18138
>
>
>
> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran 
> wrote:
>>
>>
>> On 27 Oct 2016, at 10:03, Sean Owen  wrote:
>>
>> Seems OK by me.
>> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like
>> to add that to a list of things that will begin to be unsupported 6 months
>> from now.
>>
>>
>> If you go to java 8 only, then hadoop 2.6+ is mandatory.
>>
>>
>> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers  wrote:
>>>
>>> that sounds good to me
>>>
>>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin  wrote:

 We can do the following concrete proposal:

 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0
 (Mar/Apr 2017).

 2. In Spark 2.1.0 release, aggressively and explicitly announce the
 deprecation of Java 7 / Scala 2.10 support.

 (a) It should appear in release notes, documentations that mention how
 to build Spark

 (b) and a warning should be shown every time SparkContext is started
 using Scala 2.10 or Java 7.

>>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Davies Liu
+1

On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin  wrote:
> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1
> (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
>
> This release candidate resolves 75 issues:
> https://s.apache.org/spark-2.0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Michael Armbrust
+1

On Oct 27, 2016 12:19 AM, "Reynold Xin"  wrote:

> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1 (1c2908eeb8890fdc91413a3f5bad2b
> b3d114db6c)
>
> This release candidate resolves 75 issues: https://s.apache.org/spark-2.
> 0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Amit Tank
+1 for Matei's point.

On Thursday, October 27, 2016, Matei Zaharia 
wrote:

> Just to comment on this, I'm generally against removing these types of
> things unless they create a substantial burden on project contributors. It
> doesn't sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 might,
> but then of course we need to wait for 2.12 to be out and stable.
>
> In general, this type of stuff only hurts users, and doesn't have a huge
> impact on Spark contributors' productivity (sure, it's a bit unpleasant,
> but that's life). If we break compatibility this way too quickly, we
> fragment the user community, and then either people have a crappy
> experience with Spark because their corporate IT doesn't yet have an
> environment that can run the latest version, or worse, they create more
> maintenance burden for us because they ask for more patches to be
> backported to old Spark versions (1.6.x, 2.0.x, etc). Python in particular
> is pretty fundamental to many Linux distros.
>
> In the future, rather than just looking at when some software came out, it
> may be good to have some criteria for when to drop support for something.
> For example, if there are really nice libraries in Python 2.7 or Java 8
> that we're missing out on, that may be a good reason. The maintenance
> burden for multiple Scala versions is definitely painful but I also think
> we should always support the latest two Scala releases.
>
> Matei
>
> On Oct 27, 2016, at 12:15 PM, Reynold Xin  > wrote:
>
> I created a JIRA ticket to track this: https://issues.apache.
> org/jira/browse/SPARK-18138
>
>
>
> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran  > wrote:
>
>>
>> On 27 Oct 2016, at 10:03, Sean Owen > > wrote:
>>
>> Seems OK by me.
>> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like
>> to add that to a list of things that will begin to be unsupported 6 months
>> from now.
>>
>>
>> If you go to java 8 only, then hadoop 2.6+ is mandatory.
>>
>>
>> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers > > wrote:
>>
>>> that sounds good to me
>>>
>>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin >> > wrote:
>>>
 We can do the following concrete proposal:

 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0
 (Mar/Apr 2017).

 2. In Spark 2.1.0 release, aggressively and explicitly announce the
 deprecation of Java 7 / Scala 2.10 support.

 (a) It should appear in release notes, documentations that mention how
 to build Spark

 (b) and a warning should be shown every time SparkContext is started
 using Scala 2.10 or Java 7.


>>
>
>


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Matei Zaharia
Just to comment on this, I'm generally against removing these types of things 
unless they create a substantial burden on project contributors. It doesn't 
sound like Python 2.6 and Java 7 do that yet -- Scala 2.10 might, but then of 
course we need to wait for 2.12 to be out and stable.

In general, this type of stuff only hurts users, and doesn't have a huge impact 
on Spark contributors' productivity (sure, it's a bit unpleasant, but that's 
life). If we break compatibility this way too quickly, we fragment the user 
community, and then either people have a crappy experience with Spark because 
their corporate IT doesn't yet have an environment that can run the latest 
version, or worse, they create more maintenance burden for us because they ask 
for more patches to be backported to old Spark versions (1.6.x, 2.0.x, etc). 
Python in particular is pretty fundamental to many Linux distros.

In the future, rather than just looking at when some software came out, it may 
be good to have some criteria for when to drop support for something. For 
example, if there are really nice libraries in Python 2.7 or Java 8 that we're 
missing out on, that may be a good reason. The maintenance burden for multiple 
Scala versions is definitely painful but I also think we should always support 
the latest two Scala releases.

Matei

> On Oct 27, 2016, at 12:15 PM, Reynold Xin  wrote:
> 
> I created a JIRA ticket to track this: 
> https://issues.apache.org/jira/browse/SPARK-18138 
> 
> 
> 
> 
> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran  > wrote:
> 
>> On 27 Oct 2016, at 10:03, Sean Owen > > wrote:
>> 
>> Seems OK by me.
>> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like to 
>> add that to a list of things that will begin to be unsupported 6 months from 
>> now.
>> 
> 
> If you go to java 8 only, then hadoop 2.6+ is mandatory. 
> 
> 
>> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers > > wrote:
>> that sounds good to me
>> 
>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin > > wrote:
>> We can do the following concrete proposal:
>> 
>> 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr 
>> 2017).
>> 
>> 2. In Spark 2.1.0 release, aggressively and explicitly announce the 
>> deprecation of Java 7 / Scala 2.10 support.
>> 
>> (a) It should appear in release notes, documentations that mention how to 
>> build Spark
>> 
>> (b) and a warning should be shown every time SparkContext is started using 
>> Scala 2.10 or Java 7.
>> 
> 
> 



Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Yanbo Liang
+1

On Thu, Oct 27, 2016 at 3:15 AM, Reynold Xin  wrote:

> I created a JIRA ticket to track this: https://issues.apache.
> org/jira/browse/SPARK-18138
>
>
>
> On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran 
> wrote:
>
>>
>> On 27 Oct 2016, at 10:03, Sean Owen  wrote:
>>
>> Seems OK by me.
>> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like
>> to add that to a list of things that will begin to be unsupported 6 months
>> from now.
>>
>>
>> If you go to java 8 only, then hadoop 2.6+ is mandatory.
>>
>>
>> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers  wrote:
>>
>>> that sounds good to me
>>>
>>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin 
>>> wrote:
>>>
>>> We can do the following concrete proposal:
>>>
>>> 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0
>>> (Mar/Apr 2017).
>>>
>>> 2. In Spark 2.1.0 release, aggressively and explicitly announce the
>>> deprecation of Java 7 / Scala 2.10 support.
>>>
>>> (a) It should appear in release notes, documentations that mention how
>>> to build Spark
>>>
>>> (b) and a warning should be shown every time SparkContext is started
>>> using Scala 2.10 or Java 7.
>>>
>>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Koert Kuipers
+1 non binding

compiled and unit tested in-house libraries against 2.0.2-rc1 successfully

was able to build, deploy and launch on cdh 5.7 yarn cluster

on a side note... these artifacts on staging repo having version 2.0.2
instead of 2.0.2-rc1 makes it somewhat dangerous to test against it in
existing project. i can add a sbt resolver for staging repo and change my
spark version to 2.0.2, but now it starts downloading and caching these
artifacts as version 2.0.2 which means i am now hunting down local cache
locations afterwards like ~/.ivy2/cache to make sure i wipe them or run the
risk of in the future compiling against the rc instead of the actual
release by accident. not sure if i should be doing something different?
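For reference, a minimal sbt sketch of the setup described above, resolving
against the staging repository (the resolver name and choice of module are
illustrative):

// build.sbt
resolvers += "Apache Spark 2.0.2 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1208/"

// The staged artifacts carry version 2.0.2, not 2.0.2-rc1, which is why they
// get cached locally (e.g. under ~/.ivy2/cache) as if they were the release.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2" % "provided"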


On Thu, Oct 27, 2016 at 3:18 AM, Reynold Xin  wrote:

> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1 (1c2908eeb8890fdc91413a3f5bad2b
> b3d114db6c)
>
> This release candidate resolves 75 issues: https://s.apache.org/spark-2.
> 0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Sean Owen
+1 from me. All the sigs and licenses and hashes check out. It builds and
passes tests with -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver on Ubuntu
16 + Java 8.

Here are the open issues for 2.0.2 BTW. No blockers, but some marked
Critical FYI. Just making sure nobody really meant for one of these to
definitely be in 2.0.2.

SPARK-14387 Enable Hive-1.x ORC compatibility with
spark.sql.hive.convertMetastoreOrc
SPARK-17822 JVMObjectTracker.objMap may leak JVM objects
SPARK-17823 Make JVMObjectTracker.objMap thread-safe
SPARK-17957 Calling outer join and na.fill(0) and then inner join will miss
rows
SPARK-17972 Query planning slows down dramatically for large query plans
even when sub-trees are cached
SPARK-17981 Incorrectly Set Nullability to False in FilterExec
SPARK-17982 Spark 2.0.0  CREATE VIEW statement fails ::
java.lang.RuntimeException: Failed to analyze the canonicalized SQL. It is
possible there is a bug in Spark.


On Thu, Oct 27, 2016 at 9:46 AM Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> +1
>
> On Thu, Oct 27, 2016 at 9:18 AM, Reynold Xin  wrote:
>
> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1
> (1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)
>
> This release candidate resolves 75 issues:
> https://s.apache.org/spark-2.0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>
>


Spark 2.0 on HDP

2016-10-27 Thread Deenar Toraskar
Hi

Has anyone tried running Spark 2.0 on HDP? I have managed to get around the
issues with the timeline service (by turning it off), but am now stuck because
YARN cannot find org.apache.spark.deploy.yarn.ExecutorLauncher.

Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher

I have verified that both spark.driver.extraJavaOptions and
spark.yarn.am.extraJavaOptions have the hdp.version set correctly. Anything
else I am missing?

Regards
Deenar
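For anyone hitting the same thing, a minimal spark-defaults.conf sketch
combining the two workarounds mentioned in this thread; the hdp.version value
is illustrative and must match the cluster's actual HDP build.

spark.driver.extraJavaOptions               -Dhdp.version=2.4.2.0-258
spark.yarn.am.extraJavaOptions              -Dhdp.version=2.4.2.0-258
spark.hadoop.yarn.timeline-service.enabled  false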



On 10 May 2016 at 13:48, Steve Loughran  wrote:

>
> On 9 May 2016, at 21:24, Jesse F Chen  wrote:
>
> I had been running fine until builds around 05/07/2016
>
> If I used the "--master yarn" in builds after 05/07, I got the following
> error...sounds like something jars are missing.
>
> I am using YARN 2.7.2 and Hive 1.2.1.
>
> Do I need something new to deploy related to YARN?
>
> bin/spark-sql -driver-memory 10g --verbose* --master yarn* --packages
> com.databricks:spark-csv_2.10:1.3.0 --executor-memory 4g --num-executors
> 20 --executor-cores 2
>
> 16/05/09 13:15:21 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/05/09 13:15:21 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4041
> 16/05/09 13:15:21 INFO util.Utils: Successfully started service 'SparkUI'
> on port 4041.
> 16/05/09 13:15:21 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started
> at http://bigaperf116.svl.ibm.com:4041
> *Exception in thread "main" java.lang.NoClassDefFoundError:
> com/sun/jersey/api/client/config/ClientConfig*
> *at
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)*
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.
> serviceInit(YarnClientImpl.java:163)
> at org.apache.hadoop.service.AbstractService.init(
> AbstractService.java:163)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(
> YarnClientSchedulerBackend.scala:56)
> at org.apache.spark.scheduler.TaskSchedulerImpl.start(
> TaskSchedulerImpl.scala:148)
>
>
>
> Looks like Jersey client isn't on the classpath.
>
> 1. Consider filing a JIRA
> 2. set  spark.hadoop.yarn.timeline-service.enabled false to turn off ATS
>
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2246)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
> scala:762)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.
> init(SparkSQLEnv.scala:57)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(
> SparkSQLCLIDriver.scala:281)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(
> SparkSQLCLIDriver.scala:138)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(
> SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$
> deploy$SparkSubmit$$runMain(SparkSubmit.scala:727)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.
> config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 22 more
> 16/05/09 13:15:21 INFO storage.DiskBlockManager: Shutdown hook called
> 16/05/09 13:15:21 INFO util.ShutdownHookManager: Shutdown hook called
> 16/05/09 13:15:21 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-ac33b501-b9c3-47a3-93c8-fa02720bf4bb
> 16/05/09 13:15:21 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-65cb43d9-c122-4106-a0a8-ae7d92d9e19c
> 16/05/09 13:15:21 INFO util.ShutdownHookManager: Deleting directory
> /tmp/spark-65cb43d9-c122-4106-a0a8-ae7d92d9e19c/userFiles-
> 46dde536-29e5-46b3-a530-e5ad6640f8b2
>
>
>
>
>
> JESSE CHEN
> Big Data Performance | IBM Analytics
>
> Office: 408 463 2296
> Mobile: 408 828 9068
> Email: jfc...@us.ibm.com
>
>
>
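
For reference, suggestion (2) above can be applied on the command line (e.g. adding
--conf spark.hadoop.yarn.timeline-service.enabled=false to the spark-sql / spark-submit
invocation) or programmatically. A minimal sketch, assuming a Spark 2.x
SparkSession-based job; everything except the property name is illustrative:

import org.apache.spark.sql.SparkSession

// Sketch only: turn off the YARN timeline-service (ATS) client so the Jersey
// classes are never needed. The property name comes from the suggestion in
// this thread; the app name is made up.
val spark = SparkSession.builder()
  .appName("workload-on-yarn")
  .config("spark.hadoop.yarn.timeline-service.enabled", "false") // disable ATS integration
  .getOrCreate()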


Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Tathagata Das
Assaf, thanks for the feedback!

On Thu, Oct 27, 2016 at 3:28 AM, assaf.mendelson 
wrote:

> Thanks.
>
> This article is excellent. It completely explains everything.
>
> I would add it as a reference to any and all explanations of structured
> streaming (and in the case of watermarking, I simply didn’t understand the
> definition before reading this).
>
>
>
> Thanks,
>
> Assaf.
>
>
>
>
>
> *From:* kostas papageorgopoylos [via Apache Spark Developers List]
> [mailto:ml-node+[hidden email]
> ]
> *Sent:* Thursday, October 27, 2016 10:17 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>
>
>
> Hi all
>
> I would highly recommend to all users-devs interested in the design
> suggestions / discussions for Structured Streaming Spark API watermarking
>
> to take a look on the following links along with the design document. It
> would help to understand the notions of watermark , out of order data and
> possible use cases.
>
>
>
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
>
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
>
>
> Kind Regards
>
>
>
>
>
> 2016-10-27 9:46 GMT+03:00 assaf.mendelson <[hidden email]
> >:
>
> Hi,
>
> Should comments come here or in the JIRA?
>
> Any, I am a little confused on the need to expose this as an API to begin
> with.
>
> Let’s consider for a second the most basic behavior: We have some input
> stream and we want to aggregate a sum over a time window.
>
> This means that the window we should be looking at would be the maximum
> time across our data and back by the window interval. Everything older can
> be dropped.
>
> When new data arrives, the maximum time cannot move back so we generally
> drop everything tool old.
>
> This basically means we save only the latest time window.
>
> This simpler model would only break if we have a secondary aggregation
> which needs the results of multiple windows.
>
> Is this the use case we are trying to solve?
>
> If so, wouldn’t just calculating the bigger time window across the entire
> aggregation solve this?
>
> Am I missing something here?
>
>
>
> *From:* Michael Armbrust [via Apache Spark Developers List] [mailto:[hidden
> email] [hidden email]
> ]
> *Sent:* Thursday, October 27, 2016 3:04 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>
>
>
> And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
>
>
>
> On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]
> > wrote:
>
> Hey all,
>
>
>
> We are planning implement watermarking in Structured Streaming that would
> allow us handle late, out-of-order data better. Specially, when we are
> aggregating over windows on event-time, we currently can end up keeping
> unbounded amount data as state. We want to define watermarks on the event
> time in order mark and drop data that are "too late" and accordingly age
> out old aggregates that will not be updated any more.
>
>
>
> To enable the user to specify details like lateness threshold, we are
> considering adding a new method to Dataset. We would like to get more
> feedback on this API. Here is the design doc
>
>
>
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6Z
> LIS03xhkfCQ/
>
>
>
> Please comment on the design and proposed APIs.
>
>
>
> Thank you very much!
>
>
>
> TD
>
>
>
>

RE: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread assaf.mendelson
Thanks.
This article is excellent. It completely explains everything.
I would add it as a reference to any and all explanations of structured 
streaming (and in the case of watermarking, I simply didn’t understand the 
definition before reading this).

Thanks,
Assaf.


From: kostas papageorgopoylos [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19592...@n3.nabble.com]
Sent: Thursday, October 27, 2016 10:17 AM
To: Mendelson, Assaf
Subject: Re: Watermarking in Structured Streaming to drop late data

Hi all

I would highly recommend to all users-devs interested in the design suggestions 
/ discussions for Structured Streaming Spark API watermarking
to take a look on the following links along with the design document. It would 
help to understand the notions of watermark , out of order data and possible 
use cases.

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Kind Regards


2016-10-27 9:46 GMT+03:00 assaf.mendelson <[hidden 
email]>:
Hi,
Should comments come here or in the JIRA?
Any, I am a little confused on the need to expose this as an API to begin with.
Let’s consider for a second the most basic behavior: We have some input stream 
and we want to aggregate a sum over a time window.
This means that the window we should be looking at would be the maximum time 
across our data and back by the window interval. Everything older can be 
dropped.
When new data arrives, the maximum time cannot move back so we generally drop 
everything tool old.
This basically means we save only the latest time window.
This simpler model would only break if we have a secondary aggregation which 
needs the results of multiple windows.
Is this the use case we are trying to solve?
If so, wouldn’t just calculating the bigger time window across the entire 
aggregation solve this?
Am I missing something here?

From: Michael Armbrust [via Apache Spark Developers List] [mailto:[hidden 
email][hidden 
email]]
Sent: Thursday, October 27, 2016 3:04 AM
To: Mendelson, Assaf
Subject: Re: Watermarking in Structured Streaming to drop late data

And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124

On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden 
email]> wrote:
Hey all,

We are planning implement watermarking in Structured Streaming that would allow 
us handle late, out-of-order data better. Specially, when we are aggregating 
over windows on event-time, we currently can end up keeping unbounded amount 
data as state. We want to define watermarks on the event time in order mark and 
drop data that are "too late" and accordingly age out old aggregates that will 
not be updated any more.

To enable the user to specify details like lateness threshold, we are 
considering adding a new method to Dataset. We would like to get more feedback 
on this API. Here is the design doc

https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/

Please comment on the design and proposed APIs.

Thank you very much!

TD




Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Tathagata Das
Hello Assaf,

I think you are missing the fact that we want to compute over event-time of
the data (e.g. data generation time), which may arrive at Spark
out-of-order and late. And we want to aggregate over late data. The
watermark is an estimate made by the system that no data older than the
watermark time will arrive after now.

If this basic context is clear, then please read the design doc for further
details. Please comment in the doc for more specific design discussions.
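
Just to make the watermark idea concrete, a rough sketch is below. The method name and
signature are purely hypothetical, not the proposed API (the actual API is what the design
doc discusses), and events is assumed to be a streaming DataFrame with an eventTime
timestamp column:

import org.apache.spark.sql.functions.{window, col}

// Hypothetical sketch only -- withWatermark is NOT the proposed API, just an
// illustration of declaring the event-time column and a lateness threshold.
// The engine could then age out state for windows older than
// (max event time seen so far - threshold).
val counts = events
  .withWatermark("eventTime", "10 minutes")        // hypothetical method
  .groupBy(window(col("eventTime"), "5 minutes"))  // existing window() function
  .count()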

On Thu, Oct 27, 2016 at 1:52 AM, Ofir Manor  wrote:

> Assaf,
> I think you are using the term "window" differently than Structured
> Streaming,... Also, you didn't consider groupBy. Here is an example:
> I want to maintain, for every minute over the last six hours, a
> computation (trend or average or stddev) on a five-minute window (from t-4
> to t). So,
> 1. My window size is 5 minutes
> 2. The window slides every 1 minute (so, there is a new 5-minute window
> for every minute)
> 3. Old windows should be purged if they are 6 hours old (based on event
> time vs. clock?)
> Option 3 is currently missing - the streaming job keeps all windows
> forever, as the app may want to access very old windows, unless it would
> explicitly say otherwise.
>
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Thu, Oct 27, 2016 at 9:46 AM, assaf.mendelson 
> wrote:
>
>> Hi,
>>
>> Should comments come here or in the JIRA?
>>
>> Any, I am a little confused on the need to expose this as an API to begin
>> with.
>>
>> Let’s consider for a second the most basic behavior: We have some input
>> stream and we want to aggregate a sum over a time window.
>>
>> This means that the window we should be looking at would be the maximum
>> time across our data and back by the window interval. Everything older can
>> be dropped.
>>
>> When new data arrives, the maximum time cannot move back so we generally
>> drop everything tool old.
>>
>> This basically means we save only the latest time window.
>>
>> This simpler model would only break if we have a secondary aggregation
>> which needs the results of multiple windows.
>>
>> Is this the use case we are trying to solve?
>>
>> If so, wouldn’t just calculating the bigger time window across the entire
>> aggregation solve this?
>>
>> Am I missing something here?
>>
>>
>>
>> *From:* Michael Armbrust [via Apache Spark Developers List] [mailto:
>> ml-node+[hidden email]
>> ]
>> *Sent:* Thursday, October 27, 2016 3:04 AM
>> *To:* Mendelson, Assaf
>> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>>
>>
>>
>> And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
>>
>>
>>
>> On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]
>> > wrote:
>>
>> Hey all,
>>
>>
>>
>> We are planning implement watermarking in Structured Streaming that would
>> allow us handle late, out-of-order data better. Specially, when we are
>> aggregating over windows on event-time, we currently can end up keeping
>> unbounded amount data as state. We want to define watermarks on the event
>> time in order mark and drop data that are "too late" and accordingly age
>> out old aggregates that will not be updated any more.
>>
>>
>>
>> To enable the user to specify details like lateness threshold, we are
>> considering adding a new method to Dataset. We would like to get more
>> feedback on this API. Here is the design doc
>>
>>
>>
>> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5x
>> wqaNQl6ZLIS03xhkfCQ/
>>
>>
>>
>> Please comment on the design and proposed APIs.
>>
>>
>>
>> Thank you very much!
>>
>>
>>
>> TD
>>
>>
>>
>>

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Reynold Xin
I created a JIRA ticket to track this:
https://issues.apache.org/jira/browse/SPARK-18138



On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran 
wrote:

>
> On 27 Oct 2016, at 10:03, Sean Owen  wrote:
>
> Seems OK by me.
> How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like
> to add that to a list of things that will begin to be unsupported 6 months
> from now.
>
>
> If you go to java 8 only, then hadoop 2.6+ is mandatory.
>
>
> On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers  wrote:
>
>> that sounds good to me
>>
>> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin  wrote:
>>
>> We can do the following concrete proposal:
>>
>> 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr
>> 2017).
>>
>> 2. In Spark 2.1.0 release, aggressively and explicitly announce the
>> deprecation of Java 7 / Scala 2.10 support.
>>
>> (a) It should appear in release notes, documentations that mention how to
>> build Spark
>>
>> (b) and a warning should be shown every time SparkContext is started
>> using Scala 2.10 or Java 7.
>>
>>
>


Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Ofir Manor
Assaf,
I think you are using the term "window" differently from how Structured
Streaming uses it. Also, you didn't consider groupBy. Here is an example:
I want to maintain, for every minute over the last six hours, a computation
(trend or average or stddev) on a five-minute window (from t-4 to t). So,
1. My window size is 5 minutes
2. The window slides every 1 minute (so, there is a new 5-minute window for
every minute)
3. Old windows should be purged if they are 6 hours old (based on event
time vs. clock?)
Option 3 is currently missing - the streaming job keeps all windows
forever, since the app might want to access very old windows, unless it
explicitly says otherwise.
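
For what exists today, points 1 and 2 map directly onto the current API. A minimal sketch,
assuming events is a streaming DataFrame with an eventTime timestamp column and a numeric
value column (both names made up):

import org.apache.spark.sql.functions.{window, avg, col}

// 5-minute windows sliding every 1 minute, one aggregate per window.
// Point 3 -- purging window state once it is ~6 hours old -- is what is
// missing today and what the watermark proposal is about.
val trends = events
  .groupBy(window(col("eventTime"), "5 minutes", "1 minute"))
  .agg(avg(col("value")))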


Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Oct 27, 2016 at 9:46 AM, assaf.mendelson 
wrote:

> Hi,
>
> Should comments come here or in the JIRA?
>
> Any, I am a little confused on the need to expose this as an API to begin
> with.
>
> Let’s consider for a second the most basic behavior: We have some input
> stream and we want to aggregate a sum over a time window.
>
> This means that the window we should be looking at would be the maximum
> time across our data and back by the window interval. Everything older can
> be dropped.
>
> When new data arrives, the maximum time cannot move back so we generally
> drop everything tool old.
>
> This basically means we save only the latest time window.
>
> This simpler model would only break if we have a secondary aggregation
> which needs the results of multiple windows.
>
> Is this the use case we are trying to solve?
>
> If so, wouldn’t just calculating the bigger time window across the entire
> aggregation solve this?
>
> Am I missing something here?
>
>
>
> *From:* Michael Armbrust [via Apache Spark Developers List] [mailto:
> ml-node+[hidden email]
> ]
> *Sent:* Thursday, October 27, 2016 3:04 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>
>
>
> And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
>
>
>
> On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]
> > wrote:
>
> Hey all,
>
>
>
> We are planning implement watermarking in Structured Streaming that would
> allow us handle late, out-of-order data better. Specially, when we are
> aggregating over windows on event-time, we currently can end up keeping
> unbounded amount data as state. We want to define watermarks on the event
> time in order mark and drop data that are "too late" and accordingly age
> out old aggregates that will not be updated any more.
>
>
>
> To enable the user to specify details like lateness threshold, we are
> considering adding a new method to Dataset. We would like to get more
> feedback on this API. Here is the design doc
>
>
>
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6Z
> LIS03xhkfCQ/
>
>
>
> Please comment on the design and proposed APIs.
>
>
>
> Thank you very much!
>
>
>
> TD
>
>
>
>


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Steve Loughran

On 27 Oct 2016, at 10:03, Sean Owen 
> wrote:

Seems OK by me.
How about Hadoop < 2.6, Python 2.6? Those seem more removeable. I'd like to add 
that to a list of things that will begin to be unsupported 6 months from now.


If you go to Java 8 only, then Hadoop 2.6+ is mandatory.


On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers 
> wrote:
that sounds good to me

On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin 
> wrote:
We can do the following concrete proposal:

1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr 2017).

2. In Spark 2.1.0 release, aggressively and explicitly announce the deprecation 
of Java 7 / Scala 2.10 support.

(a) It should appear in release notes, documentations that mention how to build 
Spark

(b) and a warning should be shown every time SparkContext is started using 
Scala 2.10 or Java 7.
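
As an illustration only (not actual Spark code), the check in (b) could look roughly like
this at SparkContext startup; version strings, message text and placement are all made up:

import org.slf4j.LoggerFactory

// Illustrative sketch of the proposed deprecation warning.
object DeprecationCheck {
  private val log = LoggerFactory.getLogger(getClass)

  def warnIfDeprecated(): Unit = {
    val scalaVersion = scala.util.Properties.versionNumberString // e.g. "2.10.6"
    val javaVersion  = System.getProperty("java.version")        // e.g. "1.7.0_80"
    if (scalaVersion.startsWith("2.10") || javaVersion.startsWith("1.7")) {
      log.warn("Support for Java 7 and Scala 2.10 is deprecated as of Spark 2.1.0 " +
        "and may be removed in Spark 2.2.0. Please upgrade to Java 8 / Scala 2.11.")
    }
  }
}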




Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Sean Owen
Seems OK to me.
How about Hadoop < 2.6 and Python 2.6? Those seem more removable. I'd like to
add those to a list of things that will begin to be unsupported 6 months
from now.

On Wed, Oct 26, 2016 at 8:49 PM Koert Kuipers  wrote:

> that sounds good to me
>
> On Wed, Oct 26, 2016 at 2:26 PM, Reynold Xin  wrote:
>
> We can do the following concrete proposal:
>
> 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr
> 2017).
>
> 2. In Spark 2.1.0 release, aggressively and explicitly announce the
> deprecation of Java 7 / Scala 2.10 support.
>
> (a) It should appear in release notes, documentations that mention how to
> build Spark
>
> (b) and a warning should be shown every time SparkContext is started using
> Scala 2.10 or Java 7.
>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Herman van Hövell tot Westerflier
+1

On Thu, Oct 27, 2016 at 9:18 AM, Reynold Xin  wrote:

> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1 (1c2908eeb8890fdc91413a3f5bad2b
> b3d114db6c)
>
> This release candidate resolves 75 issues: https://s.apache.org/spark-2.
> 0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>


[VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Reynold Xin
Greetings from Spark Summit Europe at Brussels.

Please vote on releasing the following candidate as Apache Spark version
2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
a majority of at least 3+1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.2-rc1
(1c2908eeb8890fdc91413a3f5bad2bb3d114db6c)

This release candidate resolves 75 issues:
https://s.apache.org/spark-2.0.2-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1208/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions from 2.0.1.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present
in 2.0.1, missing features, or bugs related to new features will not
necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from
now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
(i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
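
For sbt-based workloads, one way to pick up the candidate is to add the staging repository
above as a resolver (a sketch only; adjust the artifact to whatever your workload actually
depends on):

// build.sbt fragment -- illustrative only.
resolvers += "Apache Spark 2.0.2 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1208/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"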


Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread kostas papageorgopoylos
Hi all

I would highly recommend that all users and devs interested in the design
suggestions / discussions for the Structured Streaming watermarking API
take a look at the following links along with the design document. They
help to understand the notions of watermark, out-of-order data and
possible use cases.

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Kind Regards


2016-10-27 9:46 GMT+03:00 assaf.mendelson :

> Hi,
>
> Should comments come here or in the JIRA?
>
> Any, I am a little confused on the need to expose this as an API to begin
> with.
>
> Let’s consider for a second the most basic behavior: We have some input
> stream and we want to aggregate a sum over a time window.
>
> This means that the window we should be looking at would be the maximum
> time across our data and back by the window interval. Everything older can
> be dropped.
>
> When new data arrives, the maximum time cannot move back so we generally
> drop everything tool old.
>
> This basically means we save only the latest time window.
>
> This simpler model would only break if we have a secondary aggregation
> which needs the results of multiple windows.
>
> Is this the use case we are trying to solve?
>
> If so, wouldn’t just calculating the bigger time window across the entire
> aggregation solve this?
>
> Am I missing something here?
>
>
>
> *From:* Michael Armbrust [via Apache Spark Developers List] [mailto:
> ml-node+[hidden email]
> ]
> *Sent:* Thursday, October 27, 2016 3:04 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>
>
>
> And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
>
>
>
> On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]
> > wrote:
>
> Hey all,
>
>
>
> We are planning implement watermarking in Structured Streaming that would
> allow us handle late, out-of-order data better. Specially, when we are
> aggregating over windows on event-time, we currently can end up keeping
> unbounded amount data as state. We want to define watermarks on the event
> time in order mark and drop data that are "too late" and accordingly age
> out old aggregates that will not be updated any more.
>
>
>
> To enable the user to specify details like lateness threshold, we are
> considering adding a new method to Dataset. We would like to get more
> feedback on this API. Here is the design doc
>
>
>
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6Z
> LIS03xhkfCQ/
>
>
>
> Please comment on the design and proposed APIs.
>
>
>
> Thank you very much!
>
>
>
> TD
>
>
>
>


RE: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread assaf.mendelson
Hi,
Should comments come here or in the JIRA?
Anyway, I am a little confused about the need to expose this as an API to begin with.
Let’s consider for a second the most basic behavior: We have some input stream 
and we want to aggregate a sum over a time window.
This means that the window we should be looking at would be the maximum time 
across our data and back by the window interval. Everything older can be 
dropped.
When new data arrives, the maximum time cannot move back, so we generally drop 
everything too old.
This basically means we save only the latest time window.
This simpler model would only break if we have a secondary aggregation which 
needs the results of multiple windows.
Is this the use case we are trying to solve?
If so, wouldn’t just calculating the bigger time window across the entire 
aggregation solve this?
Am I missing something here?

From: Michael Armbrust [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19590...@n3.nabble.com]
Sent: Thursday, October 27, 2016 3:04 AM
To: Mendelson, Assaf
Subject: Re: Watermarking in Structured Streaming to drop late data

And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124

On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden 
email]> wrote:
Hey all,

We are planning implement watermarking in Structured Streaming that would allow 
us handle late, out-of-order data better. Specially, when we are aggregating 
over windows on event-time, we currently can end up keeping unbounded amount 
data as state. We want to define watermarks on the event time in order mark and 
drop data that are "too late" and accordingly age out old aggregates that will 
not be updated any more.

To enable the user to specify details like lateness threshold, we are 
considering adding a new method to Dataset. We would like to get more feedback 
on this API. Here is the design doc

https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/

Please comment on the design and proposed APIs.

Thank you very much!

TD


