Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-31 Thread Kostas Sakellis
Hey Michael,

There is a discussion on TIMESTAMP semantics going on the thread "SQL
TIMESTAMP semantics vs. SPARK-18350" which might impact Spark 2.2. Should
we make a decision there before voting on the next RC for Spark 2.2?

Thanks,
Kostas

On Tue, May 30, 2017 at 12:09 PM, Michael Armbrust 
wrote:

> Last call, anything else important in-flight for 2.2?
>
> On Thu, May 25, 2017 at 10:56 AM, Michael Allman 
> wrote:
>
>> PR is here: https://github.com/apache/spark/pull/18112
>>
>>
>> On May 25, 2017, at 10:28 AM, Michael Allman 
>> wrote:
>>
>> Michael,
>>
>> If you haven't started cutting the new RC, I'm working on a documentation
>> PR right now that I'm hoping we can get into Spark 2.2 as a migration note,
>> even if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888.
>>
>> Michael
>>
>>
>> On May 22, 2017, at 11:39 AM, Michael Armbrust 
>> wrote:
>>
>> I'm waiting for SPARK-20814 at Marcelo's request and I'd also like to
>> include SPARK-20844. I think we should be able to cut another RC midweek.
>>
>> On Fri, May 19, 2017 at 11:53 AM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> All the outstanding ML QA doc and user guide items are done for 2.2 so
>>> from that side we should be good to cut another RC :)
>>>
>>>
>>> On Thu, 18 May 2017 at 00:18 Russell Spitzer 
>>> wrote:
>>>
 Seeing an issue with the DataSourceScanExec and some of our integration tests
 for the SCC. Running DataFrame reads and writes from the shell seems fine
 but the Redaction code seems to get a "None" when doing
 SparkSession.getActiveSession.get in our integration tests. I'm not
 sure why, but I'll dig into this later if I get a chance.

 Example Failed Test
 https://github.com/datastax/spark-cassandra-connector/blob/v2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/sql/CassandraSQLSpec.scala#L311

 ```[info]   org.apache.spark.SparkException: Job aborted due to stage
 failure: Task serialization failed: java.util.NoSuchElementException:
 None.get
 [info] java.util.NoSuchElementException: None.get
 [info] at scala.None$.get(Option.scala:347)
 [info] at scala.None$.get(Option.scala:345)
 [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
 [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
 [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
 ```

 Again, this only seems to repro in our IT suite, so I'm not sure if this
 is a real issue.
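
A minimal sketch of the failure mode, using only the public session-tracking API (the setActiveSession line is a hypothetical test-side workaround, not a claim about the planner code):

```
import org.apache.spark.sql.SparkSession

// getActiveSession is tracked per thread and returns an Option[SparkSession].
// On a thread where no session has been registered, calling .get throws the
// java.util.NoSuchElementException: None.get seen in the trace above.
SparkSession.clearActiveSession()
assert(SparkSession.getActiveSession.isEmpty)
// SparkSession.getActiveSession.get   // would throw NoSuchElementException

// Hypothetical workaround for a test harness: register the session on the
// thread that runs the query before executing it.
// SparkSession.setActiveSession(spark)  // `spark` is the test's SparkSession
```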


 On Tue, May 16, 2017 at 1:40 PM Joseph Bradley 
 wrote:

> All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.
> Thanks everyone who helped out on those!
>
> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are
> essentially all for documentation.
>
> Joseph
>
> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin 
> wrote:
>
>> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
>> fixed, since the change that caused it is in branch-2.2. Probably a
>> good idea to raise it to blocker at this point.
>>
>> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
>>  wrote:
>> > I'm going to -1 given the outstanding issues and lack of +1s.  I'll
>> create
>> > another RC once ML has had time to take care of the more critical
>> problems.
>> > In the meantime please keep testing this release!
>> >
>> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki <
>> ishiz...@jp.ibm.com>
>> > wrote:
>> >>
>> >> +1 (non-binding)
>> >>
>> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the
>> tests for
>> >> core have passed.
>> >>
>> >> $ java -version
>> >> openjdk version "1.8.0_111"
>> >> OpenJDK Runtime Environment (build
>> >> 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
>> >> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>> >> $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn
>> -Phadoop-2.7
>> >> package install
>> >> $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test
>> -pl core
>> >> ...
>> >> Run completed in 15 minutes, 12 seconds.
>> >> Total number of tests run: 1940
>> >> Suites: completed 206, aborted 0
>> >> Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0
>> >> All tests passed.
>> >> [INFO]
>> >> 

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-05 Thread Kostas Sakellis
From both this and the JDK thread, I've noticed (myself included) that
people have different notions of the compatibility guarantees between major
and minor versions.
A simple question I have is: What compatibility can we break between minor
vs. major releases?

It might be worth getting on the same page wrt compatibility guarantees.

Just a thought,
Kostas

On Tue, Apr 5, 2016 at 4:39 PM, Holden Karau  wrote:

> One minor downside to having both 2.10 and 2.11 (and eventually 2.12) is
> deprecation warnings in our builds that we can't fix without introducing a
> wrapper / Scala-version-specific code. This isn't a big deal, and if we drop
> 2.10 in the 3-6 month time frame talked about we can clean up those warnings
> once we get there.
>
> On Fri, Apr 1, 2016 at 10:00 PM, Raymond Honderdors <
> raymond.honderd...@sizmek.com> wrote:
>
>> What about a separate branch for Scala 2.10?
>>
>>
>>
>> Sent from my Samsung Galaxy smartphone.
>>
>>
>>  Original message 
>> From: Koert Kuipers 
>> Date: 4/2/2016 02:10 (GMT+02:00)
>> To: Michael Armbrust 
>> Cc: Matei Zaharia , Mark Hamstra <
>> m...@clearstorydata.com>, Cody Koeninger , Sean Owen
>> , dev@spark.apache.org
>> Subject: Re: Discuss: commit to Scala 2.10 support for Spark 2.x
>> lifecycle
>>
>> as long as we don't lock ourselves into supporting scala 2.10 for the
>> entire spark 2 lifespan it sounds reasonable to me
>>
>> On Wed, Mar 30, 2016 at 3:25 PM, Michael Armbrust wrote:
>>
>>> +1 to Matei's reasoning.
>>>
>>> On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia 
>>> wrote:
>>>
 I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the
 entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's
 the default version we built with in 1.x. We want to make the transition
 from 1.x to 2.0 as easy as possible. In 2.0, we'll have the default
 downloads be for Scala 2.11, so people will more easily move, but we
 shouldn't create obstacles that lead to fragmenting the community and
 slowing down Spark 2.0's adoption. I've seen companies that stayed on an
 old Scala version for multiple years because switching it, or mixing
 versions, would affect the company's entire codebase.

 Matei

 On Mar 30, 2016, at 12:08 PM, Koert Kuipers  wrote:

 oh wow, had no idea it got ripped out

 On Wed, Mar 30, 2016 at 11:50 AM, Mark Hamstra  wrote:

> No, with 2.0 Spark really doesn't use Akka:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744
>
> On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers 
> wrote:
>
>> Spark still runs on akka. So if you want the benefits of the latest
>> akka (not saying we do, was just an example) then you need to drop scala
>> 2.10
>> On Mar 30, 2016 10:44 AM, "Cody Koeninger" 
>> wrote:
>>
>>> I agree with Mark in that I don't see how supporting scala 2.10 for
>>> spark 2.0 implies supporting it for all of spark 2.x
>>>
>>> Regarding Koert's comment on akka, I thought all akka dependencies
>>> have been removed from spark after SPARK-7997 and the recent removal
>>> of external/akka
>>>
>>> On Wed, Mar 30, 2016 at 9:36 AM, Mark Hamstra <
>>> m...@clearstorydata.com> wrote:
>>> > Dropping Scala 2.10 support has to happen at some point, so I'm not
>>> > fundamentally opposed to the idea; but I've got questions about
>>> how we go
>>> > about making the change and what degree of negative consequences
>>> we are
>>> > willing to accept.  Until now, we have been saying that 2.10
>>> support will be
>>> > continued in Spark 2.0.0.  Switching to 2.11 will be non-trivial
>>> for some
>>> > Spark users, so abruptly dropping 2.10 support is very likely to
>>> delay
>>> > migration to Spark 2.0 for those users.
>>> >
>>> > What about continuing 2.10 support in 2.0.x, but repeatedly making
>>> an
>>> > obvious announcement in multiple places that such support is
>>> deprecated,
>>> > that we are not committed to maintaining it throughout 2.x, and
>>> that it is,
>>> > in fact, scheduled to be removed in 2.1.0?
>>> >
>>> > On Wed, Mar 30, 2016 at 7:45 AM, Sean Owen 
>>> wrote:
>>> >>
>>> >> (This should fork as its own thread, though it began during
>>> discussion
>>> >> of whether to continue Java 7 support in Spark 2.x.)
>>> >>
>>> >> Simply: would like to more clearly take the temperature of all
>>> >> interested parties about whether to support Scala 2.10 in the
>>> Spark
>>> >> 2.x lifecycle. Some of 

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-28 Thread Kostas Sakellis
Also, +1 on dropping jdk7 in Spark 2.0.

Kostas

On Mon, Mar 28, 2016 at 2:01 PM, Marcelo Vanzin  wrote:

> Finally got some internal feedback on this, and we're ok with
> requiring people to deploy jdk8 for 2.0, so +1 too.
>
> On Mon, Mar 28, 2016 at 1:15 PM, Luciano Resende 
> wrote:
> > +1, I also checked with a few projects inside IBM that consume Spark and
> > they seem to be ok with the direction of dropping JDK 7.
> >
> > On Mon, Mar 28, 2016 at 11:24 AM, Michael Gummelt <
> mgumm...@mesosphere.io>
> > wrote:
> >>
> >> +1 from Mesosphere
> >>
> >> On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran wrote:
> >>>
> >>>
> >>> > On 25 Mar 2016, at 01:59, Mridul Muralidharan 
> wrote:
> >>> >
> >>> > Removing compatibility (with jdk, etc) can be done with a major
> >>> > release- given that 7 has been EOLed a while back and is now
> unsupported, we
> >>> > have to decide if we drop support for it in 2.0 or 3.0 (2+ years
> from now).
> >>> >
> >>> > Given the functionality & performance benefits of going to jdk8, future
> >>> > enhancements relevant in the 2.x timeframe (scala, dependencies) that
> >>> > require it, and simplicity wrt code, test & support, it looks like a good
> >>> > checkpoint to drop jdk7 support.
> >>> >
> >>> > As already mentioned in the thread, existing yarn clusters are
> >>> > unaffected if they want to continue running jdk7 and yet use spark2
> (install
> >>> > jdk8 on all nodes and use it via JAVA_HOME, or worst case distribute
> jdk8 as
> >>> > archive - suboptimal).
> >>>
> >>> you wouldn't want to dist it as an archive; it's not just the binaries,
> >>> it's the install phase. And you'd better remember to put the JCE jar
> in on
> >>> top of the JDK for kerberos to work.
> >>>
> >>> setting up environment vars to point to JDK8 in the launched
> >>> app/container avoids that. Yes, the ops team do need to install java,
> but if
> >>> you offer them the choice of "installing a centrally managed Java" and
> >>> "having my code try and install it", they should go for the managed
> option.
> >>>
> >>> One thing to consider for 2.0 is to make it easier to set up those env
> >>> vars for both python and java. And, as the techniques for mixing JDK
> >>> versions is clearly not that well known, documenting it.
> >>>
> >>> (FWIW I've done code which even uploads its own hadoop-* JAR, but what
> >>> gets you is changes in the hadoop-native libs; you do need to get the
> PATH
> >>> var spot on)
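
A minimal sketch of the environment-variable approach being discussed, assuming a YARN deployment where JDK 8 is already installed on every node (the path below is a placeholder):

```
import org.apache.spark.SparkConf

// Point the YARN application master and the executors at a locally
// installed JDK 8 without changing the cluster-wide default JAVA_HOME.
// "/usr/lib/jvm/java-8" is a hypothetical install location.
val conf = new SparkConf()
  .set("spark.yarn.appMasterEnv.JAVA_HOME", "/usr/lib/jvm/java-8")
  .set("spark.executorEnv.JAVA_HOME", "/usr/lib/jvm/java-8")
```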
> >>>
> >>>
> >>> > I am unsure about mesos (standalone might be easier upgrade I guess
> ?).
> >>> >
> >>> >
> >>> > Proposal is for 1.6x line to continue to be supported with critical
> >>> > fixes; newer features will require 2.x and so jdk8
> >>> >
> >>> > Regards
> >>> > Mridul
> >>> >
> >>> >
> >>>
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>
> >>
> >>
> >> --
> >> Michael Gummelt
> >> Software Engineer
> >> Mesosphere
> >
> >
> >
> >
> > --
> > Luciano Resende
> > http://twitter.com/lresende1975
> > http://lresende.blogspot.com/
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Kostas Sakellis
In addition, with Spark 2.0 we are throwing away binary compatibility
anyway, so user applications will have to be recompiled.

The only argument I can see is for libraries that have already been built
on Scala 2.10 that are no longer being maintained. How big of an issue do
we think that is?

Kostas

On Thu, Mar 24, 2016 at 4:48 PM, Marcelo Vanzin  wrote:

> On Thu, Mar 24, 2016 at 4:46 PM, Reynold Xin  wrote:
> > Actually it's *way* harder to upgrade Scala from 2.10 to 2.11, than
> > upgrading the JVM runtime from 7 to 8, because Scala 2.10 and 2.11 are
> not
> > binary compatible, whereas JVM 7 and 8 are binary compatible except
> certain
> > esoteric cases.
>
> True, but ask anyone who manages a large cluster how long it would
> take them to upgrade the jdk across their cluster and validate all
> their applications and everything... binary compatibility is a tiny
> drop in that bucket.
>
> --
> Marcelo
>


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Kostas Sakellis
If an argument here is the ongoing build/maintenance burden, I think we
should seriously consider dropping Scala 2.10 in Spark 2.0. Supporting
Scala 2.10 is a bigger build/infrastructure burden than supporting JDK 7,
since you actually have to build different artifacts and test them, whereas
you can target Spark at JDK 7 and just test it on JDK 8.

In addition, as others pointed out, it seems like a bigger pain to drop
support for a JDK than for a Scala version. So if we are considering dropping
Java 7, which is a breaking change on the infra side, now is also a good
time to drop Scala 2.10 support.

Kostas

P.S. I haven't heard anyone on this thread fight for Scala 2.10 support.

On Thu, Mar 24, 2016 at 2:46 PM, Marcelo Vanzin  wrote:

> On Thu, Mar 24, 2016 at 2:41 PM, Jakob Odersky  wrote:
> > You can, but since it's going to be a maintainability issue I would
> > argue it is in fact a problem.
>
> Everything you choose to support generates a maintenance burden.
> Supporting 3 versions of Scala would be a huge maintenance burden, for
> example, as is supporting 2 versions of the JDK. Just note that,
> technically, we do support 2 versions of the jdk today; we just don't
> do a lot of automated testing on jdk 8 (PRs are all built with jdk 7
> AFAIK).
>
> So at the end it's a compromise. How many users will be affected by
> your choices? That's the question that I think is the most important.
> If switching to java 8-only means a bunch of users won't be able to
> upgrade, it means that Spark 2.0 will get less use than 1.x and will
> take longer to gain traction. That has other ramifications - such as
> less use means fewer issues might be found and the overall quality may
> suffer in the beginning of this transition.
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


SPARK-13843 Next steps

2016-03-22 Thread Kostas Sakellis
Hello all,

I'd like to close out the discussion on SPARK-13843 by polling the
community on which components we should seriously consider adding
back to Apache Spark. For reference, here are the modules that were removed
as part of SPARK-13843 and pushed to: https://github.com/spark-packages

   - streaming-flume
   - streaming-akka
   - streaming-mqtt
   - streaming-zeromq
   - streaming-twitter

For us, we'd like to see the streaming-flume added back to Apache Spark.

Thanks,
Kostas


Re: A proposal for Spark 2.0

2015-12-08 Thread Kostas Sakellis
I'd also like to make it a requirement that Spark 2.0 have stable
DataFrame and Dataset APIs - we should not leave these APIs experimental in
the 2.0 release. We already know of at least one breaking change we need to
make to DataFrames; now's the time to make any other changes we need to
stabilize these APIs. Is there anything we can do to make us feel more
comfortable about the Dataset and DataFrame APIs before the 2.0 release?

I've also been thinking that in Spark 2.0, we might want to consider strict
classpath isolation for user applications. Hadoop 3 is moving in this
direction. We could, for instance, run all user applications in their own
classloader that only inherits very specific classes from Spark (i.e., public
APIs). This will require user apps to explicitly declare their dependencies,
as there won't be any accidental class leaking anymore. We do something
like this with the *userClassPathFirst options, but it is not as strict as what
I described. This is a breaking change, but I think it will help with
eliminating weird classpath incompatibility issues between user
applications and Spark system dependencies.
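
To make the idea concrete, here is a rough, hypothetical sketch (not Spark code) of the kind of classloader this implies: only a whitelist of package prefixes is delegated to the Spark/system loader, and everything else must resolve from the application's own jars.

```
import java.net.{URL, URLClassLoader}

// Hypothetical illustration of strict isolation. Classes under the allowed
// prefixes (the JDK, Scala, and Spark's public API) are loaded from the
// Spark/system loader; everything else must come from the user's own jars.
class IsolatingClassLoader(
    userJars: Array[URL],
    sparkLoader: ClassLoader,
    allowedPrefixes: Seq[String] = Seq("java.", "scala.", "org.apache.spark."))
  extends URLClassLoader(userJars, null) {

  override protected def loadClass(name: String, resolve: Boolean): Class[_] = {
    if (allowedPrefixes.exists(name.startsWith)) {
      sparkLoader.loadClass(name)      // shared API classes come from Spark
    } else {
      super.loadClass(name, resolve)   // user classes come only from userJars
    }
  }
}
```

By contrast, today's userClassPathFirst loaders try the user jars first but still fall back to Spark's classpath, which is roughly the gap described above.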

Thoughts?

Kostas


On Fri, Dec 4, 2015 at 3:28 AM, Sean Owen  wrote:

> To be clear-er, I don't think it's clear yet whether a 1.7 release
> should exist or not. I could see both making sense. It's also not
> really necessary to decide now, well before a 1.6 is even out in the
> field. Deleting the version lost information, and I would not have
> done that given my reply. Reynold maybe I can take this up with you
> offline.
>
> On Thu, Dec 3, 2015 at 6:03 PM, Mark Hamstra 
> wrote:
> > Reynold's post from Nov. 25:
> >
> >> I don't think we should drop support for Scala 2.10, or make it harder
> in
> >> terms of operations for people to upgrade.
> >>
> >> If there are further objections, I'm going to bump remove the 1.7
> version
> >> and retarget things to 2.0 on JIRA.
> >
> >
> > On Thu, Dec 3, 2015 at 12:47 AM, Sean Owen  wrote:
> >>
> >> Reynold, did you (or someone else) delete version 1.7.0 in JIRA? I
> >> think that's premature. If there's a 1.7.0 then we've lost info about
> >> what it would contain. It's trivial at any later point to merge the
> >> versions. And, since things change and there's not a pressing need to
> >> decide one way or the other, it seems fine to at least collect this
> >> info like we have things like "1.4.3" that may never be released. I'd
> >> like to add it back?
> >>
> >> On Thu, Nov 26, 2015 at 9:45 AM, Sean Owen  wrote:
> >> > Maintaining both a 1.7 and 2.0 is too much work for the project, which
> >> > is over-stretched now. This means that after 1.6 it's just small
> >> > maintenance releases in 1.x and no substantial features or evolution.
>> >> > This means that the "in progress" APIs in 1.x will stay that way,
> >> > unless one updates to 2.x. It's not unreasonable, but means the update
> >> > to the 2.x line isn't going to be that optional for users.
> >> >
> >> > Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
> >> > it for a couple years, note. 2.10 is still used today, but that's the
> >> > point of the current stable 1.x release in general: if you want to
> >> > stick to current dependencies, stick to the current release. Although
> >> > I think that's the right way to think about support across major
> >> > versions in general, I can see that 2.x is more of a required update
> >> > for those following the project's fixes and releases. Hence may indeed
> >> > be important to just keep supporting 2.10.
> >> >
> >> > I can't see supporting 2.12 at the same time (right?). Is that a
>> >> > concern? It will be long since GA by the time 2.x is first released.
> >> >
> >> > There's another fairly coherent worldview where development continues
> >> > in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
> >> > 2.0 is delayed somewhat into next year, and by that time supporting
> >> > 2.11+2.12 and Java 8 looks more feasible and more in tune with
> >> > currently deployed versions.
> >> >
> >> > I can't say I have a strong view but I personally hadn't imagined 2.x
> >> > would start now.
> >> >
> >> >
> >> > On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin 
> >> > wrote:
> >> >> I don't think we should drop support for Scala 2.10, or make it
> harder
> >> >> in
> >> >> terms of operations for people to upgrade.
> >> >>
> >> >> If there are further objections, I'm going to bump remove the 1.7
> >> >> version
> >> >> and retarget things to 2.0 on JIRA.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: A proposal for Spark 2.0

2015-11-13 Thread Kostas Sakellis
We have veered off the topic of Spark 2.0 a little bit here - yes we can
talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like to
propose we have one more 1.x release after Spark 1.6. This will allow us to
stabilize a few of the new features that were added in 1.6:

1) the experimental Datasets API
2) the new unified memory manager.

I understand our goal for Spark 2.0 is to offer an easy transition but
there will be users that won't be able to seamlessly upgrade given what we
have discussed as in scope for 2.0. For these users, having a 1.x release
with these new features/APIs stabilized will be very beneficial. This might
make Spark 1.7 a lighter release but that is not necessarily a bad thing.

Any thoughts on this timeline?

Kostas Sakellis



On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Agree, more features/apis/optimization need to be added in DF/DS.
>
>
>
> I mean, we need to think about what kind of RDD APIs we have to provide to
> developers; maybe the fundamental APIs are enough, like the ShuffledRDD
> etc. But PairRDDFunctions is probably not in this category, as we can do the
> same thing easily with DF/DS, with even better performance.
>
>
>
> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
> *Sent:* Friday, November 13, 2015 11:23 AM
> *To:* Stephen Boesch
>
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Hmmm... to me, that seems like precisely the kind of thing that argues for
> retaining the RDD API but not as the first thing presented to new Spark
> developers: "Here's how to use groupBy with DataFrames... Until the
> optimizer is more fully developed, that won't always get you the best
> performance that can be obtained.  In these particular circumstances, ...,
> you may want to use the low-level RDD API while setting
> preservesPartitioning to true.  Like this"
>
>
>
> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> My understanding is that the RDDs presently have more support for
> complete control of partitioning, which is a key consideration at scale.
> While partitioning control is still piecemeal in DF/DS, it would seem
> premature to make RDDs a second-tier approach to Spark dev.
>
>
>
> An example is the use of groupBy when we know that the source relation
> (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL
> still does not allow that knowledge to be applied to the optimizer, so
> a full shuffle will be performed. However, in the native RDD API we can use
> preservesPartitioning=true.
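
A small sketch of the RDD-side technique referred to here, assuming a pair RDD that has already been partitioned on its key (names are illustrative):

```
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// `events` is assumed to be an RDD[(String, Int)].
def aggregateWithoutExtraShuffle(events: RDD[(String, Int)]): RDD[(String, Int)] = {
  val partitioned = events.partitionBy(new HashPartitioner(200)) // one shuffle

  // preservesPartitioning = true tells Spark the keys were not changed, so the
  // partitioner is kept and the reduceByKey below reuses it instead of
  // shuffling again.
  val filtered = partitioned.mapPartitions(
    iter => iter.filter { case (_, value) => value >= 0 },
    preservesPartitioning = true)

  filtered.reduceByKey(_ + _)
}
```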
>
>
>
> 2015-11-12 17:42 GMT-08:00 Mark Hamstra <m...@clearstorydata.com>:
>
> The place of the RDD API in 2.0 is also something I've been wondering
> about.  I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider.  The RDD API came well before
> DataFrames and DataSets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs --
> or at least to deal with them early.  What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs.  In this scheme, DataFrames and DataSets would be
> introduced and fully addressed before RDDs.  They would be presented as the
> normal/default/standard way to do things in Spark.  RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames or
> DataSets don't fully fit.
>
>
>
> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> I am not sure what the best practice is for this specific problem, but it’s
> really worth thinking about in 2.0, as it is a painful issue for lots of
> users.
>
>
>
> By the way, is it also an opportunity to deprecate the RDD API (or make it
> internal API only)? A lot of its functionality overlaps with
> DataFrame or Dataset.
>
>
>
> Hao
>
>
>
> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
> *Sent:* Friday, November 13, 2015 5:27 AM
> *To:* Nicholas Chammas
> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
> Reynold Xin
>
>
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> I know we want to keep breaking changes to a minimum but I'm hoping that
> with Spark 2.0 we can also look at better classpath isolation with user
> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
> setting it true by default, and not allow any spark transitive dependencies
> to leak into user code. For backwards compatibility we can have a whi

Re: A proposal for Spark 2.0

2015-11-12 Thread Kostas Sakellis
I know we want to keep breaking changes to a minimum but I'm hoping that
with Spark 2.0 we can also look at better classpath isolation with user
programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
setting it true by default, and not allow any spark transitive dependencies
to leak into user code. For backwards compatibility we can have a whitelist
if we want, but it'd be good if we start requiring user apps to explicitly
pull in all their dependencies. From what I can tell, Hadoop 3 is also
moving in this direction.
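
For reference, the existing (opt-in, non-strict) behaviour mentioned above can already be enabled per application today; a minimal sketch:

```
import org.apache.spark.SparkConf

// Prefer classes from the user's jars over Spark's bundled dependencies.
// This is child-first loading with fallback to Spark's classpath, not the
// hard isolation proposed above.
val conf = new SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")
```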

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>
> On that note of deprecating stuff, it might be good to deprecate some
> things in 2.0 without removing or replacing them immediately. That way 2.0
> doesn’t have to wait for everything that we want to deprecate to be
> replaced all at once.
>
> Nick
> ​
>
> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
>> Parameter Server is a new feature and thus does not match the goal of 2.0,
>> which is “to fix things that are broken in the current API and remove certain
>> deprecated APIs”. At the same time I would be happy to have that feature.
>>
>>
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>>
>>
>> Best regards, Alexander
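
To illustrate the "two packages" point above, a side-by-side sketch of the RDD-based and DataFrame-based APIs for the same task (k-means); input names are hypothetical:

```
// RDD-based API: org.apache.spark.mllib works on RDD[Vector].
import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def clusterWithMllib(points: RDD[Vector]) =
  MLlibKMeans.train(points, 2, 20) // k = 2, maxIterations = 20

// DataFrame-based API: org.apache.spark.ml works on a "features" column
// and composes with ML Pipelines.
import org.apache.spark.ml.clustering.{KMeans => MLKMeans}
import org.apache.spark.sql.DataFrame

def clusterWithMl(df: DataFrame) =
  new MLKMeans().setK(2).setMaxIter(20).setFeaturesCol("features").fit(df)
```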
>>
>>
>>
>> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
>> *Sent:* Thursday, November 12, 2015 7:28 AM
>> *To:* wi...@qq.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Being specific to Parameter Server, I think the current agreement is that
>> PS shall exist as a third-party library instead of a component of the core
>> code base, isn’t it?
>>
>>
>>
>> Best,
>>
>>
>>
>> --
>>
>> Nan Zhu
>>
>> http://codingcat.me
>>
>>
>>
>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>
>> Who has ideas about machine learning? Spark is missing some features for
>> machine learning, for example the parameter server.
>>
>>
>>
>>
>>
>> On Nov 12, 2015, at 05:32, Matei Zaharia wrote:
>>
>>
>>
>> I like the idea of popping out Tachyon to an optional component too to
>> reduce the number of dependencies. In the future, it might even be useful
>> to do this for Hadoop, but it requires too many API changes to be worth
>> doing now.
>>
>>
>>
>> Regarding Scala 2.12, we should definitely support it eventually, but I
>> don't think we need to block 2.0 on that because it can be added later too.
>> Has anyone investigated what it would take to run on there? I imagine we
>> don't need many code changes, just maybe some REPL stuff.
>>
>>
>>
>> Needless to say, but I'm all for the idea of making "major" releases as
>> undisruptive as possible in the model Reynold proposed. Keeping everyone
>> working with the same set of releases is super important.
>>
>>
>>
>> Matei
>>
>>
>>
>> On Nov 11, 2015, at 4:58 AM, Sean Owen  wrote:
>>
>>
>>
>> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin 
>> wrote:
>>
>> to the Spark community. A major release should not be very different from
>> a
>>
>> minor release and should not be gated based on new features. The main
>>
>> purpose of a major release is an opportunity to fix things that are broken
>>
>> in the current API and remove certain deprecated APIs (examples follow).
>>
>>
>>
>> Agree with this stance. Generally, a major release might also be a
>>
>> time to replace some big old API or implementation with a new one, but
>>
>> I don't see obvious candidates.
>>
>>
>>
>> I wouldn't mind turning attention to 2.x sooner than later, unless
>>
>> there's a fairly good reason to continue adding features in 1.x to a
>>
>> 1.7 release. The scope as of 1.6 is already pretty darned big.
>>
>>
>>
>>
>>
>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>> but
>>
>> it has been end-of-life.
>>
>>
>>
>> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>>
>> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>>
>> dropping 2.10. Otherwise it's supported for 2 more years.
>>
>>
>>
>>
>>
>> 2. Remove Hadoop 1 support.
>>
>>
>>
>> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>>
>> sort of 'alpha' and 'beta' releases) and even <2.6.
>>

Re: A proposal for Spark 2.0

2015-11-10 Thread Kostas Sakellis
+1 on a lightweight 2.0

What is the thinking around the 1.x line after Spark 2.0 is released? If
not terminated, how will we determine what goes into each major version
line? Will 1.x only be for stability fixes?

Thanks,
Kostas

On Tue, Nov 10, 2015 at 3:41 PM, Patrick Wendell  wrote:

> I also feel the same as Reynold. I agree we should minimize API breaks and
> focus on fixing things around the edge that were mistakes (e.g. exposing
> Guava and Akka) rather than any overhaul that could fragment the community.
> Ideally a major release is a lightweight process we can do every couple of
> years, with minimal impact for users.
>
> - Patrick
>
> On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> > For this reason, I would *not* propose doing major releases to break
>> substantial API's or perform large re-architecting that prevent users from
>> upgrading. Spark has always had a culture of evolving architecture
>> incrementally and making changes - and I don't think we want to change this
>> model.
>>
>> +1 for this. The Python community went through a lot of turmoil over the
>> Python 2 -> Python 3 transition because the upgrade process was too painful
>> for too long. The Spark community will benefit greatly from our explicitly
>> looking to avoid a similar situation.
>>
>> > 3. Assembly-free distribution of Spark: don’t require building an
>> enormous assembly jar in order to run Spark.
>>
>> Could you elaborate a bit on this? I'm not sure what an assembly-free
>> distribution means.
>>
>> Nick
>>
>> On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin  wrote:
>>
>>> I’m starting a new thread since the other one got intermixed with
>>> feature requests. Please refrain from making feature requests in this
>>> thread. Not that we shouldn’t be adding features, but we can always add
>>> features in 1.7, 2.1, 2.2, ...
>>>
>>> First - I want to propose a premise for how to think about Spark 2.0 and
>>> major releases in Spark, based on discussion with several members of the
>>> community: a major release should be low overhead and minimally disruptive
>>> to the Spark community. A major release should not be very different from a
>>> minor release and should not be gated based on new features. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs (examples follow).
>>>
>>> For this reason, I would *not* propose doing major releases to break
>>> substantial API's or perform large re-architecting that prevent users from
>>> upgrading. Spark has always had a culture of evolving architecture
>>> incrementally and making changes - and I don't think we want to change this
>>> model. In fact, we’ve released many architectural changes on the 1.X line.
>>>
>>> If the community likes the above model, then to me it seems reasonable
>>> to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
>>> immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A
>>> cadence of major releases every 2 years seems doable within the above model.
>>>
>>> Under this model, here is a list of example things I would propose doing
>>> in Spark 2.0, separated into APIs and Operation/Deployment:
>>>
>>>
>>> APIs
>>>
>>> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>>> Spark 1.x.
>>>
>>> 2. Remove Akka from Spark’s API dependency (in streaming), so user
>>> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
>>> about user applications being unable to use Akka due to Spark’s dependency
>>> on Akka.
>>>
>>> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>>>
>>> 4. Better class package structure for low level developer API’s. In
>>> particular, we have some DeveloperApi (mostly various listener-related
>>> classes) added over the years. Some packages include only one or two public
>>> classes but a lot of private classes. A better structure is to have public
>>> classes isolated to a few public packages, and these public packages should
>>> have minimal private classes for low level developer APIs.
>>>
>>> 5. Consolidate task metric and accumulator API. Although having some
>>> subtle differences, these two are very similar but have completely
>>> different code paths.
>>>
>>> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
>>> moving them to other package(s). They are already used beyond SQL, e.g. in
>>> ML pipelines, and will be used by streaming also.
>>>
>>>
>>> Operation/Deployment
>>>
>>> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
>>> but it has been end-of-life.
>>>
>>> 2. Remove Hadoop 1 support.
>>>
>>> 3. Assembly-free distribution of Spark: don’t require building an
>>> enormous assembly jar in order to run Spark.
>>>
>>>
>