Regarding the Python 3.x upgrade referenced earlier: some people have already gone down that path of upgrading:
https://blogs.dropbox.com/tech/2018/09/how-we-rolled-out-one-of-the-largest-python-3-migrations-ever

They describe some good reasons.

Stavros

On Tue, Sep 18, 2018 at 6:35 PM, Erik Erlandson <eerla...@redhat.com>
wrote:

> I like the notion of empowering cross-platform bindings.
>
> The trend of computing frameworks seems to be that all APIs gradually
> converge on a stable attractor which could be described as "data frames
> and SQL". Spark's early API design was RDD-focused, but these days the
> center of gravity is all about DataFrame (Python's prevalence, combined
> with its lack of a static type system, substantially dilutes the
> benefits of Dataset for any library development that aspires to both
> JVM and python support).
>
> I can imagine optimizing the developer layers of Spark APIs so that
> cross-platform support, and also third-party support for new and
> existing Spark bindings, would be maximized for "parallelizable
> dataframe + SQL". Another of Spark's strengths is its ability to
> federate heterogeneous data sources, and making cross-platform bindings
> easy for that is desirable.
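To make the point above concrete, here is a minimal PySpark sketch, not
taken from the thread (the input path and column name are hypothetical,
and a SparkSession is assumed), showing the same aggregation in both
styles. The RDD version captures opaque Python lambdas that must be
shipped to and evaluated in Python worker processes, while the DataFrame
version builds a logical plan the JVM can optimize regardless of which
language binding produced it:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # RDD style: the lambdas are opaque to the query optimizer and must
    # run in separate Python worker processes.
    rdd_counts = (sc.textFile("events.csv")  # hypothetical input file
                  .map(lambda line: (line.split(",")[0], 1))
                  .reduceByKey(lambda a, b: a + b))

    # DataFrame style: builds a language-agnostic logical plan, so the
    # same query optimizes identically from Python, Scala, R, or SQL.
    df_counts = (spark.read.csv("events.csv", header=True)
                 .groupBy("user_id")  # hypothetical column name
                 .agg(F.count("*").alias("n")))

That asymmetry is a large part of why the DataFrame/SQL layer is the
natural surface for new cross-platform bindings.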
> On Sun, Sep 16, 2018 at 1:02 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> It's not splitting hairs, Erik. It's actually very close to something
>> that I think deserves some discussion (perhaps on a separate thread).
>> What I've been thinking about also concerns API "friendliness" or
>> style. The original RDD API was very intentionally modeled on the
>> Scala parallel collections API. That made it quite friendly for some
>> Scala programmers, but not as much so for users of the other language
>> APIs when they eventually came about. Similarly, the DataFrame API
>> drew a lot from pandas and R, so it is relatively friendly for those
>> used to those abstractions. Of course, the Spark SQL API is modeled
>> closely on HiveQL and standard SQL. The new barrier scheduling draws
>> inspiration from MPI. With all of these models and sources of
>> inspiration, as well as multiple language targets, there isn't really
>> a strong sense of coherence across Spark -- I mean, even though one of
>> the key advantages of Spark is the ability to do within a single
>> framework things that would otherwise require multiple frameworks,
>> actually doing that requires more programming styles and design
>> abstractions than are strictly necessary, even when writing Spark code
>> in just a single language.
>>
>> For me, that raises the question of whether we want to start
>> designing, implementing and supporting APIs that are designed to be
>> more consistent, friendly and idiomatic to particular languages and
>> abstractions -- e.g. an API covering all of Spark that is designed to
>> look and feel as much as possible like "normal" code to a Python
>> programmer, another that looks and feels more like "normal" Java code,
>> another for Scala, etc. That's a lot more work and support burden than
>> the current approach, where sometimes it feels like you are writing
>> "normal" code for your preferred programming environment, and
>> sometimes it feels like you are trying to interface with something
>> foreign; but underneath, it hopefully isn't too hard for those writing
>> the implementation code below the APIs, and it is not too hard to
>> maintain multiple language bindings that are each fairly lightweight.
>>
>> It's a cost-benefit judgement, of course, whether APIs that are
>> heavier (in terms of implementing and maintaining) and friendlier (for
>> end users) are worth doing, and maybe some of these "friendlier" APIs
>> can be done outside of Spark itself (imo, Frameless is doing a very
>> nice job for the parts of Spark that it is currently covering --
>> https://github.com/typelevel/frameless); but what we have currently is
>> a bit too ad hoc and fragmentary for my taste.
>>
>> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson <eerla...@redhat.com>
>> wrote:
>>
>>> I am probably splitting hairs too finely, but I was considering the
>>> difference between improvements to the jvm side (py4j and the
>>> scala/java code) that would make it easier to write the python layer
>>> ("python-friendly api"), and actual improvements to the python layers
>>> ("friendly python api").
>>>
>>> They're not mutually exclusive of course, and both are worth working
>>> on. But it's *possible* to improve either without the other.
>>>
>>> Stub files look like a great solution for type annotations (see the
>>> sketch at the end of this thread), maybe even if only python 3 is
>>> supported.
>>>
>>> I definitely agree that any decision to drop python 2 should not be
>>> taken lightly. Anecdotally, I'm seeing an increase in python
>>> developers announcing that they are dropping support for python 2
>>> (and loving it). As people have already pointed out, if we don't drop
>>> python 2 for spark 3.0, we're stuck with it until 4.0, which would
>>> place spark in a possibly-awkward position of supporting python 2 for
>>> some time after it goes EOL.
>>>
>>> Under the current release cadence, spark 3.0 will land some time in
>>> early 2019, at which point it will be mere months until EOL for py2.
>>>
>>> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson <eerla...@redhat.com>
>>>> wrote:
>>>>
>>>>> To be clear, is this about "python-friendly API" or "friendly
>>>>> python API"?
>>>>>
>>>> Well, what would you consider to be different between those two
>>>> statements? I think it would be good to be a bit more explicit, but
>>>> I don't think we should necessarily limit ourselves.
>>>>
>>>>> On the python side, it might be nice to take advantage of static
>>>>> typing. That requires python 3.6, but with python 2 going EOL,
>>>>> spark 3.0 might be a good opportunity to jump on the python-3-only
>>>>> train.
>>>>>
>>>> I think we can make types sort of work without ditching 2 (the types
>>>> would only work in 3, but it would still function in 2). Ditching 2
>>>> entirely would be a big thing to consider; I honestly hadn't been
>>>> considering that, but that could be from just spending so much time
>>>> maintaining a 2/3 code base. I'd suggest reaching out to user@
>>>> before making that kind of change.
>>>>
>>>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau
>>>>> <hol...@pigscanfly.ca> wrote:
>>>>>
>>>>>> Since we're talking about Spark 3.0 in the near future (and since
>>>>>> some recent conversation on a proposed change reminded me), I
>>>>>> wanted to open up the floor and see if folks have any ideas on how
>>>>>> we could make a more Python-friendly API for 3.0. I'm planning on
>>>>>> taking some time to look at other systems in the solution space
>>>>>> and see what we might want to learn from them, but I'd love to
>>>>>> hear what other folks are thinking too.
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
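As a postscript to Erik's stub-file suggestion above: the fragment below
is a hypothetical sketch of what a .pyi stub for a few
pyspark.sql.DataFrame methods could look like; it is not the contents of
any actual stub package. Because stubs live in separate files that only
type checkers and IDEs read, the runtime modules are untouched, which is
how annotations can serve python 3 tooling while the library itself
still functions in python 2 (Holden's point above):

    # dataframe.pyi -- hypothetical stub sketch, not shipped with pyspark.
    # Nothing here executes at runtime; tools like mypy read this file in
    # place of the .py module, so the runtime code stays 2/3 compatible.
    from typing import Union

    from pyspark.sql.column import Column

    class DataFrame:
        def select(self, *cols: Union[Column, str]) -> "DataFrame": ...
        def filter(self, condition: Union[Column, str]) -> "DataFrame": ...
        def limit(self, num: int) -> "DataFrame": ...
        def count(self) -> int: ...

Since stubs can also be published as a separate package, they could even
be developed and versioned independently of a spark release.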