I agree, we should push forward on this. I think there is enough consensus to call a vote, unless someone else thinks that there is more to discuss?
rb

On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
> Now that Spark Summit Europe is over, are any committers interested in
> moving forward with this?
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>
> Or are we going to let this discussion die on the vine?
>
> On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> <tomasz.gaw...@outlook.com> wrote:
> > Maybe my mail was not clear enough.
> >
> > I didn't want to write "let's focus on Flink" or any other framework.
> > The idea with the benchmarks was to show two things:
> >
> > - why some people are doing bad PR for Spark
> >
> > - how, in an easy way, we can change that and show that Spark is
> > still on top
> >
> > No more, no less. Benchmarks will be helpful, but I don't think
> > they're the most important thing in Spark :) On the Spark main page
> > there is still the "Spark vs Hadoop" chart. It is important to show
> > that the framework is not the same Spark with just another API, but
> > much faster and more optimized, comparable to or even faster than
> > other frameworks.
> >
> > About real-time streaming: I think it would be good to see it in
> > Spark. I really like the current Spark model, but many voices say
> > "we need more", and the community should also listen to them and try
> > to help. With SIPs that would be easier; I just posted this example
> > as a "thing that may be changed with a SIP".
> >
> > I really like the unification via Datasets, but there are a lot of
> > algorithms inside - let's make an easy API, but with a strong
> > background (articles, benchmarks, descriptions, etc.) that shows
> > that Spark is still a modern framework.
> > Maybe now my intention will be clearer :) As I said, the
> > organizational ideas were already mentioned and I agree with them;
> > my mail was just to show some aspects from my side, the side of a
> > developer and a person who is trying to help others with Spark (via
> > StackOverflow or other ways).
> >
> > Pozdrawiam / Best regards,
> >
> > Tomasz
> >
> > ________________________________
> > From: Cody Koeninger <c...@koeninger.org>
> > Sent: October 17, 2016 16:46
> > To: Debasish Das
> > Cc: Tomasz Gawęda; dev@spark.apache.org
> > Subject: Re: Spark Improvement Proposals
> >
> > I think narrowly focusing on Flink or benchmarks is missing my point.
> >
> > My point is evolve or die. Spark's governance and organization are
> > hampering its ability to evolve technologically, and that needs to
> > change.
> >
> > On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
> > <debasish.da...@gmail.com> wrote:
> >> Thanks Cody for bringing up a valid point. I picked up Spark in 2014
> >> as soon as I looked into it, since compared to writing Java
> >> map-reduce and Cascading code, Spark made writing distributed code
> >> fun. But now that we have gone deeper with Spark and the real-time
> >> streaming use case is getting more prominent, I think it is time to
> >> bring a messaging model in conjunction with the batch/micro-batch
> >> API that Spark is good at. Close integration of akka-streams with
> >> Spark's micro-batching APIs looks like a great direction to stay in
> >> the game with Apache Flink. Spark 2.0 integrated streaming with
> >> batch under the assumption that micro-batching is sufficient to run
> >> SQL commands on a stream, but do we really have time to do SQL
> >> processing on streaming data within 1-2 seconds?
> >>
> >> After reading the email chain, I started to look into the Flink
> >> documentation, and if you compare it with the Spark documentation, I
> >> think we have major work to do detailing Spark internals, so that
> >> more people from the community start to take an active role in
> >> improving things and Spark stays strong compared to Flink.
> >>
> >> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>
> >> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>
> >> Spark is no longer an engine that works only for micro-batch and
> >> batch. We (and I am sure many others) are pushing Spark as an engine
> >> for stream and query processing. We need to make it a
> >> state-of-the-art engine for high-speed streaming data and user
> >> queries as well!
> >>
> >> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >> <tomasz.gaw...@outlook.com> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> I'm quite late with my answer, but I think my suggestions may help
> >>> a little bit. :) Many technical and organizational topics were
> >>> mentioned, but I want to focus on the negative posts about Spark
> >>> and about "haters".
> >>>
> >>> I really like Spark. Ease of use, speed, a very good community -
> >>> it's all here. But every project has to fight on the "framework
> >>> market" to stay number one. I'm following many Spark and Big Data
> >>> communities; maybe my mail will inspire someone :)
> >>>
> >>> You (every Spark developer; so far I haven't had enough time to
> >>> start contributing to Spark) have done an excellent job. So why are
> >>> some people saying that Flink (or another framework) is better, as
> >>> was posted on this mailing list? Not because that framework is
> >>> better in all cases. In my opinion, many of these discussions were
> >>> started after Flink's marketing-like posts. Please look at the
> >>> StackOverflow "Flink vs ..." posts: almost every one is "won" by
> >>> Flink.
> >>> The answers sometimes say nothing about other frameworks;
> >>> Flink's users (often PMC members) just post the same information
> >>> about real-time streaming, delta iterations, etc. It looks smart
> >>> and is very often marked as the answer, even if - in my opinion -
> >>> it doesn't tell the whole truth.
> >>>
> >>> My suggestion: I don't have enough money or knowledge to perform a
> >>> large performance test. Maybe some company that supports Spark
> >>> (Databricks, Cloudera? - just saying, you're the most visible in
> >>> the community :) ) could run performance tests of:
> >>>
> >>> - the streaming engine - Spark will probably lose because of the
> >>> micro-batch model, though the difference should now be much smaller
> >>> than in previous versions
> >>>
> >>> - Machine Learning models
> >>>
> >>> - batch jobs
> >>>
> >>> - graph jobs
> >>>
> >>> - SQL queries
> >>>
> >>> People will see that Spark is evolving and is still a modern
> >>> framework, because after reading the posts mentioned above people
> >>> may think "it is outdated, the future is in framework X".
> >>>
> >>> Matei Zaharia posted an excellent blog post about how Spark
> >>> Structured Streaming beats every other framework in terms of ease
> >>> of use and reliability. Performance tests, done in various
> >>> environments (for example: a laptop, a small 2-node cluster, a
> >>> 10-node cluster, a 20-node cluster), could also be very good
> >>> marketing material: "hey, you say you're better, but Spark is still
> >>> faster and is getting even faster!". This would be based on facts
> >>> (just numbers), not opinions. It would be good for companies, for
> >>> marketing purposes, and for every Spark developer.
> >>>
> >>> Second: real-time streaming. I wrote some time ago about real-time
> >>> streaming support in Spark Structured Streaming. Some work would be
> >>> needed to make SSS lower-latency, but I think it's possible.
> >>> Maybe Spark could look at Gearpump, which is also built on top of
> >>> Akka? I don't know yet; it's a good topic for a SIP. However, I
> >>> think that Spark should have real-time streaming support. Currently
> >>> I see many posts/comments saying "Spark has too much latency".
> >>> Spark Streaming is doing a very good job with micro-batches, but I
> >>> think it is possible to add more real-time processing as well.
> >>>
> >>> Other people have said much more, and I agree with the SIP
> >>> proposal. I'm also happy that the PMC members are not saying they
> >>> won't listen to users, but that they really want to make Spark
> >>> better for every user.
> >>>
> >>> What do you think about these two topics? I'm especially looking at
> >>> Cody (who started this thread) and the PMC :)
> >>>
> >>> Pozdrawiam / Best regards,
> >>>
> >>> Tomasz
> >>>
> >>> On 2016-10-07 at 04:51, Cody Koeninger wrote:
> >>> > I love Spark. 3 or 4 years ago it was the first distributed
> >>> > computing environment that felt usable, and the community was
> >>> > welcoming.
> >>> >
> >>> > But I just got back from the Reactive Summit, and this is what I
> >>> > observed:
> >>> >
> >>> > - Industry leaders on stage making fun of Spark's streaming model
> >>> > - Open source project leaders saying they looked at Spark's
> >>> > governance as a model to avoid
> >>> > - Users saying they chose Flink because it was technically
> >>> > superior and they couldn't get any answers on the Spark mailing
> >>> > lists
> >>> >
> >>> > Whether you agree with the substance of any of this, when this
> >>> > stuff gets repeated enough, people will believe it.
> >>> >
> >>> > Right now Spark is suffering from its own success, and I think
> >>> > something needs to change.
> >>> >
> >>> > - We need a clear process for planning significant changes to the
> >>> > codebase.
> >>> > I'm not saying you need to adopt Kafka Improvement Proposals
> >>> > exactly, but you need a documented process with a clear outcome
> >>> > (e.g. a vote).
> >>> > Passing around Google docs after an implementation has largely
> >>> > been decided on doesn't cut it.
> >>> >
> >>> > - All technical communication needs to be public.
> >>> > Things getting decided in private chat, or when 1/3 of the
> >>> > committers work for the same company and can just talk to each
> >>> > other... Yes, it's convenient, but it's ultimately detrimental to
> >>> > the health of the project.
> >>> > The way structured streaming has played out has shown that there
> >>> > are significant technical blind spots (myself included).
> >>> > One way to address that is to get the people who have domain
> >>> > knowledge involved, and listen to them.
> >>> >
> >>> > - We need more committers, and more committer diversity.
> >>> > Per committer there are, what, more than 20 contributors and 10
> >>> > new jira tickets a month? It's too much.
> >>> > There are people (I am _not_ referring to myself) who have been
> >>> > around for years, contributed thousands of lines of code, helped
> >>> > educate the public around Spark... and yet are never going to be
> >>> > voted in.
> >>> >
> >>> > - We need a clear process for managing volunteer work.
> >>> > Too many tickets sit around unowned, unclosed, uncertain.
> >>> > If someone proposes something and it isn't up to snuff, tell them
> >>> > and close it. It may be blunt, but it's clearer than a "silent
> >>> > no".
> >>> > If someone wants to work on something, let them own the ticket
> >>> > and set a deadline. If they don't meet it, close it or reassign
> >>> > it.
> >>> >
> >>> > This is not me putting on an Apache Bureaucracy hat. This is me
> >>> > saying, as a fellow hacker and loyal dissenter, that something is
> >>> > wrong with the culture and process.
> >>> >
> >>> > Please, let's change it.
> >>> >
> >>> > ---------------------------------------------------------------------
> >>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >
> >>
> >>
> >
> >

--
Ryan Blue
Software Engineer
Netflix