Re: Spark Improvement Proposals

Tomasz Gawęda Mon, 17 Oct 2016 08:05:48 -0700

Maybe my mail was not clear enough.


I didn't want to write "let's focus on Flink" or any other framework. The idea 
with the benchmarks was to show two things:

- why some people are giving Spark bad PR

- how we can easily change that and show that Spark is still on top


No more, no less. Benchmarks will be helpful, but I don't think they're the 
most important thing in Spark :) The Spark main page still shows the 
"Spark vs Hadoop" chart. It is important to show that the framework is not 
just the same Spark with a different API, but much faster and better 
optimized, comparable to or even faster than other frameworks.


About real-time streaming: I think it would simply be good to have it in 
Spark. I really like the current Spark model, but there are many voices 
saying "we need more"; the community should listen to them as well and try to 
help them. With SIPs that would be easier. I only posted this as an example 
of a "thing that may be changed with a SIP".


I really like the unification via Datasets, but there are a lot of algorithms 
inside. Let's make an easy API, but with strong background material 
(articles, benchmarks, descriptions, etc.) that shows that Spark is still a 
modern framework.


Maybe my intention is clearer now :) As I said, organizational ideas were 
already mentioned and I agree with them. My mail was just to show some 
aspects from my side, that is, from the side of a developer and a person who 
tries to help others with Spark (via StackOverflow or other channels).


Best regards,

Tomasz


________________________________
From: Cody Koeninger <[email protected]>
Sent: 17 October 2016 16:46
To: Debasish Das
Cc: Tomasz Gawęda; [email protected]
Subject: Re: Spark Improvement Proposals

I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die.  Spark's governance and organization are
hampering its ability to evolve technologically, and they need to
change.

On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <[email protected]> wrote:
> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as
> soon as I looked into it, since compared to writing Java map-reduce and
> Cascading code, Spark made writing distributed code fun... But now, as we
> go deeper with Spark and the real-time streaming use case gets more
> prominent, I think it is time to bring in a messaging model in conjunction
> with the batch/micro-batch API that Spark is good at... Close integration
> of akka-streams with Spark's micro-batching APIs looks like a great
> direction to stay in the game with Apache Flink... Spark 2.0 integrated
> streaming with batch on the assumption that micro-batching is sufficient
> to run SQL commands on a stream, but do we really have time to do SQL
> processing on streaming data within 1-2 seconds?
>
> After reading the email chain, I started to look into the Flink
> documentation, and if you compare it with the Spark documentation, I think
> we have major work to do in detailing Spark internals, so that more people
> from the community start to take an active role in fixing the issues and
> Spark stays strong compared to Flink.
>
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>
> Spark is no longer just an engine for micro-batch and batch... We (and
> I am sure many others) are pushing Spark as an engine for stream and query
> processing... We need to make it a state-of-the-art engine for high-speed
> streaming data and user queries as well!
>
> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <[email protected]>
> wrote:
>>
>> Hi everyone,
>>
>> I'm quite late with my answer, but I think my suggestions may help a
>> little bit. :) Many technical and organizational topics were mentioned,
>> but I want to focus on the negative posts about Spark and on the "haters".
>>
>> I really like Spark. Ease of use, speed, a very good community - it's
>> all here. But every project has to "fight" on the "framework market"
>> to stay no. 1. I follow many Spark and Big Data communities, so
>> maybe my mail will inspire someone :)
>>
>> You (every Spark developer; so far I haven't had enough time to start
>> contributing to Spark) have done an excellent job. So why are some people
>> saying that Flink (or another framework) is better, as was posted on
>> this mailing list? No, not because that framework is better in all
>> cases. In my opinion, many of these discussions were started after
>> marketing-like posts about Flink. Please look at the StackOverflow
>> "Flink vs ...." posts: almost every one is "won" by Flink. The answers
>> sometimes say nothing about the other frameworks; Flink's users (often
>> PMC members) just post the same information about real-time streaming,
>> about delta iterations, etc. It looks smart and is very often marked as
>> the answer, even if - in my opinion - the whole truth wasn't told.
>>
>>
>> My suggestion: I don't have enough money or knowledge to perform a huge
>> performance test. Maybe some company that supports Spark (Databricks,
>> Cloudera? - just saying, you're the most visible in the community :) )
>> could perform a performance test of:
>>
>> - the streaming engine - Spark will probably lose because of its
>> micro-batch model, but the difference should now be much smaller than
>> in previous versions
>>
>> - Machine Learning models
>>
>> - batch jobs
>>
>> - Graph jobs
>>
>> - SQL queries
>>
>> People will see that Spark is evolving and is also a modern framework,
>> because after reading the posts mentioned above people may think "it is
>> outdated, the future is in framework X".
>>
>> Matei Zaharia posted an excellent blog post about how Spark Structured
>> Streaming beats every other framework in terms of ease of use and
>> reliability. Performance tests done in various environments (for
>> example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node
>> cluster) could also be very good marketing material to say "hey, you
>> say that you're better, but Spark is still faster and keeps getting
>> faster!". This would be based on facts (just numbers), not opinions. It
>> would be good for companies, for marketing purposes and for every Spark
>> developer.
>>
>>
>> Second: real-time streaming. Some time ago I wrote about real-time
>> streaming support in Spark Structured Streaming. Some work would have
>> to be done to make SSS lower-latency, but I think it's possible. Maybe
>> Spark could look at Gearpump, which is also built on top of Akka? I
>> don't know yet; it is a good topic for a SIP. However, I think that
>> Spark should have real-time streaming support. Currently I see many
>> posts/comments saying that "Spark has too much latency". Spark
>> Streaming does a very good job with micro-batches, but I think it is
>> possible to add more real-time processing as well.
>>
>> Other people have said much more, and I agree with the SIP proposal.
>> I'm also happy that the PMC members are not saying that they won't
>> listen to users, but really want to make Spark better for every user.
>>
>>
>> What do you think about these two topics? I'm looking especially at
>> Cody (who started this topic) and the PMC members :)
>>
>> Best regards,
>>
>> Tomasz
>>
>>
>> On 2016-10-07 at 04:51, Cody Koeninger wrote:
>> > I love Spark.  3 or 4 years ago it was the first distributed computing
>> > environment that felt usable, and the community was welcoming.
>> >
>> > But I just got back from the Reactive Summit, and this is what I
>> > observed:
>> >
>> > - Industry leaders on stage making fun of Spark's streaming model
>> > - Open source project leaders saying they looked at Spark's governance
>> > as a model to avoid
>> > - Users saying they chose Flink because it was technically superior
>> > and they couldn't get any answers on the Spark mailing lists
>> >
>> > Whether you agree with the substance of any of this, when this stuff
>> > gets repeated enough people will believe it.
>> >
>> > Right now Spark is suffering from its own success, and I think
>> > something needs to change.
>> >
>> > - We need a clear process for planning significant changes to the
>> > codebase.
>> > I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>> > but you need a documented process with a clear outcome (e.g. a vote).
>> > Passing around google docs after an implementation has largely been
>> > decided on doesn't cut it.
>> >
>> > - All technical communication needs to be public.
>> > Things getting decided in private chat, or when 1/3 of the committers
>> > work for the same company and can just talk to each other...
>> > Yes, it's convenient, but it's ultimately detrimental to the health of
>> > the project.
>> > The way structured streaming has played out has shown that there are
>> > significant technical blind spots (myself included).
>> > One way to address that is to get the people who have domain knowledge
>> > involved, and listen to them.
>> >
>> > - We need more committers, and more committer diversity.
>> > Per committer there are, what, more than 20 contributors and 10 new
>> > jira tickets a month?  It's too much.
>> > There are people (I am _not_ referring to myself) who have been around
>> > for years, contributed thousands of lines of code, helped educate the
>> > public around Spark... and yet are never going to be voted in.
>> >
>> > - We need a clear process for managing volunteer work.
>> > Too many tickets sit around unowned, unclosed, uncertain.
>> > If someone proposed something and it isn't up to snuff, tell them and
>> > close it.  It may be blunt, but it's clearer than "silent no".
>> > If someone wants to work on something, let them own the ticket and set
>> > a deadline. If they don't meet it, close it or reassign it.
>> >
>> > This is not me putting on an Apache Bureaucracy hat.  This is me
>> > saying, as a fellow hacker and loyal dissenter, something is wrong
>> > with the culture and process.
>> >
>> > Please, let's change it.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: [email protected]
>> >
>
>

