Re: Spark Improvement Proposals

Tomasz Gawęda Sun, 16 Oct 2016 13:31:25 -0700

Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a 
little bit. :) Many technical and organizational topics were mentioned, 
but I want to focus on the negative posts about Spark and on the "haters".


I really like Spark. Ease of use, speed, a very good community - it has 
everything. But every project has to "fight" on the "framework market" 
to stay no. 1. I follow many Spark and Big Data communities, so 
maybe my mail will inspire someone :)

You (every Spark developer; so far I haven't had enough time to start 
contributing to Spark) have done an excellent job. So why are some people 
saying that Flink (or another framework) is better, as was posted on 
this mailing list? No, not because that framework is better in all 
cases. In my opinion, many of these discussions were started after 
Flink marketing-like posts. Please look at the StackOverflow 
"Flink vs ...." posts: almost every one is "won" by Flink. Answers 
sometimes say nothing about other frameworks; Flink's users (often PMC 
members) just post the same information about real-time streaming, about 
delta iterations, etc. It looks smart and very often it is marked as the 
answer, even if - in my opinion - the whole truth wasn't told.


My suggestion: I don't have enough money and knowledge to perform a huge 
performance test. Maybe some company that supports Spark (Databricks, 
Cloudera? - just saying you're the most visible in the community :) ) 
could perform a performance test of:

- streaming engine - Spark will probably lose because of the micro-batch 
model, however the difference should now be much lower than in 
previous versions

- Machine Learning models

- batch jobs

- Graph jobs

- SQL queries

People will see that Spark is evolving and is also a modern framework, 
because after reading the posts mentioned above people may think "it is 
outdated, the future is in framework X".

Matei Zaharia posted an excellent blog post about how Spark Structured 
Streaming beats every other framework in terms of ease of use and 
reliability. Performance tests, done in various environments (for 
example: laptop, small 2-node cluster, 10-node cluster, 20-node 
cluster), could also be very good marketing material to say "hey, you're 
saying that you're better, but Spark is still faster and is still 
getting even faster!". This would be based on facts (just numbers), 
not opinions. It would be good for companies, for marketing purposes and 
for every Spark developer.


Second: real-time streaming. Some time ago I wrote about real-time 
streaming support in Spark Structured Streaming. Some work should be 
done to make SSS more low-latency, but I think it's possible. Maybe 
Spark could look at Gearpump, which is also built on top of Akka? I don't 
know yet; it is a good topic for a SIP. However, I think that Spark should 
have real-time streaming support. Currently I see many posts/comments 
saying that "Spark has too much latency". Spark Streaming does a very good 
job with micro-batches, however I think it is possible to also add more 
real-time processing.

Other people have said much more, and I agree with the SIP proposal. I'm 
also happy that the PMC members are not saying that they will not listen 
to users, but that they really want to make Spark better for every user.


What do you think about these two topics? I'm looking especially at Cody 
(who started this topic) and the PMC members :)

Best regards,

Tomasz


On 2016-10-07 at 04:51, Cody Koeninger wrote:
> I love Spark.  3 or 4 years ago it was the first distributed computing
> environment that felt usable, and the community was welcoming.
>
> But I just got back from the Reactive Summit, and this is what I observed:
>
> - Industry leaders on stage making fun of Spark's streaming model
> - Open source project leaders saying they looked at Spark's governance
> as a model to avoid
> - Users saying they chose Flink because it was technically superior
> and they couldn't get any answers on the Spark mailing lists
>
> Whether you agree with the substance of any of this, when this stuff
> gets repeated enough people will believe it.
>
> Right now Spark is suffering from its own success, and I think
> something needs to change.
>
> - We need a clear process for planning significant changes to the codebase.
> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> but you need a documented process with a clear outcome (e.g. a vote).
> Passing around google docs after an implementation has largely been
> decided on doesn't cut it.
>
> - All technical communication needs to be public.
> Things getting decided in private chat, or when 1/3 of the committers
> work for the same company and can just talk to each other...
> Yes, it's convenient, but it's ultimately detrimental to the health of
> the project.
> The way structured streaming has played out has shown that there are
> significant technical blind spots (myself included).
> One way to address that is to get the people who have domain knowledge
> involved, and listen to them.
>
> - We need more committers, and more committer diversity.
> Per committer there are, what, more than 20 contributors and 10 new
> jira tickets a month?  It's too much.
> There are people (I am _not_ referring to myself) who have been around
> for years, contributed thousands of lines of code, helped educate the
> public around Spark... and yet are never going to be voted in.
>
> - We need a clear process for managing volunteer work.
> Too many tickets sit around unowned, unclosed, uncertain.
> If someone proposed something and it isn't up to snuff, tell them and
> close it.  It may be blunt, but it's clearer than "silent no".
> If someone wants to work on something, let them own the ticket and set
> a deadline. If they don't meet it, close it or reassign it.
>
> This is not me putting on an Apache Bureaucracy hat.  This is me
> saying, as a fellow hacker and loyal dissenter, something is wrong
> with the culture and process.
>
> Please, let's change it.
>
