Re: Spark Improvement Proposals

Xiao Li Thu, 06 Oct 2016 21:54:07 -0700

Let us continue to improve Apache Spark!

I volunteer to go through all the SQL-related open JIRAs.


Xiao Li

2016-10-06 21:14 GMT-07:00 Matei Zaharia <[email protected]>:
> Hey Cody,
>
> Thanks for bringing these things up. You're talking about quite a few 
> different things here, but let me get to them each in turn.
>
> 1) About technical / design discussion -- I fully agree that everything big 
> should go through a lot of review, and I like the idea of a more formal way 
> to propose and comment on larger features. So far, all of this has been done 
> through JIRA, but as a start, maybe marking JIRAs as large (we often use 
> Umbrella for this) and also opening a thread on the list about each such JIRA 
> would help. For Structured Streaming in particular, FWIW, there was a pretty 
> complete doc on the proposed semantics at 
> https://issues.apache.org/jira/browse/SPARK-8360 since March. But it's true 
> that other things such as the Kafka source for it didn't have as much design 
> on JIRA. Nonetheless, this component is still early on and there's still a 
> lot of time to change it, which is happening.
>
> 2) About what people say at Reactive Summit -- there will always be trolls, 
> but just ignore them and build a great project. Those of us involved in the 
> project for a while have long seen similar stuff, e.g. a prominent company 
> saying Spark doesn't scale past 100 nodes when there were many documented 
> instances to the contrary, and the best answer is to just make the project 
> better. This same company, if you read their website now, recommends Apache 
> Spark for most anything. For streaming in particular, there is a lot of 
> confusion because many of the concepts aren't well-defined (e.g. what is "at 
> least once", etc), and it's also a crowded space. But Spark Streaming 
> prioritizes a few things that it does very well: correctness (you can easily 
> tell what the app will do, and it does the same thing despite failures), ease 
> of programming (which also requires correctness), and scalability. We should 
> of course both explain what it does in more places and work on improving it 
> where needed (e.g. adding a higher level API with Structured Streaming and 
> built-in primitives for external timestamps).
>
> 3) About number and diversity of committers -- the PMC is always working to 
> expand these, and you should email people on the PMC (or even the whole list) 
> if you have people you'd like to propose. In general I think nearly all 
> committers added in the past year were from organizations that haven't long 
> been involved in Spark, and the number of committers continues to grow pretty 
> fast.
>
> 4) Finally, about better organizing JIRA, marking dead issues, etc, this 
> would be great and I think we just need a concrete proposal for how to do it. 
> It would be best to point to an existing process that someone else has used 
> here BTW so that we can see it in action.
>
> Matei
>
>> On Oct 6, 2016, at 7:51 PM, Cody Koeninger <[email protected]> wrote:
>>
>> I love Spark.  3 or 4 years ago it was the first distributed computing
>> environment that felt usable, and the community was welcoming.
>>
>> But I just got back from the Reactive Summit, and this is what I observed:
>>
>> - Industry leaders on stage making fun of Spark's streaming model
>> - Open source project leaders saying they looked at Spark's governance
>> as a model to avoid
>> - Users saying they chose Flink because it was technically superior
>> and they couldn't get any answers on the Spark mailing lists
>>
>> Whether you agree with the substance of any of this, when this stuff
>> gets repeated enough people will believe it.
>>
>> Right now Spark is suffering from its own success, and I think
>> something needs to change.
>>
>> - We need a clear process for planning significant changes to the codebase.
>> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>> but you need a documented process with a clear outcome (e.g. a vote).
>> Passing around Google Docs after an implementation has largely been
>> decided on doesn't cut it.
>>
>> - All technical communication needs to be public.
>> Things getting decided in private chat, or when 1/3 of the committers
>> work for the same company and can just talk to each other...
>> Yes, it's convenient, but it's ultimately detrimental to the health of
>> the project.
>> The way structured streaming has played out has shown that there are
>> significant technical blind spots (myself included).
>> One way to address that is to get the people who have domain knowledge
>> involved, and listen to them.
>>
>> - We need more committers, and more committer diversity.
>> Per committer there are, what, more than 20 contributors and 10 new
>> JIRA tickets a month?  It's too much.
>> There are people (I am _not_ referring to myself) who have been around
>> for years, contributed thousands of lines of code, helped educate the
>> public around Spark... and yet are never going to be voted in.
>>
>> - We need a clear process for managing volunteer work.
>> Too many tickets sit around unowned, unclosed, uncertain.
>> If someone proposed something and it isn't up to snuff, tell them and
>> close it.  It may be blunt, but it's clearer than "silent no".
>> If someone wants to work on something, let them own the ticket and set
>> a deadline. If they don't meet it, close it or reassign it.
>>
>> This is not me putting on an Apache Bureaucracy hat.  This is me
>> saying, as a fellow hacker and loyal dissenter, something is wrong
>> with the culture and process.
>>
>> Please, let's change it.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: [email protected]
>>
>

