Let us continue to improve Apache Spark! I volunteer to go through all the SQL-related open JIRAs.
Xiao Li

2016-10-06 21:14 GMT-07:00 Matei Zaharia <matei.zaha...@gmail.com>:

> Hey Cody,
>
> Thanks for bringing these things up. You're talking about quite a few
> different things here, but let me get to them each in turn.
>
> 1) About technical / design discussion -- I fully agree that everything big
> should go through a lot of review, and I like the idea of a more formal way
> to propose and comment on larger features. So far, all of this has been done
> through JIRA, but as a start, maybe marking JIRAs as large (we often use
> Umbrella for this) and also opening a thread on the list about each such JIRA
> would help. For Structured Streaming in particular, FWIW, there was a pretty
> complete doc on the proposed semantics at
> https://issues.apache.org/jira/browse/SPARK-8360 since March. But it's true
> that other things such as the Kafka source for it didn't have as much design
> on JIRA. Nonetheless, this component is still early on and there's still a
> lot of time to change it, which is happening.
>
> 2) About what people say at Reactive Summit -- there will always be trolls,
> but just ignore them and build a great project. Those of us involved in the
> project for a while have long seen similar stuff, e.g. a prominent company
> saying Spark doesn't scale past 100 nodes when there were many documented
> instances to the contrary, and the best answer is to just make the project
> better. This same company, if you read their website now, recommends Apache
> Spark for most anything. For streaming in particular, there is a lot of
> confusion because many of the concepts aren't well-defined (e.g. what is "at
> least once", etc), and it's also a crowded space. But Spark Streaming
> prioritizes a few things that it does very well: correctness (you can easily
> tell what the app will do, and it does the same thing despite failures), ease
> of programming (which also requires correctness), and scalability.
> We should of course both explain what it does in more places and work on
> improving it where needed (e.g. adding a higher level API with Structured
> Streaming and built-in primitives for external timestamps).
>
> 3) About number and diversity of committers -- the PMC is always working to
> expand these, and you should email people on the PMC (or even the whole list)
> if you have people you'd like to propose. In general I think nearly all
> committers added in the past year were from organizations that haven't long
> been involved in Spark, and the number of committers continues to grow pretty
> fast.
>
> 4) Finally, about better organizing JIRA, marking dead issues, etc, this
> would be great and I think we just need a concrete proposal for how to do it.
> It would be best to point to an existing process that someone else has used
> here BTW so that we can see it in action.
>
> Matei
>
>> On Oct 6, 2016, at 7:51 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> I love Spark. 3 or 4 years ago it was the first distributed computing
>> environment that felt usable, and the community was welcoming.
>>
>> But I just got back from the Reactive Summit, and this is what I observed:
>>
>> - Industry leaders on stage making fun of Spark's streaming model
>> - Open source project leaders saying they looked at Spark's governance
>> as a model to avoid
>> - Users saying they chose Flink because it was technically superior
>> and they couldn't get any answers on the Spark mailing lists
>>
>> Whether you agree with the substance of any of this, when this stuff
>> gets repeated enough people will believe it.
>>
>> Right now Spark is suffering from its own success, and I think
>> something needs to change.
>>
>> - We need a clear process for planning significant changes to the codebase.
>> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>> but you need a documented process with a clear outcome (e.g. a vote).
>> Passing around google docs after an implementation has largely been
>> decided on doesn't cut it.
>>
>> - All technical communication needs to be public.
>> Things getting decided in private chat, or when 1/3 of the committers
>> work for the same company and can just talk to each other...
>> Yes, it's convenient, but it's ultimately detrimental to the health of
>> the project.
>> The way structured streaming has played out has shown that there are
>> significant technical blind spots (myself included).
>> One way to address that is to get the people who have domain knowledge
>> involved, and listen to them.
>>
>> - We need more committers, and more committer diversity.
>> Per committer there are, what, more than 20 contributors and 10 new
>> jira tickets a month? It's too much.
>> There are people (I am _not_ referring to myself) who have been around
>> for years, contributed thousands of lines of code, helped educate the
>> public around Spark... and yet are never going to be voted in.
>>
>> - We need a clear process for managing volunteer work.
>> Too many tickets sit around unowned, unclosed, uncertain.
>> If someone proposed something and it isn't up to snuff, tell them and
>> close it. It may be blunt, but it's clearer than "silent no".
>> If someone wants to work on something, let them own the ticket and set
>> a deadline. If they don't meet it, close it or reassign it.
>>
>> This is not me putting on an Apache Bureaucracy hat. This is me
>> saying, as a fellow hacker and loyal dissenter, something is wrong
>> with the culture and process.
>>
>> Please, let's change it.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org