Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".

I really like Spark. Ease of use, speed, a very good community - it has everything. But every project has to fight on the "framework market" to stay number one. I follow many Spark and Big Data communities, so maybe my mail will inspire someone :)

You (every Spark developer; so far I haven't had enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? Not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink marketing-like posts. Please look at the StackOverflow "Flink vs ...." posts: almost every one is "won" by Flink. The answers sometimes say nothing about the other frameworks; Flink users (often PMC members) just post the same information about real-time streaming, delta iterations, etc. It looks smart and very often it is marked as the answer, even if - in my opinion - the whole truth wasn't told.

My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying you're the most visible in the community :) ) could perform a performance test of:

- streaming engine - Spark will probably lose because of the micro-batch model, however the difference should now be much smaller than in previous versions
- Machine Learning models
- batch jobs
- Graph jobs
- SQL queries

People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".

Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability.
Performance tests done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster) could also be very good marketing material to say "hey, you're saying that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes and for every Spark developer.

Second: real-time streaming. Some time ago I wrote about real-time streaming support in Spark Structured Streaming. Some work needs to be done to make SSS lower-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts and comments saying that "Spark has too big latency". Spark Streaming does a very good job with micro-batches, but I think it is possible to also add more real-time processing.

Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMC members are not saying that they will not listen to users, but really want to make Spark better for every user.

What do you think about these two topics? I'm looking especially at Cody (who started this topic) and the PMC members :)

Best regards,
Tomasz

On 2016-10-07 at 04:51, Cody Koeninger wrote:
> I love Spark. 3 or 4 years ago it was the first distributed computing
> environment that felt usable, and the community was welcoming.
>
> But I just got back from the Reactive Summit, and this is what I observed:
>
> - Industry leaders on stage making fun of Spark's streaming model
> - Open source project leaders saying they looked at Spark's governance
> as a model to avoid
> - Users saying they chose Flink because it was technically superior
> and they couldn't get any answers on the Spark mailing lists
>
> Whether you agree with the substance of any of this, when this stuff
> gets repeated enough people will believe it.
>
> Right now Spark is suffering from its own success, and I think
> something needs to change.
>
> - We need a clear process for planning significant changes to the codebase.
> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> but you need a documented process with a clear outcome (e.g. a vote).
> Passing around google docs after an implementation has largely been
> decided on doesn't cut it.
>
> - All technical communication needs to be public.
> Things getting decided in private chat, or when 1/3 of the committers
> work for the same company and can just talk to each other...
> Yes, it's convenient, but it's ultimately detrimental to the health of
> the project.
> The way structured streaming has played out has shown that there are
> significant technical blind spots (myself included).
> One way to address that is to get the people who have domain knowledge
> involved, and listen to them.
>
> - We need more committers, and more committer diversity.
> Per committer there are, what, more than 20 contributors and 10 new
> jira tickets a month? It's too much.
> There are people (I am _not_ referring to myself) who have been around
> for years, contributed thousands of lines of code, helped educate the
> public around Spark... and yet are never going to be voted in.
>
> - We need a clear process for managing volunteer work.
> Too many tickets sit around unowned, unclosed, uncertain.
> If someone proposed something and it isn't up to snuff, tell them and
> close it. It may be blunt, but it's clearer than "silent no".
> If someone wants to work on something, let them own the ticket and set
> a deadline. If they don't meet it, close it or reassign it.
>
> This is not me putting on an Apache Bureaucracy hat. This is me
> saying, as a fellow hacker and loyal dissenter, something is wrong
> with the culture and process.
>
> Please, let's change it.