I agree, we should push forward on this. I think there is enough consensus to call a vote, unless someone else thinks that there is more to discuss?
rb

On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
> Now that Spark Summit Europe is over, are any committers interested in
> moving forward with this?
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>
> Or are we going to let this discussion die on the vine?
>
> On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> <tomasz.gaw...@outlook.com> wrote:
> > Maybe my mail was not clear enough.
> >
> > I didn't want to write "let's focus on Flink" or any other framework.
> > The idea with the benchmarks was to show two things:
> >
> > - why some people are doing bad PR for Spark
> >
> > - how, in an easy way, we can change that and show that Spark is
> > still on top
> >
> > No more, no less. Benchmarks will be helpful, but I don't think
> > they're the most important thing in Spark :) On the Spark main page
> > there is still the "Spark vs Hadoop" chart. It is important to show
> > that the framework is not the same Spark with just another API, but
> > much faster and more optimized, comparable to or even faster than
> > other frameworks.
> >
> > About real-time streaming: I think it would be good to see it in
> > Spark. I really like the current Spark model, but many voices say
> > "we need more", and the community should also listen to them and try
> > to help. With SIPs that would be easier; I just posted this example
> > as a "thing that may be changed with a SIP".
> >
> > I really like the unification via Datasets, but there are a lot of
> > algorithms inside - let's make an easy API, but with a strong
> > background (articles, benchmarks, descriptions, etc.) that shows
> > that Spark is still a modern framework.
> > Maybe now my intention will be clearer :) As I said, the
> > organizational ideas were already mentioned and I agree with them;
> > my mail was just to show some aspects from my side, the side of a
> > developer and a person who is trying to help others with Spark (via
> > StackOverflow or other ways).
> >
> > Pozdrawiam / Best regards,
> >
> > Tomasz
> >
> > ________________________________
> > From: Cody Koeninger <c...@koeninger.org>
> > Sent: October 17, 2016 16:46
> > To: Debasish Das
> > Cc: Tomasz Gawęda; dev@spark.apache.org
> > Subject: Re: Spark Improvement Proposals
> >
> > I think narrowly focusing on Flink or benchmarks is missing my point.
> >
> > My point is evolve or die. Spark's governance and organization are
> > hampering its ability to evolve technologically, and that needs to
> > change.
> >
> > On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
> > <debasish.da...@gmail.com> wrote:
> >> Thanks Cody for bringing up a valid point. I picked up Spark in 2014
> >> as soon as I looked into it, since compared to writing Java
> >> map-reduce and Cascading code, Spark made writing distributed code
> >> fun. But now that we have gone deeper with Spark and the real-time
> >> streaming use case is getting more prominent, I think it is time to
> >> bring a messaging model in conjunction with the batch/micro-batch
> >> API that Spark is good at. Close integration of akka-streams with
> >> Spark's micro-batching APIs looks like a great direction to stay in
> >> the game with Apache Flink. Spark 2.0 integrated streaming with
> >> batch under the assumption that micro-batching is sufficient to run
> >> SQL commands on a stream, but do we really have time to do SQL
> >> processing on streaming data within 1-2 seconds?
> >>
> >> After reading the email chain, I started to look into the Flink
> >> documentation, and if you compare it with the Spark documentation, I
> >> think we have major work to do detailing Spark internals, so that
> >> more people from the community start to take an active role in
> >> improving things and Spark stays strong compared to Flink.
> >>
> >> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>
> >> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>
> >> Spark is no longer an engine that works only for micro-batch and
> >> batch. We (and I am sure many others) are pushing Spark as an engine
> >> for stream and query processing. We need to make it a
> >> state-of-the-art engine for high-speed streaming data and user
> >> queries as well!
> >>
> >> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >> <tomasz.gaw...@outlook.com> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> I'm quite late with my answer, but I think my suggestions may help
> >>> a little bit. :) Many technical and organizational topics were
> >>> mentioned, but I want to focus on the negative posts about Spark
> >>> and about "haters".
> >>>
> >>> I really like Spark. Ease of use, speed, a very good community -
> >>> it's all here. But every project has to fight on the "framework
> >>> market" to stay number one. I'm following many Spark and Big Data
> >>> communities; maybe my mail will inspire someone :)
> >>>
> >>> You (every Spark developer; so far I haven't had enough time to
> >>> start contributing to Spark) have done an excellent job. So why are
> >>> some people saying that Flink (or another framework) is better, as
> >>> was posted on this mailing list? Not because that framework is
> >>> better in all cases. In my opinion, many of these discussions were
> >>> started after Flink's marketing-like posts. Please look at the
> >>> StackOverflow "Flink vs ..." posts: almost every one is "won" by
> >>> Flink.
> >>> The answers sometimes say nothing about other frameworks;
> >>> Flink's users (often PMC members) just post the same information
> >>> about real-time streaming, delta iterations, etc. It looks smart
> >>> and is very often marked as the answer, even if - in my opinion -
> >>> it doesn't tell the whole truth.
> >>>
> >>> My suggestion: I don't have enough money or knowledge to perform a
> >>> large performance test. Maybe some company that supports Spark
> >>> (Databricks, Cloudera? - just saying, you're the most visible in
> >>> the community :) ) could run performance tests of:
> >>>
> >>> - the streaming engine - Spark will probably lose because of the
> >>> micro-batch model, though the difference should now be much smaller
> >>> than in previous versions
> >>>
> >>> - Machine Learning models
> >>>
> >>> - batch jobs
> >>>
> >>> - graph jobs
> >>>
> >>> - SQL queries
> >>>
> >>> People will see that Spark is evolving and is still a modern
> >>> framework, because after reading the posts mentioned above people
> >>> may think "it is outdated, the future is in framework X".
> >>>
> >>> Matei Zaharia posted an excellent blog post about how Spark
> >>> Structured Streaming beats every other framework in terms of ease
> >>> of use and reliability. Performance tests, done in various
> >>> environments (for example: a laptop, a small 2-node cluster, a
> >>> 10-node cluster, a 20-node cluster), could also be very good
> >>> marketing material: "hey, you say you're better, but Spark is still
> >>> faster and is getting even faster!". This would be based on facts
> >>> (just numbers), not opinions. It would be good for companies, for
> >>> marketing purposes, and for every Spark developer.
> >>>
> >>> Second: real-time streaming. I wrote some time ago about real-time
> >>> streaming support in Spark Structured Streaming. Some work would be
> >>> needed to make SSS lower-latency, but I think it's possible.
> >>> Maybe Spark could look at Gearpump, which is also built on top of
> >>> Akka? I don't know yet; it's a good topic for a SIP. However, I
> >>> think that Spark should have real-time streaming support. Currently
> >>> I see many posts/comments saying "Spark has too much latency".
> >>> Spark Streaming is doing a very good job with micro-batches, but I
> >>> think it is possible to add more real-time processing as well.
> >>>
> >>> Other people have said much more, and I agree with the SIP
> >>> proposal. I'm also happy that the PMC members are not saying they
> >>> won't listen to users, but that they really want to make Spark
> >>> better for every user.
> >>>
> >>> What do you think about these two topics? I'm especially looking at
> >>> Cody (who started this thread) and the PMC :)
> >>>
> >>> Pozdrawiam / Best regards,
> >>>
> >>> Tomasz
> >>>
> >>> On 2016-10-07 at 04:51, Cody Koeninger wrote:
> >>> > I love Spark. 3 or 4 years ago it was the first distributed
> >>> > computing environment that felt usable, and the community was
> >>> > welcoming.
> >>> >
> >>> > But I just got back from the Reactive Summit, and this is what I
> >>> > observed:
> >>> >
> >>> > - Industry leaders on stage making fun of Spark's streaming model
> >>> > - Open source project leaders saying they looked at Spark's
> >>> > governance as a model to avoid
> >>> > - Users saying they chose Flink because it was technically
> >>> > superior and they couldn't get any answers on the Spark mailing
> >>> > lists
> >>> >
> >>> > Whether you agree with the substance of any of this, when this
> >>> > stuff gets repeated enough, people will believe it.
> >>> >
> >>> > Right now Spark is suffering from its own success, and I think
> >>> > something needs to change.
> >>> >
> >>> > - We need a clear process for planning significant changes to the
> >>> > codebase.
> >>> > I'm not saying you need to adopt Kafka Improvement Proposals
> >>> > exactly, but you need a documented process with a clear outcome
> >>> > (e.g. a vote).
> >>> > Passing around Google docs after an implementation has largely
> >>> > been decided on doesn't cut it.
> >>> >
> >>> > - All technical communication needs to be public.
> >>> > Things getting decided in private chat, or when 1/3 of the
> >>> > committers work for the same company and can just talk to each
> >>> > other... Yes, it's convenient, but it's ultimately detrimental to
> >>> > the health of the project.
> >>> > The way structured streaming has played out has shown that there
> >>> > are significant technical blind spots (myself included).
> >>> > One way to address that is to get the people who have domain
> >>> > knowledge involved, and listen to them.
> >>> >
> >>> > - We need more committers, and more committer diversity.
> >>> > Per committer there are, what, more than 20 contributors and 10
> >>> > new jira tickets a month? It's too much.
> >>> > There are people (I am _not_ referring to myself) who have been
> >>> > around for years, contributed thousands of lines of code, helped
> >>> > educate the public around Spark... and yet are never going to be
> >>> > voted in.
> >>> >
> >>> > - We need a clear process for managing volunteer work.
> >>> > Too many tickets sit around unowned, unclosed, uncertain.
> >>> > If someone proposes something and it isn't up to snuff, tell them
> >>> > and close it. It may be blunt, but it's clearer than a "silent
> >>> > no".
> >>> > If someone wants to work on something, let them own the ticket
> >>> > and set a deadline. If they don't meet it, close it or reassign
> >>> > it.
> >>> >
> >>> > This is not me putting on an Apache Bureaucracy hat. This is me
> >>> > saying, as a fellow hacker and loyal dissenter, that something is
> >>> > wrong with the culture and process.
> >>> >
> >>> > Please, let's change it.
> >>> >
> >>> > ---------------------------------------------------------------------
> >>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >
> >>
> >>
> >
> >

--
Ryan Blue
Software Engineer
Netflix