+1 on all counts (consensus, time bound, define roles). I can update the doc in the next few days and share it back. Then maybe we can just officially vote on this. As Tim suggested, we might not get it 100% right the first time and would need to iterate. But that's fine.
On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com> wrote:
> Hi Cody,
> thank you for bringing up this topic; I agree it is very important to keep a cohesive community around some common, fluid goals. Here are a few comments about the current document:
>
> 1. name: it should not overlap with an existing one such as SIP. Can you imagine someone trying to discuss a Scala spore proposal for Spark? "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP sounds great.
>
> 2. roles: at a high level, SPIPs are meant to reach consensus on technical decisions with a lasting impact. As such, the template should emphasize the roles of the various parties during this process:
>
> - the SPIP author is responsible for building consensus. She is the champion driving the process forward and is responsible for ensuring that the SPIP follows the general guidelines. The author should be identified in the SPIP. The authorship of a SPIP can be transferred if the current author is not interested and someone else wants to move the SPIP forward. There should probably be 2-3 authors at most for each SPIP.
>
> - someone with voting power should probably shepherd the SPIP (and be recorded as such): ensuring that the final decision on the SPIP is recorded (rejected, accepted, etc.), and advising on the technical quality of the SPIP. This person need not be a champion for the SPIP or contribute to it, but rather makes sure it stands a chance of being approved when the vote happens. Also, if the author cannot find anyone who wants to take this role, the proposal is likely to be rejected anyway.
>
> - users, committers, and contributors have the roles already outlined in the document.
>
> 3. timeline: ideally, once a SPIP has been offered for voting, it should move swiftly into either being accepted or rejected, so that we do not end up with a distracting long tail of half-hearted proposals.
>
> These rules are meant to be flexible, but the current document should be clear about who is in charge of a SPIP and the state it is currently in.
>
> We have had long discussions over some very important questions such as approval. I do not have an opinion on these, but why not make a pick and reevaluate the decision later? This is not a binding process at this point.
>
> Tim
>
> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org> wrote:
>> I don't have a concern about voting vs. consensus.
>>
>> I have a concern that whatever the decision-making process is, it is explicitly announced on the ticket for the given proposal, with an explicit deadline and an explicit outcome.
>>
>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com> wrote:
>>> I'm also in favor of this. Thanks for your persistence, Cody.
>>>
>>> My take on the specific issues Joseph mentioned:
>>>
>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made earlier for consensus:
>>>
>>> > Majority vs consensus: My rationale is that I don't think we want to consider a proposal approved if it had objections serious enough that committers down-voted (or PMC depending on who gets a vote). If these proposals are like PEPs, then they represent a significant amount of community effort and I wouldn't want to move forward if up to half of the community thinks it's an untenable idea.
>>>
>>> 2) Design doc template -- agree this would be useful, but it also seems totally orthogonal to moving forward on the SIP proposal.
>>>
>>> 3) agree w/ Joseph's proposal for updating the template.
>>>
>>> One small addition:
>>>
>>> 4) Deciding on a name -- minor, but I think it's worth disambiguating from Scala's SIPs, and the best proposal I've heard is "SPIP". At least, no one has objected. (I don't care enough that I'd object to anything else, though.)
>>>
>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>>> Hi Cody,
>>>>
>>>> Thanks for being persistent about this. I too would like to see this happen. Reviewing the thread, it sounds like the main things remaining are:
>>>> * Decide about a few issues
>>>> * Finalize the doc(s)
>>>> * Vote on this proposal
>>>>
>>>> Issues & TODOs:
>>>>
>>>> (1) The main issue I see above is voting vs. consensus. I have little preference here. It sounds like something which could be tailored based on whether we see too many or too few SIPs being approved.
>>>>
>>>> (2) Design doc template (This would be great to have for Spark regardless of this SIP discussion.)
>>>> * Reynold, are you still putting this together?
>>>>
>>>> (3) Template cleanups. Listing some items mentioned above + a new one w.r.t. Reynold's draft <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>:
>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>> * Add a field for stating explicit deadlines for approval
>>>> * Add a field for stating the Author & Committer shepherd
>>>>
>>>> Thanks all!
>>>> Joseph
>>>>
>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>> I'm bumping this one more time for the new year, and then I'm giving up.
>>>>>
>>>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>>>
>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>> > On lazy consensus as opposed to voting:
>>>>> >
>>>>> > First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?
>>>>> >
>>>>> > Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?
>>>>> >
>>>>> > rb
>>>>> >
>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>> >> So there are some minor things (the Where section heading appears to be dropped; wherever this document is posted it needs to actually link to a JIRA filter showing current / past SIPs), but it doesn't look like I can comment on the Google doc.
>>>>> >>
>>>>> >> The major substantive issue that I have is that this version is significantly less clear as to the outcome of an SIP.
>>>>> >>
>>>>> >> The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement of an explicit deadline, which I think are necessary for clarity.
>>>>> >>
>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>> >> > It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
>>>>> >> >
>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>> >> >> Oops. Let me try to figure that out.
>>>>> >> >>
>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>>>>> >> >>> Thanks for picking up on this.
>>>>> >> >>>
>>>>> >> >>> Maybe I fail at Google Docs, but I can't see any edits on the document you linked.
>>>>> >> >>>
>>>>> >> >>> Regarding lazy consensus, if the board in general has less of an issue with that, sure.
>>>>> >> >>> As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
>>>>> >> >>>
>>>>> >> >>> The other points are hard to comment on without being able to see the text in question.
>>>>> >> >>>
>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>> >> >>> > I just looked through the entire thread again tonight - there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.
>>>>> >> >>> >
>>>>> >> >>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
>>>>> >> >>> >
>>>>> >> >>> > To that end, the two biggest areas for improvement in my opinion are:
>>>>> >> >>> >
>>>>> >> >>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people that don't follow closely, it is difficult to know what the important initiatives are. Even for people that do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
>>>>> >> >>> >
>>>>> >> >>> > 2.
>>>>> >> >>> > Solicit user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
>>>>> >> >>> >
>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>> >> >>> >
>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>>>> >> >>> > (I've made all my modifications trackable.)
>>>>> >> >>> >
>>>>> >> >>> > There are a couple of high-level changes I made:
>>>>> >> >>> >
>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting, the reason being that in voting there can easily be a "loser" that gets outvoted.
>>>>> >> >>> >
>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototype implementations in PRs is not something that has worked so far".
>>>>> >> >>> >
>>>>> >> >>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
>>>>> >> >>> >
>>>>> >> >>> > While I was editing this, I thought we really needed a suggested template for design docs too. I will get to that too ...
>>>>> >> >>> >
>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
>>>>> >> >>> >>
>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
>>>>> >> >>> >>>
>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
>>>>> >> >>> >>>
>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>> >> >>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with the benchmarks was to show two things:
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> - how - in an easy way - we can change that and show that Spark is still on top
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the chart "Spark vs Hadoop". It is important to show that the framework is not the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> About real-time streaming, I think it would just be good to see it in Spark. I really like the current Spark model, but many voices say "we need more" - the community should also listen to them and try to help them.
>>>>> >> >>> >>> >> With SIPs it would be easier; I've just posted this example as a "thing that may be changed with a SIP".
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I really like the unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with a strong background (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, so from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Tomasz
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> ________________________________
>>>>> >> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>>>>> >> >>> >>> >> Sent: October 17, 2016, 16:46
>>>>> >> >>> >>> >> To: Debasish Das
>>>>> >> >>> >>> >> CC: Tomasz Gawęda; dev@spark.apache.org
>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> My point is evolve or die.
>>>>> >> >>> >>> >> Spark's governance and organization is hampering its ability to evolve technologically, and it needs to change.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun...But now as we went deeper with Spark and the real-time streaming use-case gets more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at....akka-streams' close integration with Spark micro-batching APIs looks like a great direction to stay in the game with Apache Flink...Spark 2.0 integrated streaming with batch with the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals so that more people from the community start to take an active role in improving the issues, so that Spark stays strong compared to Flink.
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> Spark is no longer an engine that works only for micro-batch and batch...We (and I am sure many others) are pushing Spark as an engine for stream and query processing.....we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>>>> >> >>> >>> >>>> Hi everyone,
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit.
>>>>> >> >>> >>> >>>> :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community - it's all here. But every project has to "fight" on the "framework market" to stay number 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time to join in contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, like was posted on this mailing list? No, not because that framework is better in all cases. In my opinion, many of these discussions were started after Flink marketing-like posts. Please look at the StackOverflow "Flink vs ...." posts; almost every post is "won" by Flink.
>>>>> >> >>> >>> >>>> Answers sometimes say nothing about other frameworks; Flink's users (often PMCs) just post the same information about real-time streaming, about delta iterations, etc. It looks smart, and very often it is marked as the answer, even if - in my opinion - the whole truth wasn't told.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying you're the most visible in the community :) ) could perform a performance test of:
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - the streaming engine - Spark will probably lose because of the mini-batch model; however, currently the difference should be much lower than in previous versions
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - batch jobs
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - Graph jobs
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - SQL queries
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Second: real-time streaming. I wrote some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too big latency".
>>>>> >> >>> >>> >>>> Spark Streaming does a very good job with micro-batches; however, I think it is possible to also add more real-time processing.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMCs are not saying that they will not listen to users, but that they really want to make Spark better for every user.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> What do you think about these two topics? I'm especially looking at Cody (who started this topic) and the PMCs :)
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Tomasz
>>>>> >>
>>>>> >> ---------------------------------------------------------------------
>>>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> >
>>>>> > --
>>>>> > Ryan Blue
>>>>> > Software Engineer
>>>>> > Netflix
>>>>
>>>> --
>>>> Joseph Bradley
>>>> Software Engineer - Machine Learning
>>>> Databricks, Inc.
>>>> http://databricks.com