My sense on all this is that it should be decided on a case-by-case basis. For us to add a new API, it needs to be general enough that a lot of users will want to use it; if adding that API confuses users, that's a problem. On the flip side, however, even a function that isn't super popular may still be worth having if it's just 10-20 lines of code. The maintenance burden there is not too high, and users are used to fairly extensive collection libraries.
For the joins in particular, we added them because it's quite easy to mess up writing joins by hand, even once you have cogroup(). One thing we do want to do is start implementing more specialized functionality, like statistics functions, in separate libraries. Right now there are some functions in the RDD API (e.g. sums, means, histograms, etc.) that are fairly specific to this domain.

Matei
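To make the join point concrete: an inner join can be expressed on top of cogroup(), and the per-key cross product is the part that's easy to get wrong by hand. A minimal sketch in spark-shell style (the sample data and app name are made up, and this is not Spark's exact internal implementation):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

    val sc = new SparkContext("local", "join-via-cogroup")

    // Toy data: user id -> name, and user id -> order amount.
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, 9.99), (1, 24.50), (3, 5.00)))

    // Inner join built on cogroup(): keep only keys that appear on both sides,
    // emitting the full cross product of values for each key.
    val joined = users.cogroup(orders).flatMapValues { case (names, amounts) =>
      for (n <- names; a <- amounts) yield (n, a)
    }
    // joined: (1, ("alice", 9.99)) and (1, ("alice", 24.50)); keys 2 and 3 drop out.
    // A common hand-rolled mistake is to zip the two sides instead of taking the
    // cross product, which silently mispairs or drops rows when a key has
    // multiple values.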
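Similarly, the statistics helpers mentioned above are the Double-specific methods that currently sit directly on the RDD API via DoubleRDDFunctions. A small usage sketch, reusing the sc from the example above (the data is made up, and exact method availability depends on the Spark version):

    // Requires import org.apache.spark.SparkContext._ for the implicit
    // conversion of RDD[Double] to DoubleRDDFunctions.
    val xs = sc.parallelize(Seq(1.0, 2.0, 2.0, 3.0, 10.0))

    val total   = xs.sum()    // 18.0
    val average = xs.mean()   // 3.6
    val summary = xs.stats()  // count, mean, stdev, etc. computed in a single pass
    val (buckets, counts) = xs.histogram(3)  // 3 evenly spaced buckets over the value range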
On Feb 23, 2014, at 10:18 AM, Amandeep Khurana <ama...@gmail.com> wrote:

> This makes sense. Thanks for clarifying, Mridul.
>
> As Sean pointed out, a contrib module quickly turns into a legacy code base
> that becomes hard to maintain. From that perspective, I think the idea of a
> separate sparkbank GitHub repository that is maintained by Spark contributors
> (along with users who wish to contribute add-ons like you've described) and
> adheres to the same code quality and review standards as the main project
> seems appealing. And then not just sparkbank, but other things that people
> might want to have as part of the project but that don't belong in the core
> codebase, could go there? I don't know if things like this have come up in
> past pull requests.
>
> -Amandeep
>
> PS: I'm not a Spark committer/contributor, so take my opinion fwiw. :)
>
>
> On Sun, Feb 23, 2014 at 1:40 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
>
>> Good point, and I was purposefully vague on that, since that is something
>> our community should evolve, imo: this was just an initial proposal :-)
>>
>> For example: there are multiple ways to do cartesian, and each has its own
>> trade-offs.
>>
>> Another candidate could be, as I mentioned, new methods which can be
>> expressed as sequences of existing methods but would be slightly more
>> performant if done in one shot - like the self cartesian PR, various types
>> of join (which could become a contrib of its own, btw!), experiments using
>> key indexes, ordering, etc.
>>
>> Addition into sparkbank or contrib (or something better named!) does not
>> preclude future migration into core ... just an initial staging area for us
>> to evolve the api and get user feedback, without necessarily making the
>> spark core api unstable.
>>
>> Obviously, it is not a dumping ground for broken code/ideas ... and must
>> follow the same level of scrutiny and rigour before committing.
>>
>> Regards,
>> Mridul
>>
>> On Feb 23, 2014 11:53 AM, "Amandeep Khurana" <ama...@gmail.com> wrote:
>>
>>> Mridul,
>>>
>>> Can you give examples of APIs that people have contributed (or wanted
>>> to contribute) but that you would categorize as something that should go
>>> into a piggybank-like module (sparkbank)? Curious to know how you'd decide
>>> what should go where.
>>>
>>> Amandeep
>>>
>>>> On Feb 22, 2014, at 10:06 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Over the past few months, I have seen a bunch of pull requests which have
>>>> extended the spark api ... most commonly RDD itself.
>>>>
>>>> Most of them are either relatively niche cases of specialization (which
>>>> might not be useful in most cases) or idioms which can be expressed
>>>> (sometimes with a minor perf penalty) using the existing api.
>>>>
>>>> While all of them have non-zero value (hence the effort to contribute,
>>>> and gladly welcomed!), they extend the api in nontrivial ways and have a
>>>> maintenance cost ... and we already have a pending effort to clean up our
>>>> interfaces prior to 1.0.
>>>>
>>>> I believe there is a need to keep the exposed api succinct, expressive and
>>>> functional in spark, while at the same time encouraging extensions and
>>>> specialization within the spark codebase so that other users can benefit
>>>> from the shared contributions.
>>>>
>>>> One approach could be to start something akin to piggybank in pig to
>>>> contribute user-generated specializations, helper utils, etc: bundled as
>>>> part of spark, but not part of core itself.
>>>>
>>>> Thoughts, comments?
>>>>
>>>> Regards,
>>>> Mridul
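As a concrete instance of the "expressible with existing methods, but cheaper in one shot" category Mridul describes: a self-cartesian can already be written with the public cartesian() API, and a dedicated method would mainly exist to avoid redundant work. The helper below is a hypothetical sketch of the kind of thing a sparkbank-style module might hold, not an actual Spark method:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Hypothetical helper that could live in a contrib/sparkbank module rather
    // than on RDD itself: it is only a composition of existing operators.
    object PairUtils {
      /** All ordered pairs of elements of `rdd` (including each element with itself). */
      def selfCartesian[T: ClassTag](rdd: RDD[T]): RDD[(T, T)] =
        // Already expressible today; a specialized one-shot implementation could,
        // for example, avoid re-reading the same partition data for both sides
        // of each partition pair.
        rdd.cartesian(rdd)
    }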