On Wed, Jun 21, 2017 at 10:37 PM, Stack <[email protected]> wrote:

> On Wed, Jun 21, 2017 at 5:26 PM, Andrew Purtell <[email protected]> wrote:
>
>> I seem to recall that what eventually was committed to master as
>> hbase-spark was first shopped to the Spark project, who felt the same, that
>> it should be hosted elsewhere.
>
> I have the same remembrance.
>
>> .... I would draw an analogy with
>> mapreduce: we had what we called 'first class' mapreduce integration, spark
>> is the alleged successor to mapreduce, we should evolve that support as
>> such. I'd like to know if that reasoning, or other rationale, is sufficient
>> at this time.
>
> Spark should be first-class on equal footing with MR if not more so (our MR
> integration is too tightly bound up with our internals and badly in need of
> untangling).
>
> Reading over the scope of work Sean outlines -- the variants, pom profiles,
> the module profusion, and the uncertainties -- makes me queasy about pulling
> it all in.
>
> I'm working on a little mini-hbase project at the mo to shade guava, etc.,
> and it is easy going. Made me think we could do a mini-project to host
> spark so we could contain it should it go up in flames.
>
> S
I think the current approach of keeping all the Spark-related stuff in a set of modules that we don't depend on for our other bits sufficiently isolates us from the risk of things blowing up. For example, when we're ready to build some of our admin tools on the Spark integration instead of MR, we can update them to use the Java ServiceLoader API (or a similar runtime loading mechanism) to avoid a direct compile-time dependency on the Spark artifacts.

It's true that we could put this into a different repo with its own release cycle, but I suspect that would lead to even more build pain, especially given that the Spark integration is likely to remain under active development for the foreseeable future and we'll want to package some version of it in our convenience binary assembly. Contrast that with our third-party dependencies, which tend to remain the same over relatively long timespans (e.g. a major version). If we end up voting on releases that cover a version from both this hypothetical hbase-spark repo and the main repo, what would we really have gained by splitting the two up?
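To make the runtime-loading idea concrete, here is a minimal sketch of the pattern I have in mind. All names here (BulkLoadBackend, SparkBulkLoadBackend) are hypothetical, not actual HBase classes: an admin tool codes against a small plugin interface and locates an implementation at runtime via java.util.ServiceLoader, so the Spark artifacts never appear on its compile-time classpath. Since a single-file sketch can't ship the META-INF/services registration a real module would, it falls back to reflective loading by class name.

```java
import java.util.ServiceLoader;

// Hypothetical plugin interface an admin tool would code against,
// keeping Spark types out of the tool's compile-time classpath.
interface BulkLoadBackend {
    String name();
}

// Hypothetical implementation; a real one would live in the hbase-spark
// modules and call into Spark APIs.
class SparkBulkLoadBackend implements BulkLoadBackend {
    public String name() { return "spark"; }
}

public class BackendLocator {
    // Look up an implementation at runtime. ServiceLoader discovers
    // providers listed under META-INF/services/ on the classpath; this
    // sketch ships no such resource, so it falls back to reflectively
    // instantiating the named class (another runtime-loading option).
    static BulkLoadBackend locate(String fallbackClass) throws Exception {
        for (BulkLoadBackend b : ServiceLoader.load(BulkLoadBackend.class)) {
            return b; // first registered provider wins
        }
        return (BulkLoadBackend) Class.forName(fallbackClass)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        BulkLoadBackend backend = locate("SparkBulkLoadBackend");
        System.out.println("loaded backend: " + backend.name());
    }
}
```

If the implementation lives in a separate module/jar, swapping it (or dropping it entirely) is a packaging decision rather than a code change in the tool.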
