On Saturday, July 12, 2014, Pat Ferrel <[email protected]> wrote:
> Why not put this argument to bed with a vote? Straw poll or not, it will
> make the consensus visible so we can get on with things. I know that many
> are on vacation now, but please take time to vote; we really need a large
> sample of active committers. Feel free to give a short defense of your
> position too. I further propose we keep this at the 1000 meter level and
> not start quoting code; let’s look at the forest instead of the trees.
>
> The choice as far as I can tell is:
>
> 1) merge the h2o implementation of math-scala and h2o modules into
> mainstream Mahout. I suppose this implies accepting h2o specific code too,
> though someone can contradict me here.
> 2) support h2o in integrating math and math-scala with their engine
> project (even as an artifact) and be welcoming and responsive with this
> support.
> 3) break the DSL into its own project, give it a name like Mahout-core,
> make all tests engine independent or live in other project code (like h2o
> or Flink). Then All-the-rest implements on Spark (the rest of Mahout), h2o,
> or Flink. This is the linux kernel approach, many distros but one kernel.
>

If one really wants to draw parallels to the Linux world, Linux itself is
a far better example, especially the VFS and the filesystems, with which I
am intimately familiar. You have a generic VFS which implements filesystem
semantics (a logical layer), and a bunch of filesystems which translate
them into on-disk or over-network operations, each in its own way
(physical layers). The kernel team faces the same kind of problems you
mention. There are experts in just their own filesystem who don't care how
another filesystem works, and VFS experts who don't care how any
filesystem is implemented internally. Yet they all work off the same
linux.git. Sometimes a change made in the VFS results in changes to all
filesystems, whose internals the author knows nothing about. Yet it all
works. It is a very similar story with device drivers. How does it work
there?
Because all developers and maintainers work together in a spirit of
collaboration. Of course device drivers are based on proprietary hardware
whose internals are not well known or published. Of course those device
drivers and filesystems could reside in separate projects, since the
kernel allows loadable modules. But why do they exist together? Because it
forces all components to stay in sync and not drift apart as API changes
are made. This in turn makes life much easier for the consumers of the
project, and that is the most important goal for a project.

You can argue and provide more reasons than what you already gave against
the merge, and vice versa. It finally boils down to the attitudes of the
project maintainers towards open source collaboration. I will accept the
verdict of the vote no matter what it is. It is your project after all.

Thanks

> I support #2. The reasons:
>
> 1) engine specific work should be done by the experts and work done on one
> engine should never affect work done on another.
> 2) math-scala is the closest thing to an engine independent thing we have
> but it is not complete. Changes to it will need to be negotiated and cannot
> be forced into a single commit as they would if breakage in h2o also breaks
> the build.
> 3) Every committer should not have to understand all engines. Currently
> work, outside the DSL or not, often requires additions to the DSL and also
> often requires the committers to pick an engine or design a new abstraction.
> This work of finding abstractions should not be forced into a single commit.
> 4) Mahout gets no known advantage by merging this PR. The alternative is
> that h2o merge it with their project. We still get the benefit of being (at
> least at the algebra / R-like API / DSL) a multi-engine project. In other
> words we have proven our stated desire to support other engines.
> 5) Be welcoming.
> Providing a key component with the optimizer and DSL
> (along with all future improvements) to any and all engines and agreeing to
> support it and jointly work to keep it core seems very supportive of the
> open source community and mentality. There are many ways to work together
> and some bad ones.
> 6) Keeping the engine work separated by project boundaries but supported
> by mutual PRs will be a much more maintainable and productive way to
> cooperate. This is the model of choice for most modern OSS projects,
> especially on Github. Git was made for this.
> 7) When Flink (Stratosphere) looks at cooperating with Mahout as they have
> already indicated, isn’t option #2 a much better way to deal with them too?
> Again the burden of integration should be with the engine, not Mahout. By
> merging h2o we would be committing to merging every other viable engine.
> It’s a slippery slope that the DSL alone may be able to pull off but not a
> core team supporting every engine.
>
> I don’t favor #3 because the DSL is not complete and Mahout Spark as its
> reference implementation should have the easiest path to modify it. Maybe
> some day this will be the better alternative.
>
> A word about bona fides. I’m one of a very small number of people to push
> Scala or Spark code. I’m working on ItemSimilarity and a framework for
> readers/writers for tuples and DRMs (text-delimited is the first) as well
> as the core cooccurrence, whose primary author was Sebastian. Plans include
> a revamp of the item-based recommenders based on earlier hadoop+mahout+solr
> work. My work is generally outside the DSL but has required several changes
> or additions to it.
