Duplicated from a comment on the PR:

Beyond these details (specific merge issues), I have a bigger problem with merging this. Every time the DSL is changed, it may now break things in h2o-specific code. Merging this would require every committer who might touch the DSL to sign up for fixing any broken tests on both engines.
To solve this, the entire data prep pipeline must be virtualized to run on either engine, so that the tests for things like CF, ItemSimilarity, or matrix factorization (and the multitude of others to come) pass and are engine independent. As it stands, any DSL change that breaks the build will have to rely on a contributor's fix. Even if one of you guys were made a committer, we would still have this problem where a needed change breaks one or the other engine's specific code. Unless 99% of the entire pipeline is engine neutral, the build will be unmaintainable.

For instance, I am making a small DSL change that is required for cooccurrence and ItemSimilarity to work. This would break ItemSimilarity and its tests, which are in the spark module, but since I’m working on that I can fix everything. If someone working on an h2o-specific thing had to change the DSL in a way that broke spark code like ItemSimilarity, you might not be able to fix it, and I certainly do not want to fix stuff in h2o-specific code when I change the DSL. I have a hard enough time keeping mine running :-)

Crudely speaking, this means doing away with all references to a SparkContext and any use of it. So it's not just a matter of reproducing the spark module but of reducing the need for one: making it so small that breakages in one or the other engine's code will be infrequent, and changes to neutral code will only rarely break an engine that the committer is unfamiliar with.

I raised this red flag a long time ago, but in the heat of other issues it got lost. I don't think this can be ignored anymore.

I would propose that we remain two separate projects with a mostly shared DSL until the maintainability issues are resolved. This seems way too early to merge.

On Jul 11, 2014, at 2:40 AM, Anand Avati <[email protected]> wrote:

Hi all,

The H2O integration is now feature complete to date and is ready for final review. All the test cases are passing.
The pull request https://github.com/apache/mahout/pull/21 has been updated with the latest code. Please treat this PR as a candidate for merge. I have written a brief document on how to set up and use/test the integration at https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/README.md. That includes instructions to test in both local and distributed mode.

I would really appreciate it if folks could review the work and provide feedback and the next steps.

Thanks,
Avati
