Duplicated from a comment on the PR:

Beyond these details (specific merge issues), I have a bigger problem with 
merging this. Every time the DSL is changed, it may break things in 
h2o-specific code. Merging this would require every committer who might touch 
the DSL to sign up for fixing any broken tests on both engines. 

To solve this, the entire data prep pipeline must be virtualized to run on 
either engine, so that the tests for things like CF, ItemSimilarity, or matrix 
factorization (and the multitude of others to come) pass and are engine 
independent. As it stands, any DSL change that breaks the build will have to 
rely on a contributor's fix. Even if one of you guys were made a committer, we 
would still have this problem, where a needed change breaks one or the other 
engine-specific code. Unless 99% of the entire pipeline is engine neutral, the 
build will be unmaintainable.

For instance, I am making a small DSL change that is required for cooccurrence 
and ItemSimilarity to work. This would break ItemSimilarity and its tests, 
which are in the spark module, but since I’m working on that I can fix 
everything. If someone working on an h2o-specific thing had to change the DSL 
in a way that broke spark code like ItemSimilarity, you might not be able to 
fix it, and I certainly do not want to fix stuff in h2o-specific code when I 
change the DSL. I have a hard enough time keeping mine running :-) 

Crudely speaking, this means doing away with all references to a SparkContext 
and any use of it. So it's not just a matter of reproducing the spark module 
but of reducing the need for one: making it so small that breakages in one or 
the other engine's code will be infrequent, and changes to neutral code will 
only rarely break an engine that the committer is unfamiliar with.
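
As a rough illustration of what "engine neutral" could look like, here is a 
minimal sketch. All of the names in it (EngineContext, InMemoryContext, 
NeutralPipeline) are hypothetical and are not Mahout's actual API; the 
in-memory backend only stands in for a real engine so the example runs 
without a cluster. The point is that the pipeline code and its tests see only 
the neutral trait, never a SparkContext or an h2o type.

```scala
// Hypothetical names throughout; this is an assumption-laden sketch,
// not Mahout's API.

// Engine-neutral handle: pipeline code depends only on this trait.
trait EngineContext {
  def engineName: String
  // Read (userId, itemId) interaction pairs from some source.
  def readInteractions(path: String): Seq[(String, String)]
}

// Stand-in backend so the sketch runs locally. A spark or h2o backend
// would implement the same trait inside its own module.
class InMemoryContext(data: Map[String, Seq[(String, String)]])
    extends EngineContext {
  val engineName = "in-memory"
  def readInteractions(path: String): Seq[(String, String)] =
    data.getOrElse(path, Seq.empty)
}

object NeutralPipeline {
  // For each pair of distinct items, count how many users touched both.
  // Nothing here refers to a specific engine.
  def cooccurrenceCounts(path: String)(
      implicit ctx: EngineContext): Map[(String, String), Int] = {
    // One sorted, de-duplicated item list per user.
    val itemsByUser = ctx.readInteractions(path)
      .groupBy(_._1)
      .values
      .map(_.map(_._2).distinct.sorted)
    // Emit each unordered item pair once per user, then count.
    val pairs = for {
      items <- itemsByUser.toSeq
      a <- items
      b <- items
      if a < b
    } yield (a, b)
    pairs.groupBy(identity).map { case (k, v) => k -> v.size }
  }
}
```

With something along these lines, a test for cooccurrence would be written 
once against EngineContext and run unchanged against each backend, so a DSL 
change would mostly break neutral code that any committer can fix.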

I raised this red flag a long time ago but in the heat of other issues it got 
lost. I don't think this can be ignored anymore.

I would propose that we should remain two separate projects with a mostly 
shared DSL until the maintainability issues are resolved. This seems way too 
early to merge.


On Jul 11, 2014, at 2:40 AM, Anand Avati <[email protected]> wrote:

Hi all,
The H2O integration is now feature complete to date and is ready for
final review. All the test cases are passing. The pull request
https://github.com/apache/mahout/pull/21 has been updated with the latest
code. Please treat this PR as a candidate for merge.

I have written a brief document on how to set up and use/test the
integration at
https://github.com/avati/mahout/blob/MAHOUT-1500/h2o/README.md. That
includes instructions to test in both local and distributed mode.

I would really appreciate it if folks could review the work and provide
feedback and suggest next steps.

Thanks,
Avati
