[
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058934#comment-14058934
]
ASF GitHub Bot commented on MAHOUT-1500:
----------------------------------------
Github user pferrel commented on the pull request:
https://github.com/apache/mahout/pull/21#issuecomment-48749896
Are the scalatests from the Spark module that cover math-scala code
implemented here somewhere? I'd vote against merging until those are all
in place and passing.
The CF stuff has a rather major bug that I'm working on, so I wouldn't move
this into math-scala just yet, although it would make an interesting speed
comparison once completed. The CF changes will require DSL additions that will
be under separate review. I don't have a PR number yet.
Also, I may have missed it, but there should be clear instructions for how to
build this and run it. This is like a heart transplant. Before you release the
patient, make sure all systems are working correctly; the DSL is not the whole
body. There should at least be some end-to-end pipelines in examples that
anyone can run from a local installation.
Beyond these details I have a bigger issue with merging this. Now every
time the DSL is changed it may break things in H2O-specific code. It already
does in CF, for instance, but I've signed up to fix those for Spark. No committer
has signed up to fix code in both Spark and H2O. IMO this is untenable.
To solve this, the entire data prep pipeline must be virtualized to run on
either engine so the tests for things like CF and ItemSimilarity (and the
multitude of others to come) pass and are engine independent. As it stands, any
DSL change that breaks the build will have to rely on a contributor's fix. Even
if one of you guys were made a committer, we would still have this problem where
a needed change breaks one or the other engine-specific code. Unless 99% of the
entire pipeline is engine neutral, the build will be unmaintainable.
Crudely speaking, this means doing away with all references to a
SparkContext and any use of it. So it's not just a matter of reproducing the
Spark module but of reducing the need for one, making it so small that breakages
in one or the other engine's code will be infrequent.
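
As a rough illustration of the point above, here is a minimal sketch of
pipeline code written against an engine-neutral handle rather than a
SparkContext. Every name in it (`EngineContext`, `EngineMatrix`,
`SparkLikeContext`, `H2OLikeContext`, `Pipeline`) is hypothetical and is not an
existing Mahout, Spark, or H2O API; both "engines" are in-memory stand-ins so
the sketch runs without either backend.

```scala
// Hypothetical sketch only: none of these names exist in Mahout, Spark, or H2O.

// Engine-neutral handle that pipeline code depends on instead of a SparkContext.
trait EngineContext {
  def engineName: String
  // Turn (rowKey, values) pairs into an engine-backed matrix.
  def parallelize(rows: Seq[(String, Vector[Double])]): EngineMatrix
}

// Engine-neutral view of a row-keyed matrix.
trait EngineMatrix {
  def rowSums: Map[String, Double]
}

// In-memory stand-in used by both stub engines below.
final class LocalMatrix(rows: Seq[(String, Vector[Double])]) extends EngineMatrix {
  def rowSums: Map[String, Double] =
    rows.map { case (key, values) => key -> values.sum }.toMap
}

// Each engine module would shrink to a thin adapter like these stubs.
object SparkLikeContext extends EngineContext {
  val engineName = "spark-like"
  def parallelize(rows: Seq[(String, Vector[Double])]): EngineMatrix =
    new LocalMatrix(rows)
}

object H2OLikeContext extends EngineContext {
  val engineName = "h2o-like"
  def parallelize(rows: Seq[(String, Vector[Double])]): EngineMatrix =
    new LocalMatrix(rows)
}

// Pipeline code (and its tests) are written once, against the trait only.
object Pipeline {
  def interactionCounts(ctx: EngineContext,
                        rows: Seq[(String, Vector[Double])]): Map[String, Double] =
    ctx.parallelize(rows).rowSums
}
```

With this shape, a test for `Pipeline.interactionCounts` could be run once per
engine by passing a different context, and only the small adapters would need
engine-specific fixes when the DSL changes.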
I raised this red flag long ago, and in the heat of other issues it seemed
minor, but I don't think it can be ignored anymore.
> H2O integration
> ---------------
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
> Issue Type: Improvement
> Reporter: Anand Avati
> Fix For: 1.0
>
>
> Provide H2O backend for the Mahout DSL