[ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058934#comment-14058934
 ] 

ASF GitHub Bot commented on MAHOUT-1500:
----------------------------------------

Github user pferrel commented on the pull request:

    https://github.com/apache/mahout/pull/21#issuecomment-48749896
  
    Are the scalatests from the Spark module that cover math-scala code 
implemented here somewhere? I'd vote against merging until those are all in 
place and passing.
    
    The cf stuff has a rather major bug that I'm working on, so I wouldn't move 
this into math-scala just yet, although it would make an interesting speed 
comparison once completed. The cf changes will require DSL additions that will 
be under separate review. I don't have a PR number yet.
    
    Also, I may have missed it, but there should be clear instructions for how 
to build this and run it. This is like a heart transplant: before you release 
the patient, make sure all systems are working correctly. The DSL is not the 
whole body. There should at least be some end-to-end pipelines in examples that 
anyone can run from a local installation.
    
    Beyond these details I have a bigger issue with merging this. Now every 
time the DSL is changed it may break things in H2O-specific code. It already 
does in cf, for instance, but I've signed up to fix those for Spark. No 
committer has signed up to fix code in both Spark and H2O. IMO this is untenable.
    
    To solve this, the entire data prep pipeline must be virtualized to run on 
either engine so the tests for things like CF and ItemSimilarity (and the 
multitude of others to come) pass and are engine independent. As it stands, any 
DSL change that breaks the build will have to rely on a contributor's fix. Even 
if one of you guys were made a committer, we would still have this problem where 
a needed change breaks engine-specific code for one engine or the other. Unless 
99% of the entire pipeline is engine neutral, the build will be unmaintainable.
    
    Crudely speaking, this means doing away with all references to a 
SparkContext and any use of it. So it's not just a matter of reproducing the 
Spark module but of reducing the need for one, making it so small that 
breakages in one or the other engine's code will be infrequent.
    
    I raised this red flag long ago. In the heat of other issues it seemed 
minor, but I don't think it can be ignored anymore.


> H2O integration
> ---------------
>
>                 Key: MAHOUT-1500
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Anand Avati
>             Fix For: 1.0
>
>
> Provide H2O backend for the Mahout DSL



--
This message was sent by Atlassian JIRA
(v6.2#6252)
