[ 
https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237320#comment-14237320
 ] 

ASF GitHub Bot commented on MAHOUT-1493:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

    https://github.com/apache/mahout/pull/32#issuecomment-65961642
  
    I’ve had some time to work on this recently and have Naïve Bayes 
implemented in the math-scala and spark modules with basic CLI drivers (for 
Spark) and working with the classify-20newsgroups.sh example.  It still needs 
some refactoring and cleanup. In the math-scala implementation, there are some 
problems with the `NaiveBayes.extractLabelsAndAggregateObservations(…)` 
function in NaiveBayes.scala.  I’ve put together a hack using `mapBlock(…)` 
which works, but very inefficiently (it uses 2 transposes on a large set).  
Not very pretty.  Also, in order to convert from a String-Keyed to an 
Int-Keyed DRM it is necessary to collect the entire dataset up front.  I’m 
going to investigate a `.toIntKeyed()` function for `DrmLike`, though I’m still 
not all that familiar with the RDD backing. So currently I’m just overriding it 
in a `SparkNaiveBayes` object in the spark package. 
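
    To make the two operations above concrete, here is a minimal sketch in 
plain Scala. It is not the code from the pull request and does not use the 
Mahout DSL or `mapBlock(…)`: ordinary collections stand in for a DRM, and the 
names `LabelAggregationSketch`, `labelOf` and `aggregateByLabel` are 
hypothetical. It also assumes that the label is the first path segment of the 
String row key (for example `/alt.atheism/51060`), which may not match the 
actual key layout.

```scala
// Plain Scala collections stand in for a String-keyed DRM here.
object LabelAggregationSketch {
  // One row: a String key plus a dense observation vector.
  type Row = (String, Array[Double])

  // Assumption: the label is the first non-empty path segment of the key.
  def labelOf(key: String): String =
    key.split("/").filter(_.nonEmpty).head

  // Returns (label -> Int key, Int key -> summed observations for that label).
  def aggregateByLabel(rows: Seq[Row]): (Map[String, Int], Map[Int, Array[Double]]) = {
    // Assign each distinct label a stable Int key (sorted order).
    val labelIndex: Map[String, Int] =
      rows.map(r => labelOf(r._1)).distinct.sorted.zipWithIndex.toMap

    // Element-wise sum of all vectors that share a label.
    val sums: Map[Int, Array[Double]] =
      rows.groupBy(r => labelIndex(labelOf(r._1))).map { case (idx, group) =>
        idx -> group.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      }

    (labelIndex, sums)
  }
}
```

    The sketch builds the label dictionary in memory, so it mirrors the 
"collect everything up front" behaviour described above rather than solving 
it. A distributed version would need the dictionary built and shared across 
partitions.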
    
    There are three (trivial) MRLegacy dependencies:  
`ComplementaryThetaTrainer`, `ResultAnalyzer` and `ClassifierResult`.  I’ll 
probably port `ComplementaryThetaTrainer` into the Math-Scala module.  Maybe 
`ComplementaryThetaTrainer` and `ResultAnalyzer` are candidates to be moved 
into Mahout-Math.  I’m not sure if we’re trying to keep MRLegacy out of 
Math-Scala?
    
    A few issues that I still need to work out:
    
    1.  A bug in the H2ODrm: `H2OHelper.java` uses a `water.fvec.Vec` vector 
to store label keys but `water.fvec.Vec` does not accept String values (throws 
an error).  So we need a new distributed data structure to store String Row 
Labels for H2O DRMs. Otherwise this should be working for H2O. 
    
    2.  Need a distributed method for the conversion from String-Keyed to 
Int-Keyed DRMs (as mentioned above).
    
    3.  As is, the `NBModel` is not fully serializable (several 
`RandomAccessSparseVectors` fields and a `Matrix` field in the class).  I’m 
still working out a way to broadcast it to the `mapBlock` closure. Should be a 
relatively straightforward fix.  So `NaiveBayes.test(…)` is running sequentially 
for now, and collecting the full test set up front.  I haven’t looked at it too 
closely, but I’m wondering if we can add a `drmBroadcast(…)` method for 
arbitrary (serializable) Broadcast Objects?  
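
    As an illustration of issue 3, one possible direction is to copy the 
model's vector and matrix contents into plain arrays before shipping it to a 
closure. The sketch below is plain Scala with only JDK serialization. 
`ModelSnapshot`, its field names and `toBytes` are hypothetical and are not 
part of `NBModel` or of any Mahout API. Whether this fits how the model is 
actually built is not settled here.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a model whose state is held only in plain arrays,
// which Java serialization can handle without custom code.
final case class ModelSnapshot(
    weightsPerLabel: Array[Double],
    weightsPerFeature: Array[Double],
    weightsPerLabelAndFeature: Array[Array[Double]])

object ModelSnapshot {
  // Serializes the snapshot with standard Java serialization.
  // Throws java.io.NotSerializableException if any field cannot be serialized.
  def toBytes(model: ModelSnapshot): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(model) finally out.close()
    buffer.toByteArray
  }
}
```

    Calling `ModelSnapshot.toBytes(…)` on a snapshot is a quick local check 
that nothing in it would fail serialization before it is captured by a 
distributed closure.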
    
    Next steps:
    
    1.  Address above issues.
    
    2.  Add in more CLI options.
     
    3.  Add more tests.
    
    4.  Implement the `NaiveBayes.classifyNew(…)` method as outlined in MAHOUT-1564. 
    
    
    If there are no objections, I’ll probably commit this after a bit more 
cleanup and some style fixes, and then work on the next steps from there.
    
    Any input is appreciated.



> Port Naive Bayes to the Spark DSL
> ---------------------------------
>
>                 Key: MAHOUT-1493
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1493
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>            Reporter: Sebastian Schelter
>            Assignee: Andrew Palumbo
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, 
> MAHOUT-1493.patch, MAHOUT-1493a.patch
>
>
> Port our Naive Bayes implementation to the new spark dsl. Shouldn't require 
> more than a few lines of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
