[jira] [Comment Edited] (MAHOUT-1493) Port Naive Bayes to the Spark DSL

Andrew Palumbo (JIRA) Sun, 21 Dec 2014 17:06:23 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255330#comment-14255330
 ]


Andrew Palumbo edited comment on MAHOUT-1493 at 12/22/14 1:05 AM:
------------------------------------------------------------------

Thanks for looking at this, Pat.  My thinking so far on using Sequence files 
is that the input to Naive Bayes (for now) would be the output from 
seq2sparse: a <Text,VectorWritable> Sequence file.  This is really just to 
keep it simple for now.  The CLI drivers that I wrote are a rough cut that I 
basically copied and pasted from your Item-similarity drivers (very cool, by 
the way).

I'm all for taking other input formats, especially since we're hopefully 
moving on from MR seq2sparse soon.  I need to read up on what you did with the 
Readers and Writers.

The issue that a lot of people on the list are having with MRLegacy NB (if I'm 
thinking of the same issue that you mentioned) is that the Key of each 
vectorized document has to contain a Category.  The Mahout process for text 
vectorization has been to separate documents into directories by their 
respective categories, then run `mahout seqdirectory`, which converts each 
document into a sequence file with \directory\doc_id as a key, then 
seq2sparse, etc.  The category extraction step in MRLegacy NB was hardcoded 
as a split on "\", labeling the document with the first token after the split 
(the directory id), so I've kind of taken this as the convention.
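
As a rough sketch of that convention (not the actual MRLegacy code), the 
label can be pulled out of a key by splitting on the separator and taking the 
first non-empty token.  The object name, method name and example keys are 
hypothetical, and the separator is a parameter that defaults to "\" as 
written above.

```scala
object KeyLabelSketch {
  // Hypothetical helper: returns the first non-empty token of the key,
  // i.e. the directory (category) part of a key like \directory\doc_id.
  // Returns an empty string if the key contains no non-empty token.
  def labelFromKey(key: String, separator: Char = '\\'): String =
    key.split(separator).find(_.nonEmpty).getOrElse("")
}
```

With this sketch, a key such as \sci.space\doc_42 would be labeled 
sci.space, and the document id part is ignored.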

Internally, the document ids are discarded and the TF-IDF weights are 
aggregated by category; the per-document weights are then discarded.
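
The aggregation step could be sketched as follows, using plain Scala 
collections rather than the Mahout DSL.  Dense `Array[Double]` vectors stand 
in for the TF-IDF vectors, all vectors are assumed to have the same length, 
and `aggregateByCategory` and `labelOf` are hypothetical names.

```scala
object AggregateSketch {
  // Sums per-document weight vectors into one vector per category.
  // The document key is used only to derive the label and is then dropped,
  // so no per-document information survives in the result.
  def aggregateByCategory(
      docs: Seq[(String, Array[Double])],
      labelOf: String => String): Map[String, Array[Double]] =
    docs
      .groupBy { case (key, _) => labelOf(key) }
      .map { case (label, group) =>
        // Element-wise sum of all vectors that share this label.
        label -> group.map(_._2).reduce { (a, b) =>
          a.zip(b).map { case (x, y) => x + y }
        }
      }
}
```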

I've tried to relax the Category extraction convention in the DSL Naive Bayes 
by letting the developer pass an arbitrary `String => String` function in the 
aggregation constructor.  This way the labels can be extracted by, e.g., a 
regex pattern.  I'd like to incorporate this as an option in the CLI driver 
but haven't made it that far yet.
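
To illustrate the idea, here is a minimal sketch of a regex-based extractor 
handed to a constructor as a `String => String`.  `LabelCounter` is a 
hypothetical stand-in, not the actual aggregation class in the patch, and the 
"unknown" fallback label and the example keys are arbitrary assumptions.

```scala
import scala.util.matching.Regex

object ExtractorSketch {
  // Hypothetical stand-in for a class that receives the label extractor
  // as a constructor argument; here it only counts keys per label.
  class LabelCounter(extractLabel: String => String) {
    def countByLabel(keys: Seq[String]): Map[String, Int] =
      keys.groupBy(extractLabel).map { case (label, ks) => label -> ks.size }
  }

  // Builds an extractor from a regex; the first capture group is the label.
  // Keys that do not match are labeled "unknown" (an arbitrary choice).
  def regexExtractor(pattern: Regex): String => String =
    key => pattern.findFirstMatchIn(key).map(_.group(1)).getOrElse("unknown")
}
```

For example, `ExtractorSketch.regexExtractor("^([a-z]+)_".r)` would label a 
key like spam_001 as spam, without requiring any directory structure in the 
key.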

So I think that with any format, there may be some confusion over the document 
labeling.   But I agree that we should support other file formats as input.  









> Port Naive Bayes to the Spark DSL
> ---------------------------------
>
>                 Key: MAHOUT-1493
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1493
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>            Reporter: Sebastian Schelter
>            Assignee: Andrew Palumbo
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch, 
> MAHOUT-1493.patch, MAHOUT-1493a.patch
>
>
> Port our Naive Bayes implementation to the new spark dsl. Shouldn't require 
> more than a few lines of code.



