[
https://issues.apache.org/jira/browse/MAHOUT-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255330#comment-14255330
]
Andrew Palumbo edited comment on MAHOUT-1493 at 12/22/14 1:05 AM:
------------------------------------------------------------------
Thanks for looking at this, Pat. My thinking so far on using sequence files
is that the input to Naive Bayes (for now) would be the output from
seq2sparse: a <Text,VectorWritable> sequence file. This is really just to
keep it simple for now, and the CLI drivers that I wrote are a rough cut that
I basically copied and pasted from your Item-similarity drivers (very cool,
by the way).
I'm all for taking other input formats, especially since we're hopefully
moving on from MR seq2sparse soon. I need to read up on what you did with the
Readers and Writers.
The issue that a lot of people are having on the list with MRLegacy NB (if I'm
thinking of the same issue you mentioned) is that each vectorized document has
to contain a category in its key. The Mahout process for text vectorization
has been to separate documents into directories by their respective
categories, then run `mahout seqdirectory`, which converts each document into
a sequence file with \directory\doc_id as the key, then seq2sparse, etc. The
category extraction step in MRLegacy NB was hardcoded as a split on "\",
labeling the document with the first token after the split (the directory
id), so I've kind of taken this as the convention. Internally the document ids
are discarded and the TF-IDF weights are aggregated by category, after which
the per-document weights are discarded.
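As a rough illustration of that convention, here is a sketch in plain Scala
collections rather than the actual Mahout code. The keys, the weights, and the
names `labelFromKey` and `aggregateByLabel` are made up for the example, and
it assumes dense vectors of equal length:

```scala
object CategoryAggregationSketch {
  // Hypothetical stand-in for the hardcoded step: split the key on the
  // separator and take the first non-empty token (the directory id).
  def labelFromKey(key: String, sep: Char = '\\'): String =
    key.split(sep).find(_.nonEmpty).getOrElse("")

  // Sum the per-document TF-IDF vectors by label; document ids are dropped.
  def aggregateByLabel(docs: Seq[(String, Array[Double])]): Map[String, Array[Double]] =
    docs
      .groupBy { case (key, _) => labelFromKey(key) }
      .map { case (label, group) =>
        label -> group.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      }

  def main(args: Array[String]): Unit = {
    // Made-up documents keyed as \directory\doc_id.
    val docs = Seq(
      "\\sports\\doc_1" -> Array(1.0, 0.0, 2.0),
      "\\sports\\doc_2" -> Array(0.5, 1.0, 0.0),
      "\\politics\\doc_3" -> Array(0.0, 3.0, 1.0)
    )
    aggregateByLabel(docs).foreach { case (label, v) =>
      println(s"$label -> ${v.mkString(", ")}")
    }
  }
}
```

Only one aggregated vector per category survives; the individual document
vectors and ids are not kept.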
I've tried to relax the category extraction convention in the DSL Naive Bayes
by letting the developer pass an arbitrary `String => String` function to the
aggregation constructor. This way the labels can be extracted by, e.g., a
regex pattern. I'd like to incorporate this as an option in the CLI driver but
haven't made it that far yet.
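A minimal sketch of what such an extractor could look like, again in plain
Scala. The names `extractByRegex` and `labelDocs` and the key format are
invented for the example; this is not the signature of the actual aggregation
constructor:

```scala
object LabelExtractorSketch {
  // Hypothetical key format for this example: "label=<category>;id=<doc id>".
  private val LabelPattern = "label=([^;]+)".r

  // A String => String extractor built on a regex; keys without a label
  // fall back to "unknown".
  val extractByRegex: String => String = key =>
    LabelPattern.findFirstMatchIn(key).map(_.group(1)).getOrElse("unknown")

  // Stand-in for the aggregation step: it takes the extractor as a parameter
  // instead of hardcoding a split, and pairs each key with its label.
  def labelDocs(keys: Seq[String], extract: String => String): Seq[(String, String)] =
    keys.map(k => extract(k) -> k)

  def main(args: Array[String]): Unit = {
    val keys = Seq("label=sports;id=1", "label=politics;id=2", "id=3")
    labelDocs(keys, extractByRegex).foreach(println)
  }
}
```

Any other `String => String` function, such as the split-based convention,
could be passed in the same way.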
So I think that with any format, there may be some confusion over the document
labeling. But I agree that we should support other file formats as input.
> Port Naive Bayes to the Spark DSL
> ---------------------------------
>
> Key: MAHOUT-1493
> URL: https://issues.apache.org/jira/browse/MAHOUT-1493
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Reporter: Sebastian Schelter
> Assignee: Andrew Palumbo
> Fix For: 1.0
>
> Attachments: MAHOUT-1493.patch, MAHOUT-1493.patch, MAHOUT-1493.patch,
> MAHOUT-1493.patch, MAHOUT-1493a.patch
>
>
> Port our Naive Bayes implementation to the new spark dsl. Shouldn't require
> more than a few lines of code.