>
> I don't understand the value of MultilabelledVector
>
> Currently I am planning a pure M/R Trainer. Having a Labelled/MultiLabelled
> vector means I will be able to store the label as an int. I can pass in the
> list of labels as a parameter and use the items, in order, to generate the
> label ids.
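As an aside on the label-as-int idea above, here is a minimal single-process sketch (plain Java; `LabelIndex` and its methods are illustrative names, not actual Mahout classes) of deriving stable int ids from the label list passed in as a parameter:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the label list is passed in as a job parameter, and
// each label's id is its position in that list, so every mapper derives the
// same int for the same label string without any shared dictionary.
public class LabelIndex {
    private final Map<String, Integer> idOf = new HashMap<>();

    public LabelIndex(List<String> labels) {
        for (int i = 0; i < labels.size(); i++) {
            idOf.put(labels.get(i), i);
        }
    }

    // Returns the int id that would be stored in the labelled vector.
    public int idFor(String label) {
        Integer id = idOf.get(label);
        if (id == null) {
            throw new IllegalArgumentException("unknown label: " + label);
        }
        return id;
    }

    public static void main(String[] args) {
        LabelIndex idx = new LabelIndex(
            List.of("alt.atheism", "comp.graphics", "sci.space"));
        System.out.println(idx.idFor("comp.graphics")); // prints 1
    }
}
```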
> > I will now modify the DictionaryVectorizer to output the subdirectory
> > chain as a label.
> >
>
> If DictionaryVectorizer is 20-newsgroups specific, then that is OK. In
> general, there will be too many documents to store one per file, and it may
> be difficult to segregate the data into one category per directory.
>
> > SequenceFileFromDirectory will create text sequence files with the name as
> > "./Subdir1/Subdir2/file"
> > DictionaryVectorizer will run an extra job which takes the named vectors it
> > generates, and makes labelled vectors from them.
> >
>
> I can't have an opinion here.
>
Re: to both
Yes. So I will drop this preprocessing and let the user write their own. But
to complete an end-to-end example from a directory of documents (a.k.a. 20
newsgroups), I will write the preprocessing as an M/R job in examples.
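For that examples job, extracting the label from a `"./Subdir1/Subdir2/file"` style key could look like the following sketch (the class and method names are made up for illustration; this is not the actual Mahout code):

```java
// Hypothetical sketch: given a SequenceFileFromDirectory-style key such as
// "./Subdir1/Subdir2/file", treat the subdirectory chain as the label.
public class LabelFromName {

    public static String labelOf(String name) {
        String trimmed = name.startsWith("./") ? name.substring(2) : name;
        int lastSlash = trimmed.lastIndexOf('/');
        // Everything before the file name is the label; no slash means no label.
        return lastSlash < 0 ? "" : trimmed.substring(0, lastSlash);
    }

    public static void main(String[] args) {
        System.out.println(labelOf("./Subdir1/Subdir2/file")); // prints Subdir1/Subdir2
    }
}
```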
>
> >
> > The question is the handling of the LabelDictionary. This is a messy way
> > of handling it. The other way is to let naivebayes read the data as
> > NamedVectors and take care of tokenizing and extracting the label from the
> > name (two choices
> >
>
> My big questions center on how this might be used in a production setting.
> In that case, the assumption of input in files breaks down, because the user
> will probably have their own intricate input setup. If we assume that the
> input will be in the form of hashed feature vectors, then the following
> outline seems reasonable to me:
>
> algorithm = new NaiveBayes(...)
>
> for all training examples {
>     int actual = target variable value
>     Vector features = vectorize example
>     algorithm.train(actual, features) // secretly save vector as appropriate
> }
>
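As a single-process illustration of that outline (with made-up class names, not the proposed M/R implementation; a dense `double[]` stands in for `Vector`), naive Bayes training amounts to accumulating per-label counts and feature sums:

```java
// Hypothetical sketch of the train(actual, features) outline above:
// naive Bayes training only needs per-label feature sums and label counts.
public class TinyNaiveBayesTrainer {
    private final double[][] featureSums; // featureSums[label][feature]
    private final long[] labelCounts;

    public TinyNaiveBayesTrainer(int numLabels, int numFeatures) {
        featureSums = new double[numLabels][numFeatures];
        labelCounts = new long[numLabels];
    }

    // "Secretly save vector as appropriate": accumulate into the model state.
    public void train(int actual, double[] features) {
        labelCounts[actual]++;
        for (int f = 0; f < features.length; f++) {
            featureSums[actual][f] += features[f];
        }
    }

    public long countFor(int label) {
        return labelCounts[label];
    }

    public double sumFor(int label, int feature) {
        return featureSums[label][feature];
    }

    public static void main(String[] args) {
        TinyNaiveBayesTrainer t = new TinyNaiveBayesTrainer(2, 3);
        t.train(0, new double[] {1, 0, 2});
        t.train(0, new double[] {0, 1, 1});
        t.train(1, new double[] {5, 0, 0});
        System.out.println(t.countFor(0) + " " + t.sumFor(0, 2)); // prints 2 3.0
    }
}
```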
This isn't scalable: it is a single process writing files to the cluster.
There could be many ways of forming the input data:
1) like above
2) The user writes an M/R job over their input data format and writes the
data in the required input format, i.e. tf-vectors or a sequence file of
text. The tf-idf job will then execute over this (either from tf vectors or
from text) to create the tf-idf vectors, and the Bayes trainer will execute
using them. The classifier will use the dictionary to map strings to ids and
use a dot product to classify.
3) The user writes an M/R job over their input data format and uses hashed
encoders to create the vectors. The Bayes trainer executes over the
generated file. The hashed encoders are initialized in the classifier in
exactly the same way, and the classifier classifies.
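A rough sketch of the hashed-encoder idea in option 3 (a generic hashing trick with illustrative names, not Mahout's actual encoder API): because hashing is deterministic, the trainer-side M/R and the classifier build identical vectors without sharing a dictionary.

```java
import java.util.Arrays;

// Hypothetical sketch of option 3: hash each token into a fixed-size vector.
// Trainer and classifier construct the encoder with the same vector size, so
// the same token always lands in the same position on both sides.
public class HashedEncoder {
    private final int numFeatures;

    public HashedEncoder(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    public double[] encode(String[] tokens) {
        double[] vector = new double[numFeatures];
        for (String token : tokens) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(token.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        HashedEncoder encoder = new HashedEncoder(16);
        double[] a = encoder.encode(new String[] {"hadoop", "bayes", "hadoop"});
        double[] b = encoder.encode(new String[] {"hadoop", "bayes", "hadoop"});
        System.out.println(Arrays.equals(a, b)); // prints true
    }
}
```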
Robin