Now I am going to add some comments from the CNB classifier perspective; let's see how we can integrate things together.
> Generally, Euclidean or L_1 distances are about all that makes sense for
> these vectors. For clustering, I worry that I don't take IDF into account
> (there is some provision for that in the AdaptiveWordEncoder, though). For
> most learning applications, IDF shouldn't matter except that it might make
> convergence faster by reducing the size of the largest eigenvalue.

NB/CNB actually computes a sort of cluster average for each class and selects the class with the minimum cosine/dot product. First the documents are normalized; then normalized sums of weights are computed instead of raw word counts. This is the key step that boosts classification accuracy on text. I can move this into the document vectorizer.

Once we have the tf-idf vectors, CNB computes the sum of weights for each feature, the sum for each label, and the total over the whole matrix, so it looks rather like the 2x2 LLR computation table. The actual feature weight in a class vector is then calculated from these four values.

With this new vectorization, can we hash sparse features into a particular id range and ask the tf-idf job to compute tf-idf for just that portion? That would let me delete the tf-idf calculation code from CNB. It can exist as a separate vectorizer that both clustering and classification use. It will partially kill its online nature, but we can circumvent that by using a document-frequency map to compute approximate tf-idf during the online stage.

> > About the dictionary based trace. I need to actually see how the trace is
> > useful. Do you keep track of the most important feature from those that
> > go into a particular hashed location?

> Right now, I pretty much assume that there are no collisions. This isn't
> always all that great an assumption. To get rid of that problem, it is
> probably pretty easy to do a relaxation step where I generate an
> explanation for a vector and then generate the vector for that explanation.
> If there are collisions, this last vector will differ slightly from the
> original, and the explanation of the difference should get us much closer
> to the original.

Actually, I am not seeing a big need to reverse engineer data at the moment; it is good for debugging but not so necessary in production. Let's prioritise getting this plugged in, and work on this afterwards.
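For reference, the CNB weight computation I described above (per-feature sums, per-label sums, cell values, and the grand total, as in a 2x2 LLR table) can be sketched roughly as follows. This is a hypothetical illustration, not Mahout code; the names `cnb_weights`, `classify`, and the smoothing parameter `alpha` are my own, and I'm following the complement-weight formula from the CNB paper rather than any particular implementation:

```python
import math

def cnb_weights(cell, num_features, alpha=1.0):
    """Complement NB weights from the four sums described in the text.

    cell[c][i] is the summed (tf-idf) weight of feature i in class c.
    The four values per (class, feature) pair are: cell[c][i],
    feature_sum[i], label_sum[c], and the grand total.
    """
    feature_sum = {}   # per-feature sum across all classes
    label_sum = {}     # per-class sum across all features
    total = 0.0
    for c, feats in cell.items():
        for i, w in feats.items():
            feature_sum[i] = feature_sum.get(i, 0.0) + w
            label_sum[c] = label_sum.get(c, 0.0) + w
            total += w

    weights = {}
    for c in cell:
        # complement statistics: everything *not* in class c
        comp_total = total - label_sum[c] + alpha * num_features
        weights[c] = {}
        for i in feature_sum:
            comp_i = feature_sum[i] - cell[c].get(i, 0.0) + alpha
            weights[c][i] = math.log(comp_i / comp_total)
        # weight normalization step, as in the CNB paper
        norm = sum(abs(v) for v in weights[c].values())
        for i in weights[c]:
            weights[c][i] /= norm
    return weights

def classify(weights, doc):
    # select the class whose complement weights give the
    # minimum dot product with the document vector
    return min(weights, key=lambda c: sum(
        tf * weights[c].get(i, 0.0) for i, tf in doc.items()))
```

A document whose features are rare in the complement of a class gets strongly negative weights for that class, so the minimum dot product picks it out, which matches the "minimum cosine/dot product" selection described above.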

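The document-frequency-map idea for keeping the online behaviour could look something like the sketch below. The class name `OnlineTfIdf` and its methods are assumptions for illustration only; the point is that a running df map gives an approximate idf that improves as more documents stream through:

```python
import math

class OnlineTfIdf:
    """Approximate tf-idf for streaming documents, using a running
    document-frequency map instead of a precomputed dictionary.
    A hypothetical sketch, not Mahout code."""

    def __init__(self):
        self.df = {}        # term -> number of docs seen containing it
        self.num_docs = 0

    def vectorize(self, term_counts):
        # update the df map with this document first
        self.num_docs += 1
        for term in term_counts:
            self.df[term] = self.df.get(term, 0) + 1
        # idf from what has been seen so far -- approximate: early
        # documents get noisier estimates than later ones
        vec = {}
        for term, tf in term_counts.items():
            idf = math.log((self.num_docs + 1) / (self.df[term] + 1)) + 1.0
            vec[term] = tf * idf
        # L2-normalize, matching the document normalization step above
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {t: v / norm for t, v in vec.items()}
```

A batch tf-idf job could then replay the final df map to produce exact vectors, while the online stage uses these approximations.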