On Tue, Jun 8, 2010 at 12:20 AM, Robin Anil <[email protected]> wrote:
> > Generally, Euclidean or L_1 distances are about all that makes sense for
> > these vectors. For clustering, I worry that I don't take IDF into account
> > (there is some provision for that in the AdaptiveWordEncoder, though). For
> > most learning applications, IDF shouldn't matter except that it might make
> > convergence faster by reducing the size of the largest eigenvalue.

> NB/CNB actually computes a sort of cluster average for a class and selects
> the one with the minimum cosine/dot product.

Dot product <=> Euclidean metric. Tf-idf will definitely help a lot for this sort of thing.

> First the documents are normalized, then normalized sums of weights are
> computed instead of computing the word count. This is the key step which
> boosts the classification accuracy on text. I can move this to the
> document vectorizer.

And the IDF weighting can be done on-line or in two passes. The two-pass approach is more precise, but not necessarily by much. A compromise is also possible where the first pass runs over a small subset of the documents (say, 10,000 docs). That keeps it really fast, and that dictionary can be used as the seed for the adaptive weighting (or just used directly).

> With this new vectorization, can we hash sparse features to a particular id
> range and ask the tfidf job to compute tfidf for just that portion? This
> means I can delete the tfidf calculation code for CNB. This can exist as a
> separate vectorizer, and both clustering and classification can use it.
>
> It will partially kill its online nature. We can circumvent that using a
> document-frequency map to compute approximate tf-idf during the online
> stage.

I think that you misunderstand me a little bit, and I know that I am not understanding what you are saying here. The new vectorizer can definitely do IDF weighting, and that definitely makes it a good driver for classifiers and clustering.
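The two-pass scheme described above can be sketched roughly as follows. This is hypothetical illustration code, not Mahout's actual vectorizer API: pass 1 counts document frequencies (possibly over a sample), and pass 2 weights each term count by log(N / df) and L2-normalizes, corresponding to the normalized-sums step mentioned in the quoted text.

```java
import java.util.*;

// Hypothetical sketch of two-pass tf-idf weighting over a tiny corpus.
public class TfIdfSketch {
    // Weight one document's term counts by log(numDocs / df) and L2-normalize.
    public static Map<String, Double> tfidf(List<String> doc,
                                            Map<String, Integer> df,
                                            int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : doc) tf.merge(w, 1, Integer::sum);

        Map<String, Double> weights = new HashMap<>();
        double norm = 0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs
                                  / df.getOrDefault(e.getKey(), 1));
            double w = e.getValue() * idf;
            weights.put(e.getKey(), w);
            norm += w * w;
        }
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (Map.Entry<String, Double> e : weights.entrySet())
                e.setValue(e.getValue() / norm);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("hadoop", "mahout", "cluster"),
            Arrays.asList("hadoop", "vector"),
            Arrays.asList("mahout", "vector", "vector"));

        // Pass 1: document frequencies (in the compromise scheme, this pass
        // would run over only a sample of the corpus).
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : corpus)
            for (String w : new HashSet<>(d)) df.merge(w, 1, Integer::sum);

        // Pass 2: weight and normalize each document.
        System.out.println(tfidf(corpus.get(0), df, corpus.size()));
    }
}
```

Running pass 1 over a 10,000-document sample instead of the full corpus gives approximate df counts, which is usually close enough since idf only varies logarithmically with df.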
One important thing about the IDF weighting and conversion is that, except for the weights, the conversion to a vector is stateless. The same document will convert to the same pattern of non-zero elements in the output vector. If you have a constant weight dictionary, then the same document will convert to exactly the same output vector, no matter what. Moreover, if you use adaptive weighting, the weights should be pretty close to the actual weights after you have seen a few thousand documents. And if you have a global estimate from a random sample of documents, then the results should be close to right no matter what.

So I don't understand the two comments that you make, "hash sparse features to a particular id range and ... compute tfidf for just that portion" and "partially kill its online nature". Can you explain these a little more?

Does "hashing to a particular id range" mean hashing only some words and not others? Or does it mean hashing to a sub-range of the output vector? Why would you do either of these? Regardless of why, I think the answer may be yes: the first can be handled by vectorizing only some fields, and the second by passing a view of a sub-vector to the vectorizer instead of the entire vector. Again, though, since I don't understand why you would need to do this, I think I misunderstand your question.

And how can we "kill the online nature" of a stateless algorithm?

> Actually, I am not seeing a big need at the moment to reverse-engineer
> data; it's good for debugging but not so necessary in production. Let's
> prioritise getting this plugged in, and work on this after.

The major use cases here are cluster dumping and model export from logistic regression. I agree with the down-prioritization of accurate dumping.
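The statelessness claim above can be illustrated with a toy sketch (hypothetical code, not the actual AdaptiveWordEncoder): the vector index of a word depends only on its hash and the vector dimension, so identical text always produces an identical pattern of non-zero elements with no dictionary state to maintain.

```java
import java.util.*;

// Hypothetical sketch of a stateless hashed encoder: a word's index is a
// pure function of its hash, so encoding is deterministic and requires no
// stored dictionary.
public class HashedEncoderSketch {
    static final int DIM = 1 << 10;  // size of the output vector

    public static double[] encode(String[] words) {
        double[] v = new double[DIM];
        for (String w : words) {
            int idx = Math.floorMod(w.hashCode(), DIM);  // stateless index
            v[idx] += 1.0;  // the weight here could be an (adaptive) IDF estimate
        }
        return v;
    }

    public static void main(String[] args) {
        String[] doc = {"hadoop", "mahout", "hadoop"};
        // Encoding the same document twice yields the exact same vector.
        System.out.println(Arrays.equals(encode(doc), encode(doc)));
    }
}
```

Only the weights carry state; with a fixed weight dictionary the whole conversion is a pure function of the input text, which is why an "online nature" cannot really be lost.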
