On Tue, Jun 8, 2010 at 12:20 AM, Robin Anil <[email protected]> wrote:
> > Generally, Euclidean or L_1 distances are about all that makes sense for
> > these vectors. For clustering, I worry that I don't take IDF into account
> > (there is some provision for that in the AdaptiveWordEncoder, though). For
> > most learning applications, IDF shouldn't matter except that it might make
> > convergence faster by reducing the size of the largest eigenvalue.

> NB/CNB actually computes a sort of cluster average for a class and selects
> the one with the minimum cosine/dot product.

Dot product <=> Euclidean metric. Tf-idf will definitely help a lot for this sort of thing.

> First the documents are normalized, then normalized sums of weights are
> computed instead of computing the word count. This is the key step which
> boosts the classification accuracy on text. I can move this to the
> document vectorizer.

And the IDF weighting can be done on-line or in two passes. The two-pass approach is more precise, but not necessarily by much. A compromise is also possible where the first pass runs over a small subset of the documents (say, 10,000 docs). That keeps it really fast, and that dictionary can be used as the seed for the adaptive weighting (or just used directly).

> With this new vectorization, can we hash sparse features to a particular id
> range and ask the tfidf job to compute tfidf for just that portion? This
> means I can delete the tfidf calculation code for CNB. This can exist as a
> separate vectorizer, and both clustering and classification can use it.
>
> It will partially kill its online nature. We can circumvent that using a
> document-frequency map to compute approximate tf-idf during the online
> stage.

I think that you misunderstand me a little bit, and I know that I am not understanding what you are saying here. The new vectorizer can definitely do IDF weighting, and that definitely makes it a good driver for classifiers and clustering.
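The two-pass scheme described above can be sketched roughly as follows. This is hypothetical illustration code, not Mahout's actual vectorizer API: pass 1 counts document frequencies (possibly over a sample), and pass 2 weights each term count by log(N / df) and L2-normalizes, corresponding to the normalized-sums step mentioned in the quoted text.

```java
import java.util.*;

// Hypothetical sketch of two-pass tf-idf weighting over a tiny corpus.
public class TfIdfSketch {
    // Weight one document's term counts by log(numDocs / df) and L2-normalize.
    public static Map<String, Double> tfidf(List<String> doc,
                                            Map<String, Integer> df,
                                            int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : doc) tf.merge(w, 1, Integer::sum);

        Map<String, Double> weights = new HashMap<>();
        double norm = 0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs
                                  / df.getOrDefault(e.getKey(), 1));
            double w = e.getValue() * idf;
            weights.put(e.getKey(), w);
            norm += w * w;
        }
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (Map.Entry<String, Double> e : weights.entrySet())
                e.setValue(e.getValue() / norm);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("hadoop", "mahout", "cluster"),
            Arrays.asList("hadoop", "vector"),
            Arrays.asList("mahout", "vector", "vector"));

        // Pass 1: document frequencies (in the compromise scheme, this pass
        // would run over only a sample of the corpus).
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : corpus)
            for (String w : new HashSet<>(d)) df.merge(w, 1, Integer::sum);

        // Pass 2: weight and normalize each document.
        System.out.println(tfidf(corpus.get(0), df, corpus.size()));
    }
}
```

Running pass 1 over a 10,000-document sample instead of the full corpus gives approximate df counts, which is usually close enough since idf only varies logarithmically with df.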
One important thing about the IDF weighting and conversion is that, except for the weights, the conversion to a vector is stateless. The same document will convert to the same pattern of non-zero elements in the output vector. If you have a constant weight dictionary, then the same document will convert to exactly the same output vector, no matter what. Moreover, if you use adaptive weighting, the weights should be pretty close to the actual weights after you have seen a few thousand documents. And if you have a global estimate from a random sample of documents, then the results should be close to right no matter what.

So I don't understand the two comments that you make, "hash sparse features to a particular id range and ... compute tfidf for just that portion" and "partially kill its online nature". Can you explain these a little more?

Does "hashing to a particular id range" mean hashing only some words and not others? Or does it mean hashing to a sub-range of the output vector? Why would you do either of these? Regardless of why, I think the answer may be yes: the first can be handled by vectorizing only some fields, and the second by passing a view of a sub-vector to the vectorizer instead of the entire vector. Again, though, since I don't understand why you would need to do this, I think I misunderstand your question.

And how can we "kill the online nature" of a stateless algorithm?

> Actually, I am not seeing a big need at the moment to reverse-engineer
> data; it's good for debugging but not so necessary in production. Let's
> prioritise getting this plugged in, and work on this after.

The major use cases here are cluster dumping and model export from logistic regression. I agree with the down-prioritization of accurate dumping.
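The statelessness claim above can be illustrated with a toy sketch (hypothetical code, not the actual AdaptiveWordEncoder): the vector index of a word depends only on its hash and the vector dimension, so identical text always produces an identical pattern of non-zero elements with no dictionary state to maintain.

```java
import java.util.*;

// Hypothetical sketch of a stateless hashed encoder: a word's index is a
// pure function of its hash, so encoding is deterministic and requires no
// stored dictionary.
public class HashedEncoderSketch {
    static final int DIM = 1 << 10;  // size of the output vector

    public static double[] encode(String[] words) {
        double[] v = new double[DIM];
        for (String w : words) {
            int idx = Math.floorMod(w.hashCode(), DIM);  // stateless index
            v[idx] += 1.0;  // the weight here could be an (adaptive) IDF estimate
        }
        return v;
    }

    public static void main(String[] args) {
        String[] doc = {"hadoop", "mahout", "hadoop"};
        // Encoding the same document twice yields the exact same vector.
        System.out.println(Arrays.equals(encode(doc), encode(doc)));
    }
}
```

Only the weights carry state; with a fixed weight dictionary the whole conversion is a pure function of the input text, which is why an "online nature" cannot really be lost.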
