Now I am going to add some comments from the CNB classifier perspective; let's see how we can integrate things together.
> Generally, Euclidean or L_1 distances are about all that makes sense for
> these vectors. For clustering, I worry that I don't take IDF into account
> (there is some provision for that in the AdaptiveWordEncoder, though). For
> most learning applications, IDF shouldn't matter except that it might make
> convergence faster by reducing the size of the largest eigenvalue.

NB/CNB actually computes a sort of cluster average for each class and selects the class with the minimum cosine/dot product. First the documents are normalized; then normalized sums of weights are computed instead of raw word counts. This is the key step that boosts classification accuracy on text. I can move this into the document vectorizer.

Once we have the tf-idf vectors, CNB computes the sum of weights for each feature, the sum for each label, and the total over the whole matrix, so it looks rather like the 2x2 LLR computation table. The actual feature weight in a class vector is then calculated from these four values.

With this new vectorization, can we hash sparse features into a particular id range and ask the tf-idf job to compute tf-idf for just that portion? That would let me delete the tf-idf calculation code from CNB. It can exist as a separate vectorizer that both clustering and classification use. It will partially kill its online nature, but we can circumvent that by using a document-frequency map to compute approximate tf-idf during the online stage.

> > About the dictionary based trace. I need to actually see how the trace is
> > useful. Do you keep track of the most important feature from those that
> > go into a particular hashed location?

> Right now, I pretty much assume that there are no collisions. This isn't
> always all that great an assumption. To get rid of that problem, it is
> probably pretty easy to do a relaxation step where I generate an
> explanation for a vector and then generate the vector for that explanation.
> If there are collisions, this last vector will differ slightly from the
> original, and the explanation of the difference should get us much closer
> to the original.

Actually, I am not seeing a big need to reverse engineer data at the moment; it is good for debugging but not so necessary in production. Let's prioritise getting this plugged in, and work on this afterwards.
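For reference, the CNB weight computation I described above (per-feature sums, per-label sums, cell values, and the grand total, as in a 2x2 LLR table) can be sketched roughly as follows. This is a hypothetical illustration, not Mahout code; the names `cnb_weights`, `classify`, and the smoothing parameter `alpha` are my own, and I'm following the complement-weight formula from the CNB paper rather than any particular implementation:

```python
import math

def cnb_weights(cell, num_features, alpha=1.0):
    """Complement NB weights from the four sums described in the text.

    cell[c][i] is the summed (tf-idf) weight of feature i in class c.
    The four values per (class, feature) pair are: cell[c][i],
    feature_sum[i], label_sum[c], and the grand total.
    """
    feature_sum = {}   # per-feature sum across all classes
    label_sum = {}     # per-class sum across all features
    total = 0.0
    for c, feats in cell.items():
        for i, w in feats.items():
            feature_sum[i] = feature_sum.get(i, 0.0) + w
            label_sum[c] = label_sum.get(c, 0.0) + w
            total += w

    weights = {}
    for c in cell:
        # complement statistics: everything *not* in class c
        comp_total = total - label_sum[c] + alpha * num_features
        weights[c] = {}
        for i in feature_sum:
            comp_i = feature_sum[i] - cell[c].get(i, 0.0) + alpha
            weights[c][i] = math.log(comp_i / comp_total)
        # weight normalization step, as in the CNB paper
        norm = sum(abs(v) for v in weights[c].values())
        for i in weights[c]:
            weights[c][i] /= norm
    return weights

def classify(weights, doc):
    # select the class whose complement weights give the
    # minimum dot product with the document vector
    return min(weights, key=lambda c: sum(
        tf * weights[c].get(i, 0.0) for i, tf in doc.items()))
```

A document whose features are rare in the complement of a class gets strongly negative weights for that class, so the minimum dot product picks it out, which matches the "minimum cosine/dot product" selection described above.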

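The document-frequency-map idea for keeping the online behaviour could look something like the sketch below. The class name `OnlineTfIdf` and its methods are assumptions for illustration only; the point is that a running df map gives an approximate idf that improves as more documents stream through:

```python
import math

class OnlineTfIdf:
    """Approximate tf-idf for streaming documents, using a running
    document-frequency map instead of a precomputed dictionary.
    A hypothetical sketch, not Mahout code."""

    def __init__(self):
        self.df = {}        # term -> number of docs seen containing it
        self.num_docs = 0

    def vectorize(self, term_counts):
        # update the df map with this document first
        self.num_docs += 1
        for term in term_counts:
            self.df[term] = self.df.get(term, 0) + 1
        # idf from what has been seen so far -- approximate: early
        # documents get noisier estimates than later ones
        vec = {}
        for term, tf in term_counts.items():
            idf = math.log((self.num_docs + 1) / (self.df[term] + 1)) + 1.0
            vec[term] = tf * idf
        # L2-normalize, matching the document normalization step above
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {t: v / norm for t, v in vec.items()}
```

A batch tf-idf job could then replay the final df map to produce exact vectors, while the online stage uses these approximations.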