On Sat, Sep 25, 2010 at 1:57 PM, Robin Anil <[email protected]> wrote:
> I currently call it in the tf job or idf job at the end when merging the
> partial vectors. This throws away the feature counting and tfidf jobs
> in naive bayes. Now all I need is to port the weight summer and weight
> normalization jobs. Just two jobs to create the model from tfidf vectors.

Reasonable approach.

With the sgd code, I avoid an IDF computation by using an annealed
per-term feature learning rate. If this annealing goes with 1/n where n
is the number of instances seen so far, the final sum is ~ log N where N
is the total number of occurrences. That saves a pass through the data
which, when you are doing online learning, is critical.

> Or
>
> The naive bayes can generate the model from the vectors generated from
> the Hashed Feature vectorizer.
>
> Multi-field documents can generate a word feature = Field + Word. And
> use the dictionary vectorizer or Hash feature vectorizer to convert
> that to vectors.
>
> I say let there be collisions. Since increasing the number of bits can
> decrease the collisions, VW takes that approach. Let the people who
> worry increase the number of bits :)

I also provide the ability to probe the vector more than once. This
makes smaller vectors much more usable, in the same way that Bloom
filters can use smaller, over-filled bit vectors. In production we
clearly see a few cases of collisions when inspecting the dissected
models, but very rarely.
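To make the 1/n annealing argument concrete: the total weight a single term accumulates under that schedule is the harmonic number H_N, which grows like log N. Here is a small sketch of that arithmetic (a toy illustration only, not the actual Mahout sgd code; `annealed_rate_sum` is a hypothetical helper name):

```python
import math

def annealed_rate_sum(total_occurrences):
    """Sum of per-occurrence learning rates 1/n for one term.

    With a 1/n annealing schedule, the cumulative weight applied to a
    term after N occurrences is the harmonic number H_N ~ ln N, which
    gives frequent terms IDF-style damping without a separate counting
    pass over the data.
    """
    return sum(1.0 / n for n in range(1, total_occurrences + 1))

# A term seen 10,000 times accumulates roughly ln(10000) total weight,
# not 10,000 times the weight of a term seen once.
print(annealed_rate_sum(10_000), math.log(10_000))
```

This is why the online learner can skip the IDF pass: the log-scale damping falls out of the learning-rate schedule itself.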

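The multi-probe idea above can be sketched as follows. This is a toy illustration of the technique, not the VW or Mahout implementation; `hashed_vector` and its parameters are hypothetical names. Each token is hashed several times with different seeds and each probe contributes a fraction of the feature's weight, so a single collision only corrupts part of it:

```python
import hashlib

def hashed_vector(tokens, num_bits=18, num_probes=2):
    """Sketch of a hashed feature vectorizer with multiple probes.

    Each token is hashed num_probes times (seeded differently) and each
    probe adds 1/num_probes to its slot. A collision on one probe then
    damages only a fraction of the feature's weight -- the same
    trade-off that lets Bloom filters over-fill small bit vectors.
    """
    size = 1 << num_bits
    vec = {}  # sparse map: slot index -> weight
    for token in tokens:
        for probe in range(num_probes):
            digest = hashlib.md5(f"{probe}:{token}".encode()).hexdigest()
            idx = int(digest, 16) % size
            vec[idx] = vec.get(idx, 0.0) + 1.0 / num_probes
    return vec

# Multi-field documents can prefix each word with its field before
# hashing, giving the Field + Word features mentioned above.
doc = ["title:naive", "title:bayes", "body:naive", "body:bayes"]
print(hashed_vector(doc, num_bits=12, num_probes=3))
```

Note that total mass is preserved regardless of collisions, and raising `num_bits` shrinks the collision rate for those who worry about it.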