On Mon, Jan 18, 2010 at 3:20 PM, Olivier Grisel <olivier.gri...@ensta.org> wrote:
> In the mean time could you please give me a hint on how to value the
> probes of the binary randomizer w.r.t. the window size?

The basic trade-off is the standard hashed-learning trade-off between the
number of training examples, the dimensionality of the actual feature space,
the dimensionality of the reduced feature space, the average non-zero feature
count and the number of probes. Heuristically speaking, I think that
reduced dimension ^ probe count > actual dimension is a good starting point,
especially for a small average non-zero feature count. For probe counts > 5
or so, this is probably a bit pessimistic.

> What is the impact of the allPairs feature?

It causes second-order features consisting of all pairs of actual features to
be considered. This increases the average non-zero count k to roughly k^2/2.
That can be disastrous for convergence time if k is too large.

> What is the impact of the window size? Which values make sense for
> average English documents?

I think a window size of 1-3 makes lots of sense. Most long jargon phrases
are either acronymed or composed of constituent bigrams and trigrams that are
unique enough to make the point.

> Do you think the terms should be stop-word filtered / stemmed first?

I think not. Deleting stop words is not going to help accuracy with this kind
of algorithm, but it may help performance by decreasing the average number of
non-zero features. Stemming would, I think, hurt accuracy if applied in a
blanket way. One enhancement I would like to see is to extend the input to a
fielded document such as is used by Lucene. That would allow one document to
have both stemmed and non-stemmed text so that the algorithm could learn from
whichever seems to help. It would also be interesting to have things like
author/title word cooccurrence pairs for disambiguating authors, and to have
a field that consists of cooccurrence-expanded text.

> Do you think it would be worthwhile to make the Randomizer API compute
> hashes of NGrams (n>1) too?

Yes. That is what allPairs and the window size are for.

> How does the BinaryRandomizer compare to the DenseRandomizer
> (performance-wise and accuracy-wise)?

The DenseRandomizer is largely a performance optimization for the case where
the number of probes equals the reduced dimensionality. It is probably of
most use for the fabled upcoming SVD decomposition code rather than for SGD.

Using the DenseRandomizer would destroy sparseness. With a large enough
reduced dimension, that shouldn't hurt accuracy. Whether it would hurt or
help performance is a very interesting question, and I couldn't really do
more than speculate. If d is the reduced dimension, k the average number of
non-zero features and p the number of probes, you can compute the number of
non-zero reduced features that would need updating on average. As k gets
larger or d smaller, the number of non-zero reduced components as a fraction
of d kind of saturates. The difficulty in guessing which is better is that
where p = d you can make d much smaller than where p = 2 or 3.

As an example, for a document with k = 100 words, d = 1,000,000 and p = 4
probes, we will have roughly 400 non-zero features for the sparse update.
For the dense update, my guess is that d = 300 would give us roughly equal
performance and slightly fewer updates. I guess d = 300 because of the LSA
guys' experience. Your mileage *will* vary, of course.

--
Ted Dunning, CTO
DeepDyve
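To make the probe / window / allPairs discussion above concrete, here is a
minimal, untested Java sketch. This is not the actual Mahout Randomizer API;
the class and method names are made up for illustration. Each base token,
each window n-gram and, optionally, each token pair sets `probes` bits in a
reduced binary vector of dimension `dim`:

    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;

    // Toy sketch of a binary-randomizer-style encoding (illustrative only).
    public class HashedEncoder {
      private final int dim;      // reduced dimension d
      private final int probes;   // number of probes p

      public HashedEncoder(int dim, int probes) {
        this.dim = dim;
        this.probes = probes;
      }

      public BitSet encode(List<String> tokens, int window, boolean allPairs) {
        List<String> features = new ArrayList<>(tokens);
        // window n-grams: adjacent tokens joined, for n = 2 .. window
        for (int n = 2; n <= window; n++) {
          for (int i = 0; i + n <= tokens.size(); i++) {
            features.add(String.join("_", tokens.subList(i, i + n)));
          }
        }
        // allPairs: second-order features from all pairs of base tokens,
        // roughly k^2/2 extra features for k tokens
        if (allPairs) {
          for (int i = 0; i < tokens.size(); i++) {
            for (int j = i + 1; j < tokens.size(); j++) {
              features.add(tokens.get(i) + "&" + tokens.get(j));
            }
          }
        }
        BitSet v = new BitSet(dim);
        for (String f : features) {
          for (int probe = 0; probe < probes; probe++) {
            // salt the hash with the probe index so each probe lands independently
            int h = Math.floorMod((f + "#" + probe).hashCode(), dim);
            v.set(h);
          }
        }
        return v;
      }
    }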
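And a back-of-envelope for the sparse vs. dense update arithmetic, assuming
each of the k*p probe hashes lands independently and uniformly in the d
buckets (an approximation, not a property of the actual implementations):

    // Expected number of non-zero reduced components:
    //   E[nnz] = d * (1 - (1 - 1/d)^(k*p))
    public class NonZeroEstimate {
      static double expectedNonZeros(int d, int k, int p) {
        return d * (1 - Math.pow(1 - 1.0 / d, (double) k * p));
      }

      public static void main(String[] args) {
        // sparse case from the example: k = 100 words, p = 4 probes, d = 1,000,000
        System.out.println(expectedNonZeros(1_000_000, 100, 4)); // ~400
        // small-d case: d = 300, same k and p
        System.out.println(expectedNonZeros(300, 100, 4));       // ~221
      }
    }

With the large d, the expected non-zero count is essentially k*p; with the
small d it saturates toward d itself, which is the saturation effect
described above.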