On Mon, Jan 18, 2010 at 3:20 PM, Olivier Grisel <olivier.gri...@ensta.org> wrote:

>
> In the mean time could you please give me a hint on how to value the
> probes of the binary randomizer w.r.t. the window size?
>

The basic trade-off is the standard hashed learning trade-off between the
number of training examples, the dimensionality of the actual feature space,
the dimensionality of the reduced feature space, the average non-zero feature
count, and the number of probes.

Heuristically speaking, I think that (reduced dimension)^(probe count) >
actual dimension is a good starting point, especially for a small average
non-zero feature count.  For probe counts above 5 or so, this is probably a
bit pessimistic.
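
To make that rule of thumb concrete, here is a tiny back-of-the-envelope
check in plain Java.  This is my own sketch with my own names, not anything
in the Mahout API:

// Back-of-the-envelope check of the heuristic:
// is (reduced dim)^(probes) > actual dim?
// Plain Java sketch, not part of the Mahout API.
public class HashingHeuristic {
  static boolean enoughProbes(long reducedDim, int probes, long actualDim) {
    // doubles avoid overflow for large reduced dimensions
    return Math.pow((double) reducedDim, probes) > (double) actualDim;
  }

  public static void main(String[] args) {
    // 2 probes into 10,000 reduced dimensions cover an actual vocabulary
    // of a million distinct features comfortably by this rule of thumb
    System.out.println(enoughProbes(10000, 2, 1000000));  // prints true
  }
}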


> What is the impact of the allPairs feature?
>

It causes second-order features consisting of all pairs of actual features
to be considered.  This increases the average non-zero count from k to
roughly k^2/2.  That can be disastrous for convergence time if k is too
large.
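
A sketch of what that means, again in throwaway Java rather than the actual
Mahout code path:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the allPairs idea: every unordered pair of original features
// becomes an extra second-order feature.  Not the Mahout implementation.
public class AllPairsSketch {
  static List<String> allPairs(List<String> features) {
    List<String> pairs = new ArrayList<String>();
    for (int i = 0; i < features.size(); i++) {
      for (int j = i + 1; j < features.size(); j++) {
        pairs.add(features.get(i) + "_" + features.get(j));
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    List<String> k10 = Arrays.asList("f0", "f1", "f2", "f3", "f4",
                                     "f5", "f6", "f7", "f8", "f9");
    // 10 features produce 10*9/2 = 45 pairs, i.e. roughly k^2/2 extra non-zeros
    System.out.println(allPairs(k10).size());  // prints 45
  }
}

So a document with a few hundred distinct terms suddenly has tens of
thousands of non-zero features, which is where the convergence-time warning
comes from.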


> What is the impact of the window size? Which value make sense for
> average english documents?
>

I think a window size of 1-3 makes lots of sense.  Most long jargon phrases
are either acronymed or composed of constituent bigrams and trigrams that
are unique enough to make the point.
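
Roughly what I mean by window, as a sketch (my own code, not the randomizer
itself):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of windowed co-occurrence features: each token is paired with the
// next `window` tokens, so window = 1 gives adjacent bigram-like pairs and
// window = 2-3 also picks up short phrases.  Not the actual randomizer code.
public class WindowSketch {
  static List<String> windowPairs(List<String> tokens, int window) {
    List<String> pairs = new ArrayList<String>();
    for (int i = 0; i < tokens.size(); i++) {
      for (int j = i + 1; j <= i + window && j < tokens.size(); j++) {
        pairs.add(tokens.get(i) + "_" + tokens.get(j));
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    List<String> doc = Arrays.asList("latent", "semantic", "analysis", "works");
    System.out.println(windowPairs(doc, 2));
    // [latent_semantic, latent_analysis, semantic_analysis, semantic_works, analysis_works]
  }
}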


> Do you think the terms should be stop-words / stemmed first?
>

I think not.  Deleting stop words is not going to help accuracy with this
kind of algorithm, but it may help performance by decreasing the average
number of non-zero features.

Stemming would, I think, hurt performance if applied in a full-blanket way.
One enhancement I would like to see is to extend the input to a fielded
document such as is used by Lucene.  That would allow one document to have
stemmed and non-stemmed text so that the algorithm could learn from
whichever seems to help.  It would also be interesting to have things like
Author/Title word cooccurrence pairs for disambiguating authors.  It would
also be interesting to have a field that consists of cooccurrence expanded
text.


> Do you think it would be worth to make the Randomizer API compute
> hashes of NGrams (n>1) too?
>

Yes.  That is what allPairs and the window size are for.


> How does the BinaryRandomizer compares to the DenseRandomizer
> (performance wise and accuracy wise)?
>

DenseRandomizer is largely a performance optimization for the case where the
number of probes equals the reduced dimensionality.  It is probably of most
use for the fabled upcoming SVD decomposition code rather than for SGD.
Using the DenseRandomizer would destroy sparseness.  With a large enough
reduced dimension, that shouldn't hurt accuracy.  Whether it would hurt or
help performance is a very interesting question, and I couldn't really do
more than speculate.  If d is the reduced dimension, k the average number of
non-zero features, and p the number of probes, you can compute the number of
reduced components that would need updating on average.  As k gets larger or
d gets smaller, the number of non-zero reduced components as a fraction of d
kind of saturates.  The difficulty in guessing which is better is that where
p = d, you can have d much smaller than where p = 2 or 3.
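
To spell out "you can compute": assuming the k*p probe positions land
roughly independently and uniformly over the d slots, the expected number of
distinct non-zero reduced components is about d * (1 - (1 - 1/d)^(k*p)).
A throwaway sketch of that estimate:

// Occupancy-style estimate of the expected number of distinct reduced
// components hit by k non-zero features with p probes each, assuming the
// k*p hashed positions are independent and uniform over d slots.
// My own sketch, not a Mahout utility.
public class OccupancyEstimate {
  static double expectedNonZero(double d, double k, double p) {
    return d * (1.0 - Math.pow(1.0 - 1.0 / d, k * p));
  }

  public static void main(String[] args) {
    // as k grows or d shrinks, the occupied fraction of d saturates
    System.out.printf("%.0f%n", expectedNonZero(1000000, 100, 4));  // ~400
    System.out.printf("%.0f%n", expectedNonZero(300, 100, 300));    // ~300, fully saturated
  }
}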

As an example, for k = 100 words, d = 1,000,000, and p = 4 probes we will
have roughly 400 non-zero features for the sparse update.  For the dense
update, my guess is that d = 300 would give us roughly equal performance and
slightly fewer updates.  I guess d = 300 because of the LSA guys' experience.
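
Spelling out the arithmetic behind that guess (my reading of the numbers):
the sparse update touches about k * p = 100 * 4 = 400 weights per example,
while a dense update with d = 300 touches all 300 components of the reduced
vector, so the dense version does slightly less work per example even though
it gives up sparsity entirely.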

Your mileage *will* vary, of course.


-- 
Ted Dunning, CTO
DeepDyve
