2011/1/26 Jörn Kottmann <[email protected]>:
>
> I tested that on the Leipzig Corpora for all languages, and generated
> all possible features. In the end I did not see a single hash collision.
>
> Even if there are collisions once in a while it might not harm the detection
> performance that much.

Natural languages are very redundant, and linear models can tolerate
quite a high ratio of collisions before performance starts to
degrade:

  http://hunch.net/~jl/projects/hash_reps/index.html

I think that, in practice, you can project to 1e6 dimensions with a hash
function without any problem for text categorization. I don't know
whether that holds for other kinds of problems, such as NER features.
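
To make the idea concrete, here is a minimal sketch of the hashing trick
in Python. Everything in it is an assumption for illustration: the
function names are made up, character trigrams stand in for whatever
features you actually generate, and crc32 is just a convenient
deterministic hash from the standard library, not necessarily what any
existing implementation uses.

```python
import zlib
from collections import Counter

N_BUCKETS = 1_000_000  # the 1e6 dimensions mentioned above


def char_ngrams(text, n=3):
    """Yield overlapping character n-grams of `text` (illustrative features)."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def bucket(feature, n_buckets=N_BUCKETS):
    """Map a feature string to a bucket index in [0, n_buckets)."""
    # crc32 is stable across runs, unlike the built-in hash() on str.
    return zlib.crc32(feature.encode("utf-8")) % n_buckets


def hash_features(text, n_buckets=N_BUCKETS, n=3):
    """Return a sparse vector {bucket index: count} for the n-grams of `text`."""
    vec = Counter()
    for gram in char_ngrams(text, n):
        vec[bucket(gram, n_buckets)] += 1
    return vec


def count_collisions(features, n_buckets=N_BUCKETS):
    """Count distinct features that land in an already occupied bucket."""
    occupied = set()
    collisions = 0
    for feature in set(features):
        idx = bucket(feature, n_buckets)
        if idx in occupied:
            collisions += 1
        else:
            occupied.add(idx)
    return collisions
```

Running count_collisions over the full set of generated features would
give a collision count comparable to the one reported above, and
shrinking n_buckets shows how quickly collisions start to appear.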

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
