2011/1/26 Jörn Kottmann <[email protected]>:
>
> I tested that on the Leipzig Corpora for all languages, and generated
> all possible features. In the end I did not see a single hash collision.
>
> Even if there are collisions once in a while it might not harm the detection
> performance that much.

Natural languages are very redundant, and linear models can tolerate quite a
high ratio of collisions before performance starts to degrade:

  http://hunch.net/~jl/projects/hash_reps/index.html

I think you can project to 1e6 dimensions with a hash function without any
problem in practice for text categorization. I don't know about other kinds
of problems, such as NER features.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
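
PS: to make the projection above concrete, here is a minimal sketch of the hashing trick in plain Python. It is only an illustration under my own assumptions: the names bucket, hashed_vector and collision_ratio are hypothetical, and MD5 is used simply because it is deterministic and in the standard library, not because any particular tool hashes features this way.

```python
import hashlib
from collections import Counter

N_DIMS = 1_000_000  # the 1e6 dimensions mentioned above


def bucket(feature, n_dims=N_DIMS):
    """Map a string feature to a column index in [0, n_dims)."""
    # Assumption: MD5 as the hash function; any well-mixed hash would do.
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % n_dims


def hashed_vector(features, n_dims=N_DIMS):
    """Sparse count vector: colliding features share (and add up in) a column."""
    vec = Counter()
    for feat in features:
        vec[bucket(feat, n_dims)] += 1
    return vec


def collision_ratio(features, n_dims=N_DIMS):
    """Fraction of distinct features that lost their own column to a collision."""
    distinct = set(features)
    if not distinct:
        return 0.0
    used = {bucket(f, n_dims) for f in distinct}
    return 1.0 - len(used) / len(distinct)
```

Calling collision_ratio on the full set of generated features with n_dims=10**6 gives a quick estimate of how often two features end up in the same column, which is the quantity discussed above.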
