While not a machine learning problem, decomposing compound words
(marginalgrowth -> marginal growth) with Hadoop seems useful in a
large search app. Lucene has DictionaryCompoundWordTokenFilter;
however, for a larger corpus it seems one would build the
dictionary first (i.e. build an index), then use the terms
dictionary as the source for decomposing (and probably not all of
the terms?).
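
Something like the following is what I have in mind for driving the
split off a terms dictionary (a rough, untested sketch in plain Java;
the class name, the minimum-subword cutoff, and the greedy
longest-match strategy are my own assumptions, not what
DictionaryCompoundWordTokenFilter actually does internally):

import java.util.*;

// Sketch: decompose a concatenated token against a dictionary of terms
// harvested from the index. A minimum subword length keeps very short
// terms from producing spurious splits (hence "not all the terms").
public class CompoundSplitter {

    private final Set<String> dictionary;
    private final int minSubwordLength;

    public CompoundSplitter(Set<String> dictionary, int minSubwordLength) {
        this.dictionary = dictionary;
        this.minSubwordLength = minSubwordLength;
    }

    // Returns the subwords if the whole token can be covered by
    // dictionary terms, or null if no full decomposition exists.
    public List<String> split(String token) {
        return split(token, 0);
    }

    private List<String> split(String token, int start) {
        if (start == token.length()) {
            return new ArrayList<String>();          // consumed everything
        }
        // try the longest candidate first so "marginal" beats "margin"
        for (int end = token.length(); end >= start + minSubwordLength; end--) {
            String candidate = token.substring(start, end);
            if (dictionary.contains(candidate)) {
                List<String> rest = split(token, end);
                if (rest != null) {
                    rest.add(0, candidate);
                    return rest;
                }
            }
        }
        return null;                                 // no decomposition found
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(
            Arrays.asList("marginal", "margin", "growth"));
        CompoundSplitter splitter = new CompoundSplitter(dict, 3);
        System.out.println(splitter.split("marginalgrowth")); // [marginal, growth]
    }
}

The dictionary here would just be the (filtered) terms pulled out of the
index, which is the part a Hadoop job could build for a large corpus.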

http://www.google.com/search?q=marginalgrowth 41,100 results
http://www.google.com/search?q=marginal+growth 8,390,000 results
http://www.google.com/search?q="marginal+growth"; 41,100 results

The result counts for marginalgrowth and for the quoted phrase match,
so it looks like they're decomposing the query into a phrase query.
Probably a key -> value lookup on marginalgrowth.
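
A minimal sketch of that kind of lookup, assuming the compound ->
decomposition pairs were precomputed offline (e.g. with Hadoop, as
above); the class and method names are hypothetical:

import java.util.HashMap;
import java.util.Map;

// Sketch: a precomputed key -> value table maps the concatenated form
// to its decomposition; a matching single-term query is rewritten into
// a quoted phrase query, otherwise it is left alone.
public class CompoundQueryRewriter {

    private final Map<String, String> compounds = new HashMap<String, String>();

    public void add(String compound, String decomposition) {
        compounds.put(compound, decomposition);
    }

    // Rewrites e.g. "marginalgrowth" into "\"marginal growth\"".
    public String rewrite(String query) {
        String decomposition = compounds.get(query);
        if (decomposition == null) {
            return query;                    // not a known compound, leave as-is
        }
        return "\"" + decomposition + "\"";  // phrase query on the split form
    }

    public static void main(String[] args) {
        CompoundQueryRewriter rewriter = new CompoundQueryRewriter();
        rewriter.add("marginalgrowth", "marginal growth");
        System.out.println(rewriter.rewrite("marginalgrowth")); // "marginal growth"
        System.out.println(rewriter.rewrite("growth"));         // growth
    }
}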
