While not a machine learning problem, decomposing compound words (marginalgrowth -> marginal growth) with Hadoop would be useful in a large search app. Lucene has DictionaryCompoundWordTokenFilter, but for a larger corpus it seems one would first build the dictionary (i.e. build an index), then use that index's terms dictionary as the source for decomposing (and probably not all of the terms?).
http://www.google.com/search?q=marginalgrowth -- 41,100 results
http://www.google.com/search?q=marginal+growth -- 8,390,000 results
http://www.google.com/search?q="marginal+growth" -- 41,100 results

Looks like they're decomposing the query into a phrase query, probably via a key -> value lookup on marginalgrowth.
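To make the dictionary-lookup idea concrete, here is a minimal sketch (in Python rather than Lucene's Java, and not using Lucene's DictionaryCompoundWordTokenFilter itself) of decomposing an unsegmented token against a term dictionary. The dynamic program and the fewest-parts preference are my assumptions, not anything Lucene or Google is known to do; the term set is a toy stand-in for an index's terms dictionary.

```python
def decompose(token, dictionary):
    """Split `token` into dictionary words, preferring the fewest parts.

    Returns a list of parts, or None if no full decomposition exists.
    Assumption: fewest-parts is the right tie-breaker; a real system
    might weight parts by term frequency instead.
    """
    n = len(token)
    # best[i] = (number_of_parts, parts) for the prefix token[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and token[j:i] in dictionary:
                cand = (best[j][0] + 1, best[j][1] + [token[j:i]])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    return best[n][1] if best[n] else None


# Toy terms dictionary; in practice this would come from the index.
terms = {"marginal", "growth", "margin", "al"}
parts = decompose("marginalgrowth", terms)
# parts == ["marginal", "growth"]; join with spaces to form the phrase query
phrase_query = '"%s"' % " ".join(parts)
```

The fewest-parts rule is what keeps it from answering ["margin", "al", "growth"] here; the resulting phrase query matches the behavior the result counts above suggest.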
