Has anyone tried to generate n-grams from a large body of XML documents
stored in MarkLogic?  Single-word n-grams (unigrams) can be derived from word
lexicons, but what about 2-, 3-, 4-, or 5-word n-grams?  Is there an efficient
way to compute these, perhaps using Hadoop, and then store the results back in
the database so they can be used to look at frequencies over time (as, for
example, the Google Books Ngram Viewer allows)?
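
To make the question concrete, here is a rough Python sketch of the kind of
counting I have in mind. It is only an illustration: it assumes the text and a
publication year have already been extracted from each XML document, the
function names (ngram_counts, ngram_counts_by_year) are made up for this
example, and the tokenizer is deliberately crude. It does not touch MarkLogic
or Hadoop at all.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams of length n in a plain-text string."""
    # Crude tokenizer (an assumption): lowercase, then take runs of word characters.
    words = re.findall(r"\w+", text.lower())
    # Slide a window of n words across the token list; each window is one n-gram.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def ngram_counts_by_year(docs, n):
    """Aggregate n-gram counts per year.

    docs is a hypothetical iterable of (year, text) pairs, standing in for
    whatever would be pulled out of the stored XML documents.
    """
    by_year = {}
    for year, text in docs:
        by_year.setdefault(year, Counter()).update(ngram_counts(text, n))
    return by_year


sample = "the cat sat on the mat and the cat slept"
bigrams = ngram_counts(sample, 2)
yearly = ngram_counts_by_year(
    [(1990, "a b a b"), (1990, "a b"), (2000, "b a")], 2
)
```

The per-document counting is independent of every other document, so it looks
like a natural map step, with the per-year summing as the reduce step. What I
don't know is how best to do that at scale against the database.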

Alan
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
