Has anyone tried to create n-grams from a large body of XML documents stored in MarkLogic? Single-word n-grams (unigrams) can be derived from word lexicons, but what about 2-, 3-, 4-, or 5-word n-grams? Are there efficient ways to do this, perhaps using Hadoop, and then store the results back in the database so they can be used to look at frequencies over time (as, for example, the Google Books Ngram Viewer allows)?
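To make the question concrete, here is a minimal sketch of the kind of counting I have in mind, written in plain Python rather than anything MarkLogic-specific. It assumes the text has already been extracted from the XML into a string, and it uses a simple regex tokenizer; the function name `count_ngrams` and its parameters are hypothetical, chosen only for illustration.

```python
from collections import Counter
import re


def count_ngrams(text, n_min=2, n_max=5):
    """Count word n-grams of length n_min..n_max in a plain-text string."""
    # Assumed tokenizer: lowercase, then keep runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


counts = count_ngrams("the quick brown fox jumps over the quick brown dog")
print(counts["the quick brown"])
```

Per-document counters like this could presumably be summed by publication year (`Counter` objects support `+`), which is roughly the map and reduce shape I imagine a Hadoop job would take, but I don't know whether that is the efficient route here.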
Alan
