It would be great to have n-gram lexicons as a server feature. Without that, I think it could still be built. I would try to use range indexes as much as possible, to leverage their existing features.
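To give a rough idea of the extraction step, here is a minimal sketch in Python rather than MarkLogic code. It assumes a crude regex tokenizer, no stemming, and no attention to sentence boundaries. It emits one element per n-gram occurrence, named by n-gram length, so that a range index on each element name could serve both values and frequencies. The wrapper element name and the function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def tokenize(text):
    """Lowercase and split into word tokens (crude stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())


def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_ngram_elements(text, max_n=3):
    """Build an <ngrams> element with one child per n-gram occurrence."""
    words = tokenize(text)
    root = ET.Element("ngrams")  # hypothetical wrapper element
    for n in range(1, max_n + 1):
        for gram in ngrams(words, n):
            # One element per occurrence, so repeated n-grams stay countable.
            child = ET.SubElement(root, f"ngram-{n}")
            child.text = gram
    return root


if __name__ == "__main__":
    doc = build_ngram_elements("The quick brown fox. The quick dog!", max_n=2)
    print(ET.tostring(doc, encoding="unicode"))
```

Because repeated n-grams are written out once per occurrence, the output grows quickly with max_n, which is the size cost to weigh against being able to count occurrences directly.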
The design of the XML representation might need some careful thought and experiment. The value would be the n-gram itself, a sequence of words. It might be useful to have element names like ngram-1, ngram-2, etc. so that cts:element-values() can select by length.

Likely you'll want to get n-grams for cts:query terms, so I would probably store the n-grams in the main document. This increases the document size but it seems like a desirable trade-off. The alternative is to join using something like document URIs.

If an n-gram occurs X times in the document there should be X n-gram elements, so that cts:frequency can do its job. That increases the cost, but I think it's necessary. Unless it's possible to do something clever with a user-defined lexicon function?

Ingestion seems like the natural time to create the n-grams, possibly using a CPF pipeline. The code for extracting the words themselves could use cts:tokenize and needn't be very complex, although it might run slowly. Hadoop might make sense if you already have a large database, or if it turns out that offloading the extra compute time is a net win.

Would stemming be desirable? If so, that makes it harder to benefit from Hadoop, but maybe not impossible.

-- Mike

On 15 Jan 2013, at 16:44, Alan Darnell <[email protected]> wrote:

> I'm wondering if anyone has tried to create n-grams from a large body of XML
> documents stored in MarkLogic? Single word n-grams can be derived from word
> lexicons. But what about 2, 3, 4, or 5 word n-grams? Are there efficient
> ways to do this, maybe using Hadoop perhaps, and then storing the results
> back in the database so they can be used to look at frequencies over time
> (as, for example, the Google Books n-gram viewer allows)?
>
> Alan

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
