It would be great to have n-gram lexicons as a server feature. Even without 
that, I think something equivalent could be built. I would try to use range 
indexes as much as possible, to leverage their existing features.

The design of the XML representation might need some careful thought and 
experimentation. The value would be the n-gram itself, a sequence of words. It might 
be useful to have element names like ngram-1, ngram-2, etc. so that 
cts:element-values() can select by length.
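
As a rough illustration of that naming idea, here is a sketch in Python rather 
than XQuery. The wrapper element name "ngrams" and the helper ngram_element 
are hypothetical, chosen only for this example:

```python
import xml.etree.ElementTree as ET

def ngram_element(words):
    """Build one element named ngram-N, where N is the number of words."""
    elem = ET.Element("ngram-%d" % len(words))
    elem.text = " ".join(words)
    return elem

# "ngrams" is a hypothetical wrapper name, not something MarkLogic requires.
wrapper = ET.Element("ngrams")
wrapper.append(ngram_element(["range"]))
wrapper.append(ngram_element(["range", "index"]))

# Serialize to a string so the element names are easy to inspect.
xml_text = ET.tostring(wrapper, encoding="unicode")
print(xml_text)
```

With one element name per length, a range index on each name would keep the 
2-grams separate from the 3-grams, and so on.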

You'll likely want to restrict n-gram lookups with a cts:query, so I would 
probably store the n-grams in the main document. This increases the document 
size, but it seems like a desirable trade-off. The alternative is to join using 
something like document URIs. If an n-gram occurs X times in the document there 
should be X n-gram elements, so that cts:frequency can do its job. That 
increases the cost, but I think it's necessary. Unless it's possible to do 
something clever with a user-defined lexicon function?
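
To show why the repeated elements matter, here is a toy model in plain Python 
(not MarkLogic code, and the bigrams helper is just for this example). Keeping 
one entry per occurrence preserves the counts, while keeping only distinct 
values loses them:

```python
from collections import Counter

def bigrams(words):
    """Return every 2-word n-gram, one entry per occurrence."""
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

words = "to be or not to be".split()

# One entry per occurrence, analogous to one n-gram element per occurrence.
occurrences = bigrams(words)
counts = Counter(occurrences)

# Storing only distinct values would discard the per-document frequency.
distinct = set(occurrences)

print(counts["to be"], len(occurrences), len(distinct))
```

Here "to be" appears twice among the five occurrences, but only once among the 
four distinct values.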

Ingestion seems like the natural time to create the n-grams, possibly using a 
CPF pipeline. The code for extracting the words themselves could use 
cts:tokenize and needn't be very complex, although it might run slowly. Hadoop 
might make sense if you already have a large database, or if it turns out that 
offloading the extra compute time is a net win.
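
The extraction step itself can be sketched in a few lines. This is Python, with 
a simple regular-expression tokenizer standing in for cts:tokenize (an 
assumption: the real tokenizer handles punctuation and languages differently), 
and the function names are made up for the example:

```python
import re

def tokenize(text):
    """Crude stand-in for cts:tokenize: lowercase word tokens only."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(words, n):
    """Return all runs of n consecutive words, in document order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def all_ngrams(text, max_n=5):
    """Map each length 1..max_n to the list of n-grams of that length."""
    words = tokenize(text)
    return {n: ngrams(words, n) for n in range(1, max_n + 1)}

result = all_ngrams("The quick brown fox", max_n=3)
for n, grams in result.items():
    print(n, grams)
```

The sliding window is linear in the number of words for each length, so the 
cost is dominated by how many documents and how many lengths you process.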

Would stemming be desirable? If so, that makes it harder to benefit from 
Hadoop, but maybe not impossible.

-- Mike

On 15 Jan 2013, at 16:44 , Alan Darnell <[email protected]> wrote:

> I'm wondering if anyone has tried to create n-grams from a large body of XML 
> documents stored in MarkLogic?  Single word n-grams can be derived from word 
> lexicons.  But what about 2, 3, 4, or 5 word n-grams?  Are there efficient 
> ways to do this, perhaps using Hadoop, and then storing the results 
> back in the database so they can be used to look at frequencies over time 
> (as, for example, the Google Books n-gram viewer allows)?
> 
> Alan
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
> 
