Thanks as always to Michael and Geert. Lots of good ideas we will pursue. I'd like to compare n-gram calculation speed with Hadoop versus other techniques. Ideally this would be a built-in indexing option for certain elements or attributes, or for the word index. It would allow for the creation of very powerful tools for text mining, and the new widget tools could be used to display results. More and more of our researchers want these derivative products and not necessarily the full corpus.
Alan

On Jan 16, 2013, at 3:05 PM, "Geert Josten" <[email protected]> wrote:

> I experimented with this. I used code to do this in the demojam at XMLPrague last year. I had an app that collected tweets and showed lots of statistics for query results. I constructed word n-grams of lengths 1, 2 and 3 for this. I calculated them at ingest time without CPF or Hadoop, though I did ingest the tweets in transactions. The n-gram calculation time for the short tweet text is very short, however. And the size impact is relatively large, but the XML for one tweet is so small that even with the n-grams it is still pretty small for MarkLogic.
>
> I pretty much did what Michael suggests. I tried using the word lexicon first, because it has built-in stemming support. But we needed something to make important, larger word combinations stand out and rank higher than the same frequency for a single word. So indeed, I decided to use a range index for this. At first I put all lengths in one index, but it was convenient for the ranking calculation to be able to make the distinction. So I created n-gram1, n-gram2 and n-gram3 elements, and indexed them 'separately' (actually, a comma-separated list of element names in one index, but you can single them out via the element name). You can still easily search the combination by supplying multiple element names as a sequence to the first parameter of the cts:element-* functions.
>
> I was able to choose where to put the n-grams myself. I stored them in the main doc fragment, which could save you a properties fragment. Storing them in the properties should work as well though.
>
> There were a few caveats though. At first I tokenized and recombined the words as they were, but it makes a lot of sense to apply stemming. Range indexes don't provide stemming support though, so you have to do some tricks yourself. I therefore applied cts:stem before inserting the values into the n-gram elements. I obviously did that on the tokenized words, not on the recombined n-grams. It worked reasonably well, considering the fact that the tweets I collected were in many languages and I was assuming English stemming for all of them. (Actually, I had some code to guess the language reasonably accurately using a few language corpora, but never got around to implementing it in my demojam code.)
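For anyone who wants to experiment, here is a minimal sketch of the ingest-time approach Geert describes: tokenize, stem (assuming English), and write the 1-, 2- and 3-grams into separate elements in the main document. The element names, sample text, document URI and the local:ngrams helper are only illustrative, not Geert's actual code.

    xquery version "1.0-ml";

    (: Join every run of $n consecutive words into a single n-gram string. :)
    declare function local:ngrams($words as xs:string*, $n as xs:integer) as xs:string*
    {
      for $i in 1 to fn:count($words) - $n + 1
      return fn:string-join($words[position() = ($i to $i + $n - 1)], " ")
    };

    (: Assumption: English stemming and illustrative element names n-gram1/2/3. :)
    let $text := "the quick brown fox jumps over the lazy dog"
    (: keep only word tokens, then stem each word before recombining :)
    let $words :=
      for $tok in cts:tokenize($text, "en")
      where $tok instance of cts:word
      return (cts:stem(fn:string($tok), "en"))[1]
    return
      xdmp:document-insert("/tweets/example.xml",
        <tweet>
          <text>{ $text }</text>
          { for $g in local:ngrams($words, 1) return <n-gram1>{ $g }</n-gram1> }
          { for $g in local:ngrams($words, 2) return <n-gram2>{ $g }</n-gram2> }
          { for $g in local:ngrams($words, 3) return <n-gram3>{ $g }</n-gram3> }
        </tweet>)

With string element range indexes configured on n-gram1, n-gram2 and n-gram3, you can then query one length or several at once, e.g. cts:element-values((xs:QName("n-gram2"), xs:QName("n-gram3")), (), "frequency-order"), which is the "sequence of element names" trick Geert mentions.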
> The second challenge for me was calculating a useful rank. You could use the plain frequency, but the problem is that word combinations are obviously much rarer, though more important. I therefore calculated a rank of my own, using a very simple formula. The rank is equal to the frequency times the square of the length of the n-gram (r = f * n * n). The downside is that this means in-memory sorting. That was another good reason for me to want separate elements for 1-, 2- and 3-grams. I take the top x of each of them, apply the arithmetic on the lot, and sort all of them. Then I take the top x again of this sorted sequence.
>
> Unfortunately, stop words were polluting my top score list. I had to pull some more tricks to derive a list of stop words and filter out such words before calculating the rank. You can derive stop words by taking the top occurring 1-grams. You need a sufficiently large corpus to get some accuracy, and you need to select a threshold carefully. I decided to rely on a relative threshold, comparing frequencies with the size of the corpus. Any word occurring in more than 10% of the corpus could well be a stop word. The fun thing about deriving the stop words from the actual contents is that the stop word list becomes domain-specific. If you applied it to the messages of this mailing list, then something like 'MarkLogic' would be filtered out as well. Do you want that? I think so. The more often a word occurs, the less useful it becomes to differentiate between docs. I understood that the relevance formula in MarkLogic works in a similar way. That is also why you don't need to worry about stop words when searching full text in MarkLogic.
>
> I could share bits of code if you like, but would have to dig them up first.
>
> Cheers,
> Geert
>
>> -----Original Message-----
>> From: [email protected] [mailto:general-[email protected]] On Behalf Of Michael Blakeley
>> Sent: Wednesday, January 16, 2013 19:28
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] n-gram calculation
>>
>> It would be great to have n-gram lexicons as a server feature. Without having that, I think it could be built. I would try to use range indexes as much as possible, to leverage their existing features.
>>
>> The design of the XML representation might need some careful thought and experiment. The value would be the n-gram itself, a sequence of words. It might be useful to have element names like ngram-1, ngram-2, etc. so that cts:element-values() can select by length.
>>
>> Likely you'll want to get n-grams for cts:query terms, so I would probably store the n-grams in the main document. This increases the document size, but it seems like a desirable trade-off. The alternative is to join using something like document URIs. If an n-gram occurs X times in the document there should be X n-gram elements, so that cts:frequency can do its job. That increases the cost, but I think it's necessary. Unless it's possible to do something clever with a user-defined lexicon function?
>>
>> Ingestion seems like the natural time to create the n-grams, possibly using a CPF pipeline. The code for extracting the words themselves could use cts:tokenize and needn't be very complex, although it might run slowly. Hadoop might make sense if you already have a large database, or if it turns out that offloading the extra compute time is a net win.
>>
>> Would stemming be desirable? If so, that makes it harder to benefit from Hadoop, but maybe not impossible.
>>
>> -- Mike
>>
>> On 15 Jan 2013, at 16:44, Alan Darnell <[email protected]> wrote:
>>
>>> I'm wondering if anyone has tried to create n-grams from a large body of XML documents stored in MarkLogic? Single-word n-grams can be derived from word lexicons. But what about 2, 3, 4, or 5 word n-grams? Are there efficient ways to do this, perhaps using Hadoop, and then storing the results back in the database so they can be used to look at frequencies over time (as, for example, the Google Books n-gram viewer allows)?
>>>
>>> Alan
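And a rough sketch of the ranking and stop-word side: the r = f * n * n formula and a relative 10% document-frequency threshold, roughly as Geert describes. It assumes the three string range indexes above and an illustrative "tweets" collection; the per-length limit of 100 and the final top 20 are arbitrary choices, not Geert's numbers.

    xquery version "1.0-ml";

    (: Assumption: string range indexes on n-gram1/2/3 and a "tweets" collection. :)
    let $corpus-size := xdmp:estimate(fn:collection("tweets"))
    (: treat any 1-gram occurring in more than 10% of the fragments as a stop word;
       frequency-order puts the likeliest stop words first, so a real implementation
       could stop iterating once frequencies drop below the threshold :)
    let $stop-words :=
      for $w in cts:element-values(xs:QName("n-gram1"), (), "frequency-order")
      where cts:frequency($w) gt 0.1 * $corpus-size
      return $w
    (: rank = frequency * n^2, computed per n-gram length, then merged and re-sorted :)
    let $ranked :=
      for $n in (1, 2, 3)
      let $qname := xs:QName(fn:concat("n-gram", fn:string($n)))
      for $v in cts:element-values($qname, (), ("frequency-order", "limit=100"))
      where fn:not($v = $stop-words)
      order by cts:frequency($v) * $n * $n descending
      return <hit n="{ $n }" rank="{ cts:frequency($v) * $n * $n }">{ $v }</hit>
    return $ranked[position() le 20]

The merged sort at the end is the in-memory step Geert warns about; keeping the lengths in separate indexes is what lets each list be cut down cheaply before sorting.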
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
