Reflecting back 3-4 years now, I'm not sure what the problem was exactly. I don't *think* it was regex-related. We may have stumbled on the absence of a dynamic data structure like a Perl hash in which to store the temporary data (there was no map:map at the time). There may also have been some issue performing the actual normalization: when we did this, MarkLogic had no diacritic-insensitivity, so we were processing all the accented characters ourselves and had to try hundreds of character transformations on every word (or at least every word of interest).
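The Perl preprocessor itself isn't shown here, but the core idea -- fold away combining marks and map each folded form back to its accented variants -- can be sketched roughly as follows. This is Python rather than the original Perl, and the function names and word list are purely illustrative:

```python
import unicodedata
from collections import defaultdict

def fold_diacritics(word):
    """Strip combining marks via NFD decomposition (e.g. 'café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_thesaurus(words):
    """Map each folded form to the set of accented variants seen in the corpus."""
    thesaurus = defaultdict(set)
    for w in words:
        folded = fold_diacritics(w)
        if folded != w:  # only keep words that actually carry diacritics
            thesaurus[folded].add(w)
    return dict(thesaurus)

corpus = ["café", "cafe", "résumé", "naïve", "plain"]
print(build_thesaurus(corpus))
# {'cafe': {'café'}, 'resume': {'résumé'}, 'naive': {'naïve'}}
```

With a dict/hash available, this is a single pass over the corpus; the difficulty we hit in XQuery at the time was exactly the lack of such a mutable map.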
I suspect an efficient implementation in XQuery is possible. We probably just didn't try hard enough :)

-Mike

On 08/30/2010 09:31 AM, David Sewell wrote:
> On Sun, 29 Aug 2010, Michael Sokolov wrote:
>
>> We have used the approach of generating a thesaurus using a preprocessor that
>> finds all words with the set of characters we're interested in. The results
>> are good in our use cases, but I do think it will be data-dependent, since if
>> you have a large number of expansions in the thesaurus for a term, that would
>> probably be less performant. Also (Jason, you may remember this from a few
>> years back), implementing an efficient character-finding thesaurus generator
>> in XQuery is quite difficult; we ended up writing that part in Perl.
>
> What made this difficult using XQuery? The less rich implementation of
> regular expressions compared to Perl, or something else?

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
