Reflecting back 3-4 years now, I'm not sure exactly what the problem 
was.  I don't *think* it was regex-related. We may have stumbled on 
the absence of a dynamic data structure like a Perl hash in which to 
store the temporary data (there was no map:map at the time).  There may 
also have been some issue performing the actual normalization: when we 
did this, MarkLogic had no diacritic-insensitivity, so we were 
processing all the accented characters ourselves and had to try 
hundreds of character transformations on every word (at least every 
word of interest).
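For what it's worth, the basic idea -- normalize each word to strip its 
diacritics, and use a hash to group the accented variants under their 
base form (the role a Perl hash played for us, and map:map could fill 
now) -- can be sketched like this. A hypothetical Python illustration, 
not our original Perl code:

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word):
    # Decompose to NFD, then drop combining marks (category Mn),
    # so "café" becomes "cafe"
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def build_thesaurus(words):
    # Map each normalized base form to the accented variants seen,
    # in a single pass over the word list
    thesaurus = defaultdict(set)
    for w in words:
        base = strip_diacritics(w)
        if base != w:
            thesaurus[base].add(w)
    return thesaurus

variants = build_thesaurus(["café", "cafe", "naïve", "résumé", "resume"])
# variants["cafe"] == {"café"}; variants["naive"] == {"naïve"}
```

This avoids trying hundreds of per-character transformations: each word 
is normalized once, and lookups against the resulting map are constant 
time.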

I suspect an efficient implementation in XQuery is possible.  We 
probably just didn't try hard enough :)

-Mike

On 08/30/2010 09:31 AM, David Sewell wrote:
> On Sun, 29 Aug 2010, Michael Sokolov wrote:
>
>> We have used the approach of generating a thesaurus using a preprocessor that
>> finds all words with the set of characters we're interested in.  The results
>> are good in our use cases, but I do think it will be data-dependent since if
>> you have a large number of expansions in the thesaurus for a term, that would
>> probably be less performant. Also, (Jason you may remember this from a few
>> years back) implementing an efficient character-finding thesaurus
>> generator in xquery is quite difficult; we ended up writing that part
>> in Perl.
> What made this difficult using XQuery? The less rich implementation of
> regular expressions compared to Perl, or something else?
>
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
