Stress marks (Unicode code points 712 and 716, i.e. U+02C8 and U+02CC) 
seem to be treated as word separators for the purposes of tokenization.  
This makes it impossible to search for words containing them (without 
actually entering the stress marks in the query).

Is there any way to avoid this?  I.e., to generate indexes that act as 
if these characters were simply not present?
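To make the desired behavior concrete: one workaround (outside the indexer itself, and purely a sketch of the idea, not a MarkLogic feature) would be to strip those two code points from the text before it is indexed, so that the pieces on either side of a stress mark tokenize as one word. A minimal illustration in Python:

```python
# Hypothetical preprocessing workaround: remove the stress marks
# (U+02C8 and U+02CC, decimal 712 and 716) before indexing, so the
# surrounding characters fall into a single token.
STRESS_MARKS = {"\u02c8", "\u02cc"}

def strip_stress_marks(text: str) -> str:
    # Drop only the stress marks; all other characters pass through.
    return "".join(ch for ch in text if ch not in STRESS_MARKS)

print(strip_stress_marks("\u02c8wa\u02cct\u0259r"))  # stress marks removed
```

The open question is whether the same effect can be had at index time without altering the stored document content.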

Suppose we were to wrap these characters in an element of some sort - 
could we cause text on either side of the element to be merged into a 
single token (as with phrase-around)?

-Mike
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
