Yes, it's a character issue. Unfortunately, these really aren't combining marks: they are separate characters (that look like apostrophes) intended to indicate stress in pronunciation (e.g. re'bate), but we want to ignore them for the purposes of search.
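To make "ignore them for the purposes of search" concrete, here is a minimal sketch in Python of the normalization I have in mind. It is not MarkLogic code and says nothing about how the server tokenizes. It assumes the two stress marks are U+02C8 and U+02CC (decimal 712 and 716), and the helper names `strip_stress` and `search_tokens` are made up for illustration.

```python
import re
from typing import List

# Assumed code points for the stress marks discussed in this thread:
# U+02C8 MODIFIER LETTER VERTICAL LINE (primary stress, decimal 712)
# U+02CC MODIFIER LETTER LOW VERTICAL LINE (secondary stress, decimal 716)
STRESS_MARKS = "\u02c8\u02cc"

# Translation table that deletes both stress marks.
_STRIP_TABLE = str.maketrans("", "", STRESS_MARKS)


def strip_stress(text: str) -> str:
    """Return text with the stress marks removed (hypothetical helper)."""
    return text.translate(_STRIP_TABLE)


def search_tokens(text: str) -> List[str]:
    """Lower-cased word tokens, computed as if the stress marks were absent."""
    return re.findall(r"\w+", strip_stress(text).lower())
```

With this, `search_tokens("A re\u02c8bate")` yields the single token `rebate` for the marked word, which is the behaviour we would like the index to have without touching the documents.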
The best idea I have now is to mark up as: <w display="re'bate">rebate</w> so we search the text and display the attribute, but I was hoping to find a solution that didn't rely on changing the documents, if possible. A long shot, but you never know :)

-Mike

On 07/29/2010 03:38 AM, Dave Pawson wrote:
> On 28 July 2010 22:15, Mike Sokolov<[email protected]> wrote:
>
>> Stress marks (UTF8 712 and 716) seem to be treated as word-separators
>> for the purposes of tokenization. This makes it impossible to search
>> for words containing them (without actually entering the stress marks in
>> the query).
>>
>> Is there any way to avoid this? Ie to generate indexes that act as if
>> these characters were simply not present?
>>
>> Suppose we were to wrap these characters in an element of some sort -
>> could we cause text on either side of the element to be merged into a
>> single token (as with phrase-around)?
>>
>
> Seems more like a kludge than a solution Mike?
> Is there no way to write the combination as a single codepoint?
> This seems like a character level issue rather than markup?

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
