Re: [MarkLogic Dev General] tokenization

Mike Sokolov Thu, 29 Jul 2010 06:27:26 -0700

Yes, it's a character issue. Unfortunately these aren't really combining 
marks: they are separate characters (that look like apostrophes) 
intended to indicate stress in pronunciation (e.g. re'bate), but we want 
to ignore them for the purposes of search.
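
For anyone who wants to check the "not combining marks" point, here is a
minimal sketch in Python (not MarkLogic code; the helper names are made up
for illustration). It assumes the two characters are the code points 712
and 716 mentioned in the original post, i.e. U+02C8 and U+02CC, and
compares them with a real combining mark, U+0301:

```python
import unicodedata

PRIMARY_STRESS = "\u02c8"    # decimal 712
SECONDARY_STRESS = "\u02cc"  # decimal 716
COMBINING_ACUTE = "\u0301"   # a true combining mark, for comparison


def describe(ch):
    """Return (name, general category, canonical combining class)."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))


def composes_away(text):
    """True if NFC normalization changes the string, i.e. something was composed."""
    return unicodedata.normalize("NFC", text) != text


for ch in (PRIMARY_STRESS, SECONDARY_STRESS, COMBINING_ACUTE):
    print("U+%04X" % ord(ch), describe(ch))

# The stress mark survives normalization; the combining accent does not.
print(composes_away("re" + PRIMARY_STRESS + "bate"))
print(composes_away("e" + COMBINING_ACUTE))
```

The stress marks are modifier letters with combining class 0, so 
normalization leaves them in place and there is no single precomposed 
codepoint to fall back on.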


The best idea I have now is to mark the words up as:

<w display="re'bate">rebate</w>

so we search the text and display the attribute, but I was hoping to 
find a solution that didn't rely on changing the documents, if 
possible.  A long shot, but you never know :)
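
As a rough sketch of what that preprocessing could look like (plain 
Python run outside MarkLogic; the function names strip_stress and mark_up 
are hypothetical, and the stress marks are assumed to be U+02C8 and 
U+02CC):

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed stress marks: U+02C8 (decimal 712) and U+02CC (decimal 716).
STRESS_MARKS = "\u02c8\u02cc"
_DELETE_STRESS = str.maketrans("", "", STRESS_MARKS)
# A "word" here is a run of word characters and/or stress marks.
_WORD = re.compile("([\\w" + STRESS_MARKS + "]+)")


def strip_stress(text):
    """Remove the stress marks, leaving everything else untouched."""
    return text.translate(_DELETE_STRESS)


def mark_up(text):
    """Wrap each stressed word as <w display="original">plain</w>."""
    out = []
    for piece in _WORD.split(text):
        plain = strip_stress(piece)
        if plain and plain != piece:
            # Searchable text goes in the element, display form in the attribute.
            out.append("<w display=%s>%s</w>" % (quoteattr(piece), escape(plain)))
        else:
            # Unstressed words, punctuation and whitespace are only escaped.
            out.append(escape(piece))
    return "".join(out)


print(mark_up("a re\u02c8bate & more"))
```

The same strip_stress step could be applied to incoming query strings, so 
that users who do type the stress marks still match the plain text.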

-Mike

On 07/29/2010 03:38 AM, Dave Pawson wrote:
> On 28 July 2010 22:15, Mike Sokolov<[email protected]>  wrote:
>    
>> Stress marks (Unicode code points 712 and 716, i.e. U+02C8 and U+02CC)
>> seem to be treated as word-separators
>> for the purposes of tokenization.  This makes it impossible to search
>> for words containing them (without actually entering the stress marks in
>> the query).
>>
>> Is there any way to avoid this?  I.e., to generate indexes that act as
>> if these characters were simply not present?
>>
>> Suppose we were to wrap these characters in an element of some sort -
>> could we cause text on either side of the element to be merged into a
>> single token (as with phrase-around)?
>>      
>
> Seems more like a kludge than a solution Mike?
> Is there no way to write the combination as a single codepoint?
> This seems like a character level issue rather than markup?
>
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
