Re: [MarkLogic Dev General] tokenization

Dave Pawson Thu, 29 Jul 2010 00:39:06 -0700

On 28 July 2010 22:15, Mike Sokolov <[email protected]> wrote:
> Stress marks (Unicode code points 712 and 716, i.e. U+02C8 and U+02CC)
> seem to be treated as word separators for the purposes of tokenization.
> This makes it impossible to search for words containing them (without
> actually entering the stress marks in the query).
>
> Is there any way to avoid this?  I.e., to generate indexes that act as if
> these characters were simply not present?
>
> Suppose we were to wrap these characters in an element of some sort -
> could we cause text on either side of the element to be merged into a
> single token (as with phrase-around)?



That seems more like a kludge than a solution, Mike.
Is there no way to write the combination as a single codepoint?
This seems like a character-level issue rather than a markup one.
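
For what it's worth, here is a minimal sketch of the character-level view,
done outside MarkLogic in plain Python. It is not a MarkLogic API; the
function names and the sample string are made up for illustration. It strips
the two stress marks (as one might do on a searchable copy of the text) and
checks whether Unicode NFC normalization would fold a stress mark and its
neighbour into a single codepoint.

```python
import unicodedata

# The two stress marks from the thread: U+02C8 (decimal 712) and
# U+02CC (decimal 716). Mapping a codepoint to None deletes it.
STRESS_MARKS = {0x02C8: None, 0x02CC: None}


def strip_stress_marks(text: str) -> str:
    """Return text with both stress marks removed."""
    return text.translate(STRESS_MARKS)


def nfc_composes(text: str) -> bool:
    """True if NFC normalization shortens text, i.e. some pair composed."""
    return len(unicodedata.normalize("NFC", text)) < len(text)


# Hypothetical sample: a transcription carrying both stress marks.
sample = "\u02c8s\u025bp\u0259\u02ccre\u026at"

print(strip_stress_marks(sample))
print(nfc_composes(sample))
```

In this sketch the stress marks are spacing modifier letters rather than
combining marks, so normalization leaves them as separate codepoints;
removing them from an indexed copy is one possible workaround, assuming the
original text is kept elsewhere for display.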




-- 
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
