RE: [MarkLogic Dev General] Unicode flattening for non combined characters.

Ian Small Mon, 21 Jul 2008 10:20:52 -0700

Another approach (a bit icky, though) would be to create a normalized form of 
the content during the ingestion process, search the normalized form, but then 
return the "original" form for display purposes.


This would be workable if all you were worried about was, for instance, 
<author> elements.  Not so much if you're trying to do general word-queries for 
the kind of phrases you use in your example.
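
To make the idea concrete, here is a rough sketch of what the ingestion-time
normalization could look like. It is only an illustration: local:fold and the
<display>/<search> element names are made up for this example and are not
MarkLogic built-ins, and the character list in fn:translate is deliberately
incomplete. It relies only on the standard fn:normalize-unicode, fn:replace,
fn:translate and fn:lower-case functions.

```xquery
(: Hypothetical helper: fold a string to a plain, lower-case search form. :)
declare function local:fold($s as xs:string) as xs:string
{
  (: 1. Decompose, then drop combining marks, so "é" becomes "e". :)
  let $stripped :=
    fn:replace(fn:normalize-unicode($s, "NFD"), "\p{Mn}", "")
  (: 2. Letters such as Ø and Ł have no decomposition, so step 1 leaves
        them untouched; map them by hand. Illustrative list only. :)
  let $mapped := fn:translate($stripped, "ØøŁł", "OoLl")
  return fn:lower-case($mapped)
};

(: Store both forms: query against <search>, show <display> to the user. :)
for $name in ("Jacob Ørn", "Łodz", "José")
return
  <author>
    <display>{ $name }</display>
    <search>{ local:fold($name) }</search>
  </author>
```

With this sketch, "Jacob Ørn" would be stored with a search form of
"jacob orn", so a query for "orn" run against the <search> element should
match while the original spelling is still available for display.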

Out of curiosity (not that I think this will help you, but it could help us 
address this problem in the future), is there a collation that matches the 
customer's requirements?

Cheers
ian 

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Danny Sokolsky
> Sent: Monday, July 21, 2008 10:14 AM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] Unicode flattening for 
> non combined characters.
> 
> Hi Peter,
> 
> I can think of a few ways to do this.  One idea is to use a 
> thesaurus and just add all of the terms to the thesaurus as 
> expansions of the terms with the funny characters.  It might 
> be hard to know all your terms before the search.
> 
> Another way is to just parse your search string for the 
> offending characters and change the search to an or-query of 
> the original term and the term with the replaced character.  
> I think this should work OK as long as there are not a huge 
> number of replaced characters, and as long as the search 
> strings are not very large.  Here is a hacky example of what 
> I mean--if you have a search parser already, something like 
> this would be relatively easy to add I think.
> 
> let $search := "Jacob Ørn"
> let $searchTokens := fn:tokenize($search, " ")
> let $replacedTokens :=
>   for $token in $searchTokens
>   return if ( fn:contains($token, "Ø") )
>          then ( fn:replace($token, "Ø", "O") )
>          else ()
> return
>   cts:or-query((
>        cts:and-query((
>            for $tok in $searchTokens return cts:word-query($tok) )),
>        cts:or-query((
>            for $orTok in $replacedTokens return cts:word-query($orTok) ))
>       ))
> 
> I am sure there are other ways as well.  Hope this helps.
> 
> -Danny
> 
> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of 
> Peter Hickman
> Sent: Monday, July 21, 2008 6:52 AM
> To: MarkLogic ML
> Subject: [MarkLogic Dev General] Unicode flattening for non 
> combined characters.
> 
> Our client has data such as "Jacob Ørn" that they want to search for. 
> They are expecting that searching for "orn" would match "Ørn" 
> as they see "Ø" as an accented character. According to the 
> Unicode Standard 4.0 (always a good read :)) U+00D8 "Latin 
> Capital Letter O With Stroke" is not a combined character (it 
> has no decomposition into "O" plus a combining mark) and 
> therefore is not matched by "O" when doing a case- and 
> diacritic-insensitive search. This is what I expect and 
> understand as a developer.
> 
> Is there some way of getting the client's expected behaviour? 
> I suspect that the "Ø" is only one of several characters that 
> have this problem, such as the "Ł" (U+0141) in "Łodz".
> 
> --
> Peter Hickman.
> 
> Semantico, Lees House, 21-23 Dyke Road, Brighton BN1 3FE
> t: 01273 358223
> f: 01273 723232
> e: [EMAIL PROTECTED]
> w: www.semantico.com
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
> 
