Re: [MarkLogic Dev General] unicode character class 'Letter'

Mike Sokolov Fri, 04 Jun 2010 08:11:47 -0700

Thanks, Mary: glad to know you're on top of it!

-Mike


On 06/03/2010 10:39 PM, Mary Holstege wrote:
> Quite right you are.  There were some errors in our Unicode tables,
> which were also built against an older version of the Unicode standard.
> As of MLS 4.2: matches("&#x2cc;","\p{L}") =>  true()
>
> //Mary
>
> On Thu, 03 Jun 2010 17:03:23 -0700, Michael Sokolov <[email protected]> wrote:
>
>    
>> I ran across an anomaly in MarkLogic this week while trying to evaluate a
>> regular expression replacement using the Letter class:
>>
>> replace ($string, "\P{L}", "")
>>
>> Some characters which are classed as letters AFAICT according to Unicode are
>> not treated as letters by MarkLogic.  For example, &#x2cc;, "MODIFIER LETTER
>> LOW VERTICAL LINE", is treated as a non-letter.
>>
>> This link spells out the details:
>> http://www.fileformat.info/info/unicode/char/02cc/index.htm
>>
>> I wouldn't even have noticed if it weren't for the fact that Saxon did
>> something different from ML - and I think Java would do the same (based on
>> the evidence on the link above, I haven't tested myself) - in Saxon I had to
>> use the "modifier letter" class: \P{Lm} to remove these characters.
>>
>> I have to say, it doesn't look like a letter to me (it's a little line - a
>> stress marker): MarkLogic performed as I was expecting, at first, but that's
>> only because I am not a walking Unicode standard.  I think I'd prefer it if
>> ML adhered closely to the UC standard in cases like this, even if it's
>> counterintuitive, if only so that it would behave the same as other
>> standards-compliant software.
>>
>> -Mike
>>
>>      
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
>    
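The quoted message guesses that Java treats U+02CC the same way Saxon does but notes this was not tested. A minimal sketch of how that could be checked with the standard java.util.regex classes follows; the class and method names (ModifierLetterCheck, isLetter, isModifierLetter, stripNonLetters) are hypothetical and chosen only for illustration, and the sketch says nothing about MarkLogic's or Saxon's own regex engines.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any product discussed in the thread.
public class ModifierLetterCheck {
    // U+02CC MODIFIER LETTER LOW VERTICAL LINE, the character from the thread.
    static final String LOW_VERTICAL_LINE = "\u02CC";

    // True if the whole string is a single character in general category L (any letter).
    static boolean isLetter(String s) {
        return Pattern.matches("\\p{L}", s);
    }

    // True if the whole string is a single character in subcategory Lm (modifier letter).
    static boolean isModifierLetter(String s) {
        return Pattern.matches("\\p{Lm}", s);
    }

    // Java counterpart of the thread's replace($string, "\P{L}", ""):
    // removes every character that is not a letter.
    static String stripNonLetters(String s) {
        return s.replaceAll("\\P{L}", "");
    }

    public static void main(String[] args) {
        System.out.println(isLetter(LOW_VERTICAL_LINE));
        System.out.println(isModifierLetter(LOW_VERTICAL_LINE));
        // Character.getType reports the Unicode general category known to the JDK.
        System.out.println(Character.getType('\u02CC') == Character.MODIFIER_LETTER);
        System.out.println(stripNonLetters("a" + LOW_VERTICAL_LINE + "b-1"));
    }
}
```

Because Lm is a subcategory of L, a replacement on \P{L} leaves U+02CC in place on an engine that follows the Unicode tables, which matches the Saxon behavior described in the quoted message.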