Re: [MarkLogic Dev General] Best way to normalize non-ASCII characters to ASCII?

Michael Blakeley Fri, 01 Aug 2014 10:08:36 -0700

I don't know of a library or built-in that would handle that. But you could 
write one. If you do, try to release the source.
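
A minimal sketch of what such a function might look like, not a tested 
library: local:asciify is a made-up name, and the replacement table is an 
assumption that only covers the Ø and Æ cases from the question. It expands 
Æ to two letters, maps Ø one-to-one, and leaves everything else to 
xdmp:diacritic-less.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper, not a MarkLogic built-in. The replacement
   table below is an assumption and far from complete. :)
declare function local:asciify($name as xs:string) as xs:string
{
  (: One character becomes two, so fn:translate cannot do this step. :)
  let $expanded := fn:replace(fn:replace($name, "Æ", "AE"), "æ", "ae")
  (: One-to-one replacements for letters xdmp:diacritic-less leaves alone. :)
  let $mapped := fn:translate($expanded, "Øø", "Oo")
  (: Strip the remaining accents (ñ, é, and so on). :)
  return xdmp:diacritic-less($mapped)
};

local:asciify("møøse")
```

With this table møøse should come out as moose. Someone who prefers moeoese 
would want "oe" in the fn:replace step instead.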

In another direction, one fairly cheap solution might be to check the user-name 
before creating the user, or try-catch the library call. If it looks iffy or 
fails, apologize to the user and ask them to asciify it in their own preferred 
way. That way there are no surprises when the user sees møøse automatically 
(and irrevocably?) translated to moeoese or moose. People are sometimes very 
sensitive about these things, so either variant might annoy someone.

A pre-flight check could use fn:matches with the pattern from security.xsd:

  <xs:simpleType name="user-name">
    <xs:annotation>
      <xs:documentation>
      </xs:documentation>
      <xs:appinfo>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:token">
      <xs:pattern value="[a-zA-Z0-9._@-]+"/>
      <xs:minLength value="1"/>
    </xs:restriction>
  </xs:simpleType>
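
As a rough sketch of that check: local:valid-user-name is a made-up helper 
name and the candidate strings are only examples. XSD patterns implicitly 
match the whole value but fn:matches does not, so the sketch adds the ^ and $ 
anchors itself.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: true if $name fits the user-name pattern
   copied from security.xsd. The anchors are added because fn:matches
   otherwise accepts a match anywhere inside the string. :)
declare function local:valid-user-name($name as xs:string) as xs:boolean
{
  fn:matches($name, "^[a-zA-Z0-9._@-]+$")
};

(: Example candidates only; the + already rejects the empty string. :)
for $candidate in ("dsewell", "møøse", "a:b", "")
return fn:concat("'", $candidate, "' valid: ",
                 local:valid-user-name($candidate))
```

If the check returns false, that is the point to go back to the user rather 
than calling the user-creation code.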

Longer term you could ask MarkLogic to expand that pattern to cover more 
languages. It's only a matter of time before more users start wanting to use 
non-ASCII scripts for usernames. I'm not sure if there's any technical reason 
for the restriction. Using HTTP auth means user-id can't contain a colon ':', 
but otherwise I believe anything goes. Of course browsers might not support 
everything, and I'm not sure about LDAP, NTLM, etc.

-- Mike

On 1 Aug 2014, at 07:05 , David Sewell <[email protected]> wrote:

> We have a user-facing function that creates login names for users based on 
> their real names, using initial characters from their first name and last 
> name to create a MarkLogic user name. For the first time we recorded a server 
> error when someone registered with a name beginning with a non-ASCII 
> character (Norwegian Ø), because currently a MarkLogic username cannot 
> contain non-ASCII characters.
> 
> So I thought the easy solution would be to use xdmp:diacritic-less(). But no, 
> that only changes characters like ñ and é that are accented variants of a 
> single letter. It does not touch combined characters like Ø or Æ.
> 
> Of course I could use fn:translate to catch all of the likely cases, but is 
> there a more general-purpose standard or extension function to perform 
> normalization to ASCII for accented/combined Latin characters in a MarkLogic 
> environment?
> 
> David S.
> 
> -- 
> David Sewell, Editorial and Technical Manager
> ROTUNDA, The University of Virginia Press
> PO Box 400314, Charlottesville, VA 22904-4314 USA
> Email: [email protected]   Tel: +1 434 924 9973
> Web: http://rotunda.upress.virginia.edu/
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general

