Re: [MarkLogic Dev General] Best way to normalize non-ASCII characters to ASCII?

David Sewell Fri, 01 Aug 2014 10:56:45 -0700

Earlier today I submitted an RFE to MarkLogic Support for non-ASCII characters in MarkLogic usernames, and I was told it's now in MarkLogic's tracking system.

The Norwegian who triggered the bug told me that Scandinavian Airlines' website won't let him use the actual spelling of his last name either, which made me feel better. :-)


David

On Fri, 1 Aug 2014, Timothy W. Cook wrote:

I didn't realize non-ASCII characters weren't allowed in usernames.  This
is a serious shortcoming in ML.  Unless it can be fixed soon this may force
developers to use a mapping approach and something unique like a UUID for
the ML user name.
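
A minimal sketch of that mapping idea, written in Python purely for illustration. The class and method names are hypothetical, and a real application would persist the mapping instead of holding it in memory. The point is that a UUID string contains only hex digits and hyphens, so it stays inside the ASCII-only user-name pattern whatever the display name looks like.

```python
import uuid


class UserDirectory:
    """Hypothetical in-memory map from display names to ASCII-only internal names."""

    def __init__(self):
        self._internal_by_display = {}

    def register(self, display_name):
        """Assign a fresh UUID-based internal name to a display name."""
        if display_name in self._internal_by_display:
            raise ValueError("display name already registered")
        internal_name = str(uuid.uuid4())  # hex digits and hyphens only
        self._internal_by_display[display_name] = internal_name
        return internal_name

    def lookup(self, display_name):
        """Return the internal name previously assigned to a display name."""
        return self._internal_by_display[display_name]


if __name__ == "__main__":
    directory = UserDirectory()
    internal = directory.register("Øystein")
    print(internal, directory.lookup("Øystein") == internal)
```

The user keeps the real spelling for display and login, and only the generated name is ever passed to the security layer.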


On Fri, Aug 1, 2014 at 2:07 PM, Michael Blakeley <[email protected]> wrote:

I don't know of a library or built-in that would handle that. But you
could write one. If you do, try to release the source.

In another direction, one fairly cheap solution might be to check the
user-name before creating the user, or try-catch the library call. If it
looks iffy or fails, apologize to the user and ask them to asciify it in
their own preferred way. That way there are no surprises when the user sees
møøse automatically (and irrevocably?) translated to moeoese or moose.
People are sometimes very sensitive about these things, so either variant
might annoy someone.

A pre-flight check could use fn:matches with the pattern from security.xsd:

  <xs:simpleType name="user-name">
    <xs:annotation>
      <xs:documentation>
      </xs:documentation>
      <xs:appinfo>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:token">
      <xs:pattern value="[a-zA-Z0-9._@-]+"/>
      <xs:minLength value="1"/>
    </xs:restriction>
  </xs:simpleType>
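
As a rough illustration of that pre-flight check, here is a minimal Python sketch of the same test. The helper name `is_valid_username` is hypothetical. One detail worth noting: XSD patterns are implicitly anchored to the whole value, while fn:matches is not, so an fn:matches version would need explicit `^` and `$`. The sketch uses `re.fullmatch` for the same effect.

```python
import re

# Pattern copied from the user-name type in security.xsd quoted above.
USER_NAME_PATTERN = re.compile(r"[a-zA-Z0-9._@-]+")


def is_valid_username(name):
    """Return True if the whole name matches the user-name pattern."""
    # fullmatch anchors at both ends, mirroring XSD pattern semantics.
    return USER_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["dsewell", "møøse", "moose", "a:b", ""]:
        print(repr(candidate), is_valid_username(candidate))
```

A name that fails the check can be handed back to the user with a request to pick an ASCII spelling, before any user-creation call is attempted.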

Longer term you could ask MarkLogic to expand that pattern to cover more
languages. It's only a matter of time before more users start wanting to
use non-ASCII scripts for usernames. I'm not sure if there's any technical
reason for the restriction. Using HTTP auth means user-id can't contain a
colon ':', but otherwise I believe anything goes. Of course browsers might
not support everything, and I'm not sure about LDAP, NTLM, etc.
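
To make the colon restriction concrete, here is a small Python sketch assuming the scheme in question is HTTP Basic authentication, where the client sends base64("user-id:password") and the receiver splits at the first colon. The function names are hypothetical, and the UTF-8 encoding is an assumption; servers and browsers do not all agree on the character set.

```python
import base64


def encode_basic(user_id, password):
    """Build an Authorization header value for HTTP Basic auth."""
    if ":" in user_id:
        # A colon in the user-id would make the credentials ambiguous.
        raise ValueError("user-id must not contain ':'")
    token = f"{user_id}:{password}".encode("utf-8")  # UTF-8 is an assumption
    return "Basic " + base64.b64encode(token).decode("ascii")


def decode_basic(header_value):
    """Recover (user_id, password) by splitting at the first colon."""
    _scheme, _, encoded = header_value.partition(" ")
    decoded = base64.b64decode(encoded).decode("utf-8")
    user_id, _, password = decoded.partition(":")
    return user_id, password


if __name__ == "__main__":
    header = encode_basic("møøse", "s3cret:with:colons")
    print(header)
    print(decode_basic(header))
```

Because the split happens at the first colon, the password may contain colons but the user-id may not.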

-- Mike

On 1 Aug 2014, at 07:05 , David Sewell <[email protected]> wrote:

We have a user-facing function that creates login names based on their
real names, using initial characters from their surname and last name to
create a MarkLogic user name. For the first time we recorded a server error
when someone registered with a name beginning with a non-ASCII character
(Norwegian Ø), because MarkLogic usernames currently cannot contain
non-ASCII characters.


So I thought the easy solution would be to use xdmp:diacritic-less().

But no, that only changes characters like ñ and é that are accented
variants of a single letter. It does not touch combined characters like Ø or
Æ.
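
The same distinction shows up in plain Unicode normalization, which may help explain the behavior. The Python sketch below is an analogy only and makes no claim about how xdmp:diacritic-less is implemented; the function name `strip_diacritics` is hypothetical. Letters such as ñ and é decompose into a base letter plus a combining mark, whereas Ø and Æ have no decomposition in Unicode, so stripping combining marks leaves them untouched.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("ñé"))  # accented variants lose their marks
    print(strip_diacritics("ØÆ"))  # no decomposition, so nothing changes
```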


Of course I could use fn:translate to catch all of the likely cases, but
is there a more general-purpose standard or extension function to perform
normalization to ASCII for accented/combined Latin characters in a
MarkLogic environment?
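
For comparison, here is what the translate-table approach might look like as a Python sketch. The table is deliberately incomplete and the replacements are assumptions (some people would prefer "OE" to "O" for Ø); the name `to_ascii` is hypothetical. Note that fn:translate maps one character to one character, so a replacement such as Æ to AE would need something beyond a single fn:translate call, while Python's `str.translate` accepts multi-character replacements.

```python
import unicodedata

# Illustrative, incomplete table for letters that have no Unicode decomposition.
# The replacement choices are assumptions, not a standard.
FALLBACKS = str.maketrans({
    "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae",
    "Ł": "L", "ł": "l",
    "ß": "ss",
})


def to_ascii(text):
    """Apply the fallback table, strip combining marks, and require pure ASCII."""
    replaced = text.translate(FALLBACKS)
    decomposed = unicodedata.normalize("NFD", replaced)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Raises UnicodeEncodeError if anything non-ASCII is left over,
    # so the caller can fall back to asking the user.
    stripped.encode("ascii")
    return stripped


if __name__ == "__main__":
    for name in ["Øystein", "Ærø", "Muñoz"]:
        print(name, "->", to_ascii(name))
```

Anything the table does not cover raises an error instead of passing through silently, which leaves room to ask the user for their own preferred spelling.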


David S.

--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville, VA 22904-4314 USA
Email: [email protected]   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general


