Thanks Jim, these are good ideas if I need to extend the mappings myself because MarkLogic doesn't provide a more direct interface.

I would still love to hear from someone at ML how to retrieve the list of mappings that are provided by default, though.

--Liza

James A. Robinson wrote:
Is it possible to get a list of all mappings between characters with diacriticals and their "flattened" ASCII equivalents?

Similarly, is there a way to extend or modify this mapping in the current version of MarkLogic?

I'm new to MLS, so I don't know if there is a way to do it there.
Hopefully there is; I think it'd be a really nice feature with regard to
content management (e.g., being able to build pages friendly to older
browsers w/o having to go through a lot of external processing steps).

Failing that, in theory I think one could do this using either Java by
itself or using Java paired with XSLT or perhaps XQuery.

  http://www.w3.org/TR/xslt20/#element-output
  http://en.wikipedia.org/wiki/Unicode_normalization
  http://unicode.org/reports/tr15/#Decomposition

What I'm thinking you could do is create a map of the characters you
want to flatten and process them with something which makes two copies
of each special character in attribute fields, outputting everything
in NFD (decomposed) form.  You then postprocess that with a program
operating on the byte level which can strip out the 'diacritic' characters
from one of the fields, leaving you with a mapping of the accented
character to a flattened version.
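
For illustration, here is a rough sketch of the same decompose-then-strip
idea done in a single step.  It is in Python rather than Java or XSLT, the
names flatten and build_map and the default code point range are made up
for this example, and it is not necessarily the mapping MarkLogic applies
internally.  Characters with no canonical decomposition (such as "ø" or
"ß") are left out of the map.

```python
import unicodedata

def flatten(text):
    """Decompose to NFD, then drop the combining marks (the 'diacritics')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_map(first=0x00C0, last=0x017F):
    """Hypothetical helper: map each code point in [first, last] to its
    flattened form, keeping only entries that change and end up pure ASCII."""
    mapping = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        flat = flatten(ch)
        if flat and flat != ch and flat.isascii():
            mapping[ch] = flat
    return mapping

if __name__ == "__main__":
    # Print the table as "accented<TAB>flattened" pairs.
    for accented, flat in sorted(build_map().items()):
        print(f"{accented}\t{flat}")
```

The default range only covers Latin-1 Supplement and Latin Extended-A; a
wider range can be passed in if more characters are needed.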

Attached below are two files, pmap.xsl and pmap.xml.  The .xml was built
by running Saxon against the .xsl file.  It should, I think, be possible
to then use Java (or C or C++ I suppose) to read in the .xml file using
a byte stream, and upon hitting the 'flattened' attribute, start dropping
bytes which aren't in the range 1-127, until you hit the quote.  It may or
may not be possible to do that in a nicer fashion using a real XML parser.
I suspect most parsers will consume the NFD form and turn it into UTF-16
or whatever they use to internally represent the characters.
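
A minimal sketch of that byte-level pass, again in Python for illustration.
It assumes the file is UTF-8 (so every combining mark is encoded as bytes
above 127) and that the attribute is written as flattened="..." with double
quotes.  The function name strip_flattened and the element and attribute
names in the sample are made up here, since the layout of pmap.xml is not
shown in this message.

```python
import unicodedata

def strip_flattened(data: bytes, attr: bytes = b'flattened="') -> bytes:
    """Copy data through unchanged, except inside the given attribute value,
    where only bytes in the range 1-127 are kept."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data.startswith(attr, i):
            out += attr
            i += len(attr)
            # Inside the attribute value: drop non-ASCII bytes until the
            # closing double quote (0x22), which the outer loop then copies.
            while i < len(data) and data[i] != 0x22:
                if 1 <= data[i] <= 127:
                    out.append(data[i])
                i += 1
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

if __name__ == "__main__":
    # Hypothetical sample entry, normalized to NFD before encoding.
    sample = unicodedata.normalize("NFD", '<m char="é" flattened="é"/>')
    print(strip_flattened(sample.encode("utf-8")).decode("utf-8"))
```

Only the flattened attribute is touched; the other copy keeps its
decomposed accent, which gives the accented-to-flattened pairing.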


Jim



------------------------------------------------------------------------


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
James A. Robinson                       [EMAIL PROTECTED]
Stanford University HighWire Press      http://highwire.stanford.edu/
+1 650 7237294 (Work)                   +1 650 7259335 (Fax)


------------------------------------------------------------------------

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general