[Boston.pm] wanted: perl code to do JAXB name mapping (LONG)

Tolkin, Steve Wed, 04 Dec 2002 07:04:30 -0800

Summary: I am looking for a program to do name mapping
as specified in Appendix C of the JAXB (Java Architecture for XML
Binding) spec.  For example, it maps foo_bar to fooBar.
Although the spec talks about Java and XML names, this
mapping applies to many other programming languages too.
In particular, databases typically use the underscore
character as the word separator, so this program
would be very useful for that translation.
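
To make the simplest case concrete, here is a minimal sketch of the
foo_bar to fooBar direction.  It is in Python rather than Perl, and the
function name underscore_to_camel is hypothetical (it is not from the
JAXB spec).  It handles only the underscore separator and leaves the
case of every other letter alone, so an all-caps database name like
FOO_BAR would come out as FOOBAR.

```python
def underscore_to_camel(name):
    """Map an underscore-separated name such as foo_bar to fooBar."""
    # Drop empty pieces produced by leading, trailing, or doubled underscores.
    words = [w for w in name.split("_") if w]
    if not words:
        return ""
    # Keep the first word as written; upper-case only the first
    # character of each later word.
    return words[0] + "".join(w[0].upper() + w[1:] for w in words[1:])


print(underscore_to_camel("foo_bar"))            # fooBar
print(underscore_to_camel("customer_order_id"))  # customerOrderId
```

The full algorithm quoted below recognizes more separators and also
breaks on case and digit boundaries.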


Note the careful rule that places a word break in front of an
upper-case letter that is followed by a lower-case letter,
e.g. FOOBar becomes FOO_BAR in the mapping to a constant.
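
That rule can be sketched with two zero-width regular expressions.
This is a Python sketch under simplifying assumptions: ASCII letters
only, no digit or punctuation handling (so Answer42 would not get the
underscore the spec calls for), and the name camel_to_constant is made
up for illustration.  The same lookbehind/lookahead pattern should port
to Perl more or less directly.

```python
import re

# Two places where a word break is inserted:
#   1. after a lower-case letter when an upper-case letter follows (foo|Bar)
#   2. between two upper-case letters when the second one is followed
#      by a lower-case letter (FOO|Bar)
_WORD_BREAK = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def camel_to_constant(name):
    """Map a mixed-case name to an upper-case, underscore-separated one."""
    return _WORD_BREAK.sub("_", name).upper()


print(camel_to_constant("FOOBar"))         # FOO_BAR
print(camel_to_constant("mixedCaseName"))  # MIXED_CASE_NAME
```

Without the second alternative, an acronym followed by a capitalized
word (FOOBar) would stay glued together as FOOBAR.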


Details:
$Id: jaxb_name_mapping.txt 1.3 2002/12/04 14:51:06 A071046 Exp $

[I quote from the following document, downloadable from Sun.  I only
quoted the first part of Appendix C - mapping XML name to Java
Identifier.  I also want a program to do the reverse mapping.  It was
in file jaxb-0_7-prd-spec.pdf.  After copying the text and pasting it
as plain ASCII I had to slightly edit this file, e.g. to align the
tables using spaces, add newlines, etc.  I lost many of the bullets in
the original and did not manually add them all back.]

<quote from = "
The Java(TM) Architecture for XML Binding (JAXB) Public Draft, V0.7
September 12, 2002 ">


C.1 Overview

This section provides default mappings from:

XML Name to Java identifier

Model group to Java identifier

Namespace URI to Java package name

C.2 The Name to Identifier Mapping Algorithm

Java identifiers typically follow three simple, well-known
conventions:

Class and interface names always begin with an upper-case letter. The
remaining characters are either digits, lower-case letters, or
upper-case letters. Upper-case letters within a multi-word name serve
to identify the start of each non-initial word, or sometimes to stand
for acronyms.

Method names and components of a package name always begin with a
lower-case letter, and otherwise are exactly like class and interface
names.

Constant names are entirely in upper case, with each pair of words
separated by the underscore character ('_', \u005F, LOW LINE).

XML names, however, are much richer than Java identifiers: They may
include not only the standard Java identifier characters but also
various punctuation and special characters that are not permitted in
Java identifiers. Like most Java identifiers, most XML names are in
practice composed of more than one natural-language word. Non-initial
words within an XML name typically start with an upper-case letter
followed by a lower-case letter, as in Java, or are prefixed by
punctuation characters, which is not usual in Java and, for most
punctuation characters, is in fact illegal.

In order to map an arbitrary XML name into a Java class, method, or
constant identifier, the XML name is first broken into a word
list. For the purpose of constructing word lists from XML names we use
the following definitions:

A punctuation character is one of the following:

* A hyphen ('-', \u002D, HYPHEN-MINUS),
* A period ('.', \u002E, FULL STOP),
* A colon (':', \u003A, COLON),
* An underscore ('_', \u005F, LOW LINE),
* A dot ('·', \u00B7, MIDDLE DOT),
* \u0387, GREEK ANO TELEIA,
* \u06DD, ARABIC END OF AYAH, or
* \u06DE, ARABIC START OF RUB EL HIZB.

These are all legal characters in XML names.

A letter is a character for which the Character.isLetter method
returns true, i.e., a letter according to the Unicode standard. Every
letter is a legal Java identifier character, both initial and
non-initial.

A digit is a character for which the Character.isDigit method returns
true, i.e., a digit according to the Unicode Standard. Every digit is
a legal non-initial Java identifier character.

A mark is a character that is in none of the previous categories but
for which the Character.isJavaIdentifierPart method returns true. This
category includes numeric letters, combining marks, non-spacing marks,
and ignorable control characters.

Every XML name character falls into one of the above categories. We
further divide letters into three subcategories:

An upper-case letter is a letter for which the Character.isUpperCase
method returns true,

A lower-case letter is a letter for which the Character.isLowerCase
method returns true, and

All other letters are uncased.

An XML name is split into a word list by removing any leading and
trailing punctuation characters and then searching for word breaks. A
word break is defined by three regular expressions: A prefix, a
separator, and a suffix. The prefix matches part of the word that
precedes the break, the separator is not part of any word, and the
suffix matches part of the word that follows the break. The word
breaks are defined as:


Table 3-1 XML Word Breaks

Prefix   Separator Suffix      Example

[^punct] punct+    [^punct]    foo|--|bar
digit              [^digit]    foo22|bar
[^digit]           digit       foo|22
lower              [^lower]    foo|Bar
upper              upper lower FOO|Bar
letter             [^letter]   Foo|\u2160
[^letter]          letter      \u2160|Foo

(The character \u2160 is ROMAN NUMERAL ONE, a numeric letter.)

After splitting, if a word begins with a lower-case character then its
first character is converted to upper case. The final result is a word
list in which each word is either

* A string of upper- and lower-case letters, the first character of
which is upper case,

* A string of digits, or

* A string of uncased letters and marks.

Given an XML name in word-list form, each of the three types of Java
identifiers is constructed as follows:

* A class or interface identifier is constructed by concatenating the
words in the list,

* A method identifier is constructed by concatenating the words in the
list.  A prefix verb (get, set, etc.) is prepended to the result.

* A constant identifier is constructed by converting each word in the
list to upper case; the words are then concatenated, separated by
underscores.

This algorithm will not change an XML name that is already a legal and
conventional Java class, method, or constant identifier, except
perhaps to add an initial verb in the case of a property access
method.

Example
Table 3-2 XML Names and Java Class, Method, and Constant Names

XML Name          Class Name      Method Name        Constant Name
mixedCaseName     MixedCaseName   getMixedCaseName   MIXED_CASE_NAME
Answer42          Answer42        getAnswer42        ANSWER_42
name-with-dashes  NameWithDashes  getNameWithDashes  NAME_WITH_DASHES
other_punct-chars OtherPunctChars getOtherPunctChars OTHER_PUNCT_CHARS

C.2.1 Collisions and conflicts

It is possible that the name-mapping algorithm will map two distinct
XML names to the same word list. This will result in a collision if,
and only if, the same Java identifier is constructed from the word
list and is used to name two distinct generated classes or two
distinct methods or constants in the same generated class. Collisions
are not permitted by the binding compiler and are reported as errors;
they may be repaired by revising XML name within the source schema or
by specifying a customized binding that maps one of the two XML names
to an alternative Java identifier.

Method names are forbidden to conflict with Java keywords or literals,
with methods declared in java.lang.Object, or with methods declared in
the binding-framework classes. Such conflicts are reported as errors
and may be repaired by revising the appropriate schema.
</quote>
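
To show the kind of program I mean, here is a rough Python sketch of
the word-list algorithm quoted above; it is not Perl and not a
reference implementation.  Assumptions: Unicode general categories
stand in for the Java Character methods (Nd for isDigit, Lu and Ll for
upper and lower case, any other L* category for uncased letters, and
everything else is treated as a mark), and all the function names are
my own.  Collision and keyword checks from C.2.1 are not attempted.

```python
import unicodedata

# The eight punctuation characters listed in the quoted definition.
PUNCT = "-.:_\u00b7\u0387\u06dd\u06de"
LETTERS = {"upper", "lower", "uncased"}


def _kind(ch):
    """Classify one character (approximation of the Java Character tests)."""
    if ch in PUNCT:
        return "punct"
    cat = unicodedata.category(ch)
    if cat == "Nd":
        return "digit"
    if cat == "Lu":
        return "upper"
    if cat == "Ll":
        return "lower"
    if cat.startswith("L"):
        return "uncased"
    return "mark"


def split_words(name):
    """Split an XML name into a word list following Table 3-1."""
    kinds = [_kind(c) for c in name]
    words, current = [], ""
    for i, ch in enumerate(name):
        k = kinds[i]
        if k == "punct":
            # Punctuation separates words and is never part of one.
            if current:
                words.append(current)
                current = ""
            continue
        if current:
            prev = kinds[i - 1]
            nxt = kinds[i + 1] if i + 1 < len(name) else None
            if ((prev == "digit") != (k == "digit")            # foo22|bar, foo|22
                    or (prev == "lower" and k != "lower")      # foo|Bar
                    or (prev == "upper" and k == "upper"
                        and nxt == "lower")                    # FOO|Bar
                    or (prev in LETTERS) != (k in LETTERS)):   # Foo|\u2160
                words.append(current)
                current = ""
        current += ch
    if current:
        words.append(current)
    # A word that begins with a lower-case letter gets it upper-cased.
    return [w[0].upper() + w[1:] if _kind(w[0]) == "lower" else w
            for w in words]


def class_name(name):
    return "".join(split_words(name))


def method_name(name, verb="get"):
    return verb + "".join(split_words(name))


def constant_name(name):
    return "_".join(w.upper() for w in split_words(name))


for xml in ("mixedCaseName", "Answer42", "name-with-dashes",
            "other_punct-chars"):
    print(xml, class_name(xml), method_name(xml), constant_name(xml))
```

Run as is, the loop prints the four rows of Table 3-2.  Leading and
trailing punctuation needs no separate step here, because a
punctuation run only ever closes the current word.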
 
Hopefully helpfully yours,
Steve
-- 
Steven Tolkin          [EMAIL PROTECTED]      617-563-0516 
Fidelity Investments   82 Devonshire St. V8D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates.
_______________________________________________
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm
