Hi,

I am asking the ASN.1 community to clarify the precise definition
of TeletexString (T61String).

The prevailing opinion on the TeletexString (T61String) ASN.1 type
is that it is mostly obsolete and should not be used in new
ASN.1 specifications. However, this does not preclude one
from seeking a precise definition of the encoding, if only
for historical purposes.

Here is what I know so far. The bottom of this email contains
some questions. I kindly ask people responsible for developing
ASN.1 compilers and actual ASN.1-based protocols to comment
on this and present their own views on this topic.

To be clear: I am perfectly aware that a sizable volume of
software in the world treats TeletexString (T61String) as a
simple 8-bit string with mostly Windows Latin 1 (a superset of
ISO-8859-1) encoding. However, this particular quest is for
a proper, precise and standards-based definition of TeletexString.

Here is what I have:

1. The TeletexString (T61String) has its roots in the T.61 encoding,
but is no longer defined as being T.61 based. In addition to that,
the T.61 standard has been withdrawn by ITU-T:
http://www.itu.int/rec/T-REC-T.61

2. The ASN.1 standard (X.680) specifies TeletexString (T61String)
as a combination of the character sets specified by the registration
numbers listed in ISO International Register of Coded Character Sets
to be used with Escape Sequences (ISO-2375):
6, 87, 102, 103, 106, 107, 126, 144, 150, 153, 156, 164, 165, 168,
plus SPACE and DELETE characters.
In addition to that, the X.680 Table 6 NOTE 2 allows using register
entries 6 and 156 instead of 102 and 103.

3. The ISO Register itself is available at
   http://www.itscj.ipsj.or.jp/ISO-IR/

4. The following are excerpts from the relevant documents
found through the ISO Register.

Reg.#6 is ASCII.
        Escapes into:
                G0: ESC 2/8 4/2 ("(B")
                G1: ESC 2/9 4/2 (")B")
        The range is [0x21 .. 0x7e]. Conversion into Unicode
        is simple, because the correspondence is one-to-one.
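
        For readers less used to the column/row notation of the
        escape sequences in this list, here is a minimal Python
        sketch that turns it into raw bytes. The helper name
        colrow_to_bytes is made up for this illustration; the only
        assumption is the usual rule that "c/r" means the byte
        value c*16 + r.

```python
ESC = 0x1B

def colrow_to_bytes(notation: str) -> bytes:
    """Convert column/row notation such as 'ESC 2/8 4/2' into bytes."""
    out = bytearray()
    for token in notation.split():
        if token == "ESC":
            out.append(ESC)
        else:
            col, row = token.split("/")
            out.append(int(col) * 16 + int(row))   # c/r -> c*16 + r
    return bytes(out)

# Designation of Reg.#6 into G0 and G1, as listed above.
print(colrow_to_bytes("ESC 2/8 4/2"))
print(colrow_to_bytes("ESC 2/9 4/2"))
```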

Reg.#87 is a "Japanese Graphic Character Set for Information Interchange".
        It is a multiple-byte set of 6877 characters.
        The character set is JIS X 0208-1983
        (originally JIS C 6226-1983).
        Escapes into:
                G0: ESC 2/4 4/2 ("$B")
                G1: ESC 2/4 2/9 4/2 ("$)B")
                G2: ESC 2/4 2/10 4/2 ("$*B")
                G3: ESC 2/4 2/11 4/2 ("$+B")

Reg.#102 is "Teletex Primary Set of Graphic Characters".
        Escapes into:
                G0: ESC 2/8 7/5 ("(u")
                G1: ESC 2/9 7/5 (")u")
                G2: ESC 2/10 7/5 ("*u")
                G3: ESC 2/11 7/5 ("+u")
        It is almost identical to ASCII, except that the ASCII position
        of '$' (DOLLAR SIGN) is occupied by '¤' (CURRENCY SIGN, U+00A4).
        Also, the ASCII positions of '`', '\', '^', '{', '}' and '~'
        are marked as "should not be used".
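
        As a rough sketch, the differences from ASCII described
        above could be expressed like this in Python. The function
        name is hypothetical, and the set of "should not be used"
        positions is simply the one listed in this email; it should
        be checked against the registration document before being
        relied upon.

```python
# Positions marked "should not be used" (taken from the list in this email).
IR102_UNUSED = {ord(c) for c in "`\\^{}~"}

def ir102_to_unicode(byte: int) -> str:
    """Map one Reg.#102 graphic byte (0x21..0x7e) to a Unicode character."""
    if not 0x21 <= byte <= 0x7E:
        raise ValueError("not a graphic position: 0x%02x" % byte)
    if byte == 0x24:
        return "\u00a4"            # CURRENCY SIGN instead of DOLLAR SIGN
    if byte in IR102_UNUSED:
        raise ValueError("position 0x%02x should not be used" % byte)
    return chr(byte)               # everything else coincides with ASCII

print(ir102_to_unicode(0x24))
```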

Reg.#103 is a supplementary set of characters used with #102.
        Escapes into:
                G0: ESC 2/8 7/6 ("(v")
                G1: ESC 2/9 7/6 (")v")
                G2: ESC 2/10 7/6 ("*v")
                G3: ESC 2/11 7/6 ("+v")
        Some characters in that character set are combining characters,
        which may only be used with certain basic Latin letters.
        It can be thought of as a subset of #156, with the exception
        of 4/12, which is UNDERLINE in #103 and absent in #156.

Reg.#106 is a primary set of control functions, used with #107.
        Escapes into:
                C0: ESC 2/1 4/5 ("!E")
        This set is so short I can list it here:
                0x08    BS      BACKSPACE       -- same as Unicode
                0x0a    LF      LINE FEED       -- same as Unicode
                0x0c    FF      FORM FEED       -- same as Unicode
                0x0d    CR      CARRIAGE RETURN -- same as Unicode
                0x0e    LS1     LOCKING SHIFT ONE
                0x0f    LS0     LOCKING SHIFT ZERO
                0x19    SS2     SINGLE SHIFT TWO
                0x1a    SUB     SUBSTITUTE CHARACTER
                0x1b    ESC     ESCAPE          -- same as Unicode
                0x1d    SS3     SINGLE SHIFT THREE
        LS1 and LS0 are two magical functions which, respectively,
        invoke the currently designated G1 or G0 set into positions 2/1
        to 7/14. SS2 and SS3, respectively, invoke one character of
        the currently designated G2 or G3 set.
        SUB is wholly equivalent to U+001A (SUBSTITUTE).
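
        To illustrate how I read the shift functions, here is a
        small Python sketch that only tracks which G set each
        7-bit graphic byte is taken from. It assumes single-byte
        sets, assumes that G0 is invoked at the start of the
        string (which is exactly what is in question below), and
        silently skips every other byte. The function name is
        made up.

```python
LS1, LS0, SS2, SS3 = 0x0E, 0x0F, 0x19, 0x1D

def tag_gl_bytes(data: bytes):
    """Return (g_set, byte) pairs for graphic bytes in 0x21..0x7e."""
    locked = "G0"     # assumption: G0 is invoked into GL initially
    single = None     # pending single shift, applies to one character
    out = []
    for b in data:
        if b == LS0:
            locked = "G0"
        elif b == LS1:
            locked = "G1"
        elif b == SS2:
            single = "G2"
        elif b == SS3:
            single = "G3"
        elif 0x21 <= b <= 0x7E:
            out.append((single or locked, b))
            single = None          # a single shift covers one character
        # all other bytes are ignored in this sketch
    return out

print(tag_gl_bytes(b"A\x0eB\x19CD\x0fE"))
```

        Here "C" is taken from G2 because of SS2, while "D" falls
        back to the locked G1 set until LS0 is seen.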

Reg.#107 is a supplementary set of control functions, used with #106.
        Escapes into:
                C1: ESC 2/2 4/8 ('"H')
        This set contains three special control codes:
                0x8b    PLD     PARTIAL LINE DOWN -- similar to <SUB>
                0x8c    PLU     PARTIAL LINE UP   -- similar to <SUP>
                0x9b    CSI     CONTROL SEQUENCE INTRODUCER
        PLD, PLU: these cannot be adequately represented in Unicode.
        CSI: since TeletexString has fixed meaning in ASN.1, appearance
        of this code is allowed in the TeletexString, yet the semantics
        of its appearance is not specified. Hence, it is probably
        an error if CSI is present in the stream.

Reg.#126 is a "Right-hand Part of the Latin/Greek Alphabet".
        Comprises 90 characters, including accented letters.
        Escapes into:
                G1: ESC 2/13 4/6 ("-F")
                G2: ESC 2/14 4/6 (".F")
                G3: ESC 2/15 4/6 ("/F")
        Note: This Registration is a subset of ISO-IR 227.

Reg.#144 is a "Cyrillic part of the Latin/Cyrillic Alphabet".
        Comprises 95 characters.
        Escapes into:
                G1: ESC 2/13 4/12 ("-L")
                G2: ESC 2/14 4/12 (".L")
                G3: ESC 2/15 4/12 ("/L")

Reg.#150 is a "Greek Primary Set of Graphic Characters".
        Comprises 94 characters.
        Escapes into:
                G0: ESC 2/8 2/1 4/0 ("(!@")
                G1: ESC 2/9 2/1 4/0 (")!@")
                G2: ESC 2/10 2/1 4/0 ("*!@")
                G3: ESC 2/11 2/1 4/0 ("+!@")

Reg.#153 is a "Basic Cyrillic Character Set for 8-bit codes".
        Comprises 68 characters.
        Escapes into:
                G1: ESC 2/13 4/15 ("-O")
                G2: ESC 2/14 4/15 (".O")
                G3: ESC 2/15 4/15 ("/O")

Reg.#156 is a "Supplementary Set of ISO/IEC 6937:1992" for use with #6.
        Comprises 87 characters.
        Escapes into:
                G1: ESC 2/13 5/2 ("-R")
                G2: ESC 2/14 5/2 (".R")
                G3: ESC 2/15 5/2 ("/R")

Reg.#164 is a "Hebrew Supplementary Set of Graphic Characters".
        Comprises 27 characters.
        Escapes into:
                G1: ESC 2/13 5/3 ("-S")
                G2: ESC 2/14 5/3 (".S")
                G3: ESC 2/15 5/3 ("/S")

Reg.#165 is a set of "Codes of the Chinese graphic character set".
        It is a multiple-byte set of 8446 characters.
        Escapes into:
                G0: ESC 2/4 2/8 4/5 ("$(E")
                G1: ESC 2/4 2/9 4/5 ("$)E")
                G2: ESC 2/4 2/10 4/5 ("$*E")
                G3: ESC 2/4 2/11 4/5 ("$+E")

Reg.#168 is a "Japanese Graphic Character Set for Information Interchange".
        It is a multiple-byte set of 6879 characters, updated from #87.
        Escapes into:
                G0: ESC 2/6 4/0 ESC 2/4 4/2 ("&@" "$B")
                G1: ESC 2/6 4/0 ESC 2/4 2/9 4/2 ("&@" "$)B")
                G2: ESC 2/6 4/0 ESC 2/4 2/10 4/2 ("&@" "$*B")
                G3: ESC 2/6 4/0 ESC 2/4 2/11 4/2 ("&@" "$+B")
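
For convenience, the escape sequences listed above can be collected
into a lookup table. The following Python sketch does that; all names
in it are made up for this illustration. One assumption goes beyond
the excerpts: Reg.#6 is listed above only for G0 and G1, and the
sketch also accepts it for G2 and G3, on the understanding that
ISO 2022 allows any 94-character set to be designated to any of
G0..G3.

```python
ESC = b"\x1b"
SLOTS = ("G0", "G1", "G2", "G3")

def build_designations() -> dict:
    """Map each escape sequence to (slot, registration number)."""
    table = {ESC + b"!E": ("C0", 106), ESC + b'"H': ("C1", 107)}
    for i, slot in enumerate(SLOTS):
        lead = bytes([0x28 + i])                      # 2/8 .. 2/11
        for final, reg in ((b"B", 6), (b"u", 102), (b"v", 103), (b"!@", 150)):
            table[ESC + lead + final] = (slot, reg)
        # Multiple-byte sets; G0 of #87 uses the short form ESC 2/4 4/2.
        jis = ESC + b"$" + (b"" if i == 0 else lead) + b"B"
        table[jis] = (slot, 87)
        table[ESC + b"&@" + jis] = (slot, 168)        # revision prefix
        table[ESC + b"$" + lead + b"E"] = (slot, 165)
    for i, slot in enumerate(SLOTS[1:]):
        lead = bytes([0x2D + i])                      # 2/13 .. 2/15
        for final, reg in ((b"F", 126), (b"L", 144), (b"O", 153),
                           (b"R", 156), (b"S", 164)):
            table[ESC + lead + final] = (slot, reg)
    return table

DESIGNATIONS = build_designations()

def match_designation(data: bytes, pos: int = 0):
    """Return ((slot, reg), length) for the longest match at pos, or None."""
    for seq in sorted(DESIGNATIONS, key=len, reverse=True):
        if data.startswith(seq, pos):
            return DESIGNATIONS[seq], len(seq)
    return None

print(match_designation(b"\x1b-R"))
```

Matching the longest sequence first matters for #168, whose sequences
begin with the ESC 2/6 4/0 prefix and then continue with those of #87.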


5. Questions

5.1 Reg.#107 contains
                0x8b    PLD     PARTIAL LINE DOWN -- similar to <SUB>
                0x8c    PLU     PARTIAL LINE UP   -- similar to <SUP>
                0x9b    CSI     CONTROL SEQUENCE INTRODUCER
    However, since TeletexString (T61String) is not defined by
    reference to ISO 2022, does it mean that CSI is not defined
    and should not appear?

5.2 The Reg.#106 defines locking shift functions, LS1, LS0 etc.
    My understanding is that these functions must do what they
    are supposed to do, that is, invoke G1/G0 into GL. Is that
    right?

5.3 The main question. What is the default state of GL and GR
    at the very beginning of the string?
    According to X.208 (I believe; I don't have it at hand),
    the default state for GL and GR is Reg.#102 and Reg.#103.
    However, this was just a reflection of the T.61 roots of
    the TeletexString type. The more modern counterparts of T.61
    (T.50/T.51) have subsequently defined the IRV explicitly
    a) through an alphabet identical to Reg.#6 and b) through an
    escape sequence identical to that of Reg.#6. I assume this
    fact is reflected in X.680 Table 6 NOTE 2.
    Since #102 and #6 differ practically only in the DOLLAR SIGN
    position, we can mentally merge #6 with #102, and #103 with #156,
    ignoring the code points that are undefined in either of them.
    However, the choice of the starting encoding may affect the use
    of the dollar sign ($) versus the currency sign (¤). Although #6
    and #102 are equal according to X.680, one of them has got to be
    "more equal".
    If we assume #6 is "more equal", then there is no controversy:
    both at the beginning of the string and after an explicit
    escape-sequence switch to #6 or #102 we always know whether 2/4
    is a dollar sign or a currency sign.
    However, if we assume #102 is "more equal", then we have a
    problem: T.50/T.51 define the beginning of the string as #6,
    so, according to the latest state of standardization, it
    probably cannot be #102 at the beginning of the string.
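
    To make the ambiguity concrete, here is a small Python sketch
    in which the initial G0 set is a parameter. It is deliberately
    incomplete: it knows only the two G0 designations ESC 2/8 4/2
    (Reg.#6) and ESC 2/8 7/5 (Reg.#102), and treats every other
    byte as ASCII. The function name and its parameter are made up.

```python
def decode_gl(data: bytes, initial_g0: int) -> str:
    """Decode bytes, honouring only explicit G0 switches between
    Reg.#6 (ESC ( B) and Reg.#102 (ESC ( u)."""
    g0 = initial_g0
    out = []
    i = 0
    while i < len(data):
        if data.startswith(b"\x1b(B", i):
            g0, i = 6, i + 3
        elif data.startswith(b"\x1b(u", i):
            g0, i = 102, i + 3
        else:
            b = data[i]
            # Only position 2/4 depends on which set is in G0.
            out.append("\u00a4" if (g0 == 102 and b == 0x24) else chr(b))
            i += 1
    return "".join(out)

print(decode_gl(b"$5", initial_g0=6))
print(decode_gl(b"$5", initial_g0=102))
```

    The same two bytes come out as a dollar amount or as a generic
    currency amount depending solely on the assumed initial state,
    whereas a string that starts with an explicit designation
    decodes the same way under either assumption.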

5.4 Can we assume CL has #106 and CR has #107 at the beginning
    of the string?


--
Lev Walkin
[EMAIL PROTECTED]
_______________________________________________
Asn1 mailing list
Asn1@asn1.org
http://lists.asn1.org/mailman/listinfo/asn1