Re: [edk2] [RFC] EDK II UNI Unicode File Specification

Hauch, Larry Tue, 19 Aug 2014 14:57:35 -0700

Hi Tim,

I agree with your proposal for #1. - See highlighting in the original RFC text 
(add and delete).


For #2 & 3, after looking through the UEFI specifications, the escape character 
sequences are not defined for the code, so I have removed the majority of them.
The only ones that are now listed in the EBNF are the ones that are currently 
processed by the EDK II build system (see 
Source\Python\AutoGen\UniClassObject.py).

Actually, I would like to see a proposal that would update both this spec and 
our tools for alternate methods for specifying the control codes listed in UEFI 
Spec v2.4B (section 29.2.6.2.4). Using something like "\" (a-zA-Z)+ ";" that 
uses the semi-colon as the terminator for the control code, or use hypertext 
markings like: "<" ["\"] (a-zA-Z)+ ">" as an additional method that must be 
supported by the EDK II Build tools. Since the tools have to convert these into 
the Double-Byte encodings specified in the UEFI Spec, they do need to be 
defined in a table.
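
To make the first alternative concrete, here is a minimal sketch of how a tool might expand semicolon-terminated control codes. It is not taken from UniClassObject.py; the names CONTROL_CODES and expand_terminated are hypothetical, and the table holds only a few entries copied from Table 1 of the draft below.

```python
import re

# Hypothetical subset of the control-code table; values come from Table 1
# of this draft. A real tool would carry the full table.
CONTROL_CODES = {
    "bold": 0xF620,
    "endbold": 0xF621,
    "italic": 0xF622,
    "enditalic": 0xF623,
}

# "\" (a-zA-Z)+ ";" : the semicolon ends the name, so no lookahead guessing.
TERMINATED = re.compile(r"\\([A-Za-z]+);")


def expand_terminated(text):
    """Replace each \\name; sequence with its double-byte encoding."""
    def substitute(match):
        name = match.group(1)
        if name not in CONTROL_CODES:
            raise ValueError("unknown control code: " + name)
        return chr(CONTROL_CODES[name])

    return TERMINATED.sub(substitute, text)
```

With a terminator, "\bold;er" is always bold followed by "er", and an unknown name such as "\bolder;" is rejected instead of being silently split.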

I would also suggest that we do not add any unterminated control code sequences 
for any new content that must be supported. (Only \wide, \narrow and \nbr along 
with the standard \n, \r and \t escape sequences would ever be non-terminated).

Cheers,
Larry


From: Tim Lewis [mailto:tim.le...@insyde.com]
Sent: Monday, August 18, 2014 4:17 PM
To: edk2-devel@lists.sourceforge.net
Subject: Re: [edk2] [RFC] EDK II UNI Unicode File Specification

Larry -

The description of the extensions for modules/package abstracts/description are 
much better.

Here are a few comments which are not specific to your update (although they 
are contained in the text below)


1.       It is readable. I do think that adding <> terminals for single 
characters makes it harder to read, but otherwise the text is clear. Why not 
"/" instead of <FS> and "(" instead of <LP>?

2.       I don't think there is any UEFI spec requirement that a \endbold be 
preceded by a \bold. Since the font for any string may include the bold 
attribute, it may be that the \endbold might be desirable. This is further 
complicated by the fact that the .UNI specification does not provide 
font-select capabilities.

3.       The current escape character mechanism prevents future expansion, 
because the escape sequences are neither fixed length nor well-delimited. 
Consider what would happen if someone wanted to add \bolder to the grammar. 
This would make older strings suspect, since it could be interpreted as either 
"\bold" and "er" or "\bolder". I mentioned this long ago.
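
The ambiguity in #3 can be shown with a small sketch. The function candidate_escapes and the two grammar sets are hypothetical and exist only for illustration; "\bolder" is the made-up extension from the example above, not a proposed escape.

```python
def candidate_escapes(text, names):
    """List every known escape name that could start `text` after its backslash."""
    if not text.startswith("\\"):
        return []
    body = text[1:]
    return sorted(name for name in names if body.startswith(name))


old_grammar = {"bold", "endbold"}
new_grammar = old_grammar | {"bolder"}  # hypothetical future extension
sample = r"\bolder text"

print(candidate_escapes(sample, old_grammar))  # ['bold']
print(candidate_escapes(sample, new_grammar))  # ['bold', 'bolder']
```

The same string has one reading under the old grammar and two under the extended one, which is why an existing string becomes suspect once the new name is added.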

Tim

From: Hauch, Larry [mailto:larry.ha...@intel.com]
Sent: Monday, August 18, 2014 3:54 PM
To: edk2-devel@lists.sourceforge.net<mailto:edk2-devel@lists.sourceforge.net>
Subject: [edk2] [RFC] EDK II UNI Unicode File Specification

Hi Folks,

Here are the proposed changes to the EDK II UNI Unicode File Specification. 
Hopefully, HTML format for the chapters will be easier to review and respond 
with feedback.
Please provide feedback by the end of this week (22 Aug. 2014).


Updates:

*          Updated EBNF to follow syntax specified in EBNF by the ANTLR project

*          Added content related to EDK II Meta-Data Unicode files

*          Restructured document

*          Removed security and C format GUID definitions, not required for HII 
or other UNI files.

Cheers,
Larry
2 Unicode Strings File Format
EDK II Unicode files are used for mapping token names to localized strings that 
are identified by an RFC4646 language code. The format for storing EDK II 
Unicode files is UTF-16LE. The character content must be UCS-2.
String ends are determined by the first of the following items found:

*                     a control character

*                     a comment

*                     the end of the file

*                     a blank line
Comments may appear anywhere within the string file.
All the files must begin with a Unicode BOM character.

Note:    Please make sure you select an editor that supports UCS-2 characters 
that can be stored in a UTF-16LE file.
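
As an illustration of the encoding rules above, the following sketch checks that a file's bytes begin with the UTF-16LE BOM and contain only UCS-2 characters. The function name decode_uni is hypothetical; the sketch assumes the whole file has already been read into memory as bytes.

```python
import codecs


def decode_uni(data):
    """Decode UTF-16LE bytes that start with a BOM; reject non-UCS-2 content."""
    if not data.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("file must begin with a UTF-16LE byte order mark")
    text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    for ch in text:
        # A surrogate pair decodes to one code point above U+FFFF,
        # which UCS-2 cannot represent.
        if ord(ch) > 0xFFFF:
            raise ValueError("U+%04X is outside UCS-2" % ord(ch))
    return text
```

A caller would typically pass the result of open(path, "rb").read() and hand the returned text to the parser.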

2.1 Common EBNF
The following EBNF uses characters enclosed in double quotes to represent 
UCS-2 string literals. In the following definitions, the semicolon denotes 
the start of a comment.


<US>                ::=  \u0020" "           ; Space Character

<FW>                ::=  \u002F           ; Forward Slash, /


<LP>                ::=  \u0028           ; Left Parenthesis, (

<RP>                ::=  \u0029           ; Right Parenthesis, )

<Letter>            ::=  {(\u0041-\u005A)} ; Characters A - Z
                         {(\u0061-\u007A)} ; Characters a - z

<Digit>             ::=  (\u0030-\u0039)   ; Characters 0 - 9

<UN>                ::=  \u005F            ; Underscore Character, _

<MS>                ::=  <US>+

<ME>                ::=  {<MS>} {<EOL>}

<CommentLine>       ::=  <FW> <FW>"//" <US>* <PChars> <EOL>

<BlankLine>         ::=  <EOL>

<Chars>             ::=  (\u0001-\uF6FF)



<PChars>            ::=  (\u0020-\uF6FF)



<VChars>            ::=  (\u0021-\uF6FF)



<UnicodeLines>      ::=  <Token> <ME>

                         [<Ldef> [<String> <ME>]+]+



<Ldef>              ::=  <CtrlChar> "language" <MS> <LangCode> <ME>



<HexDigitU>         ::=  {<Digit>}

                         {(\u0041-\u0046)} ; Characters A - F

                         {(\u0061-\u0066)} ; Characters a - f



<CtrlChar>          ::=  \u0023"#"           ; Hash Character, #



<Token>             ::=  <CtrlChar> "string" <MS> <Identifier>



<Identifier>        ::=  <Letter> [{<Letter>} {<Digit>} {<UN>}]*



<LangCode>          ::=  <RFC4646>



<DH>                ::=  (\u002D)"-"         ; Dash Character, -



<RFC4646>           ::=  <Letter>{2,8} [<ShortExt> <LongExt>*]



<ShortExt>          ::=  <DH>"-" [{<Letter>} {<Digit>}]{1,8}



<LongExt>           ::=  <DH>"-" [{<Letter>} {<Digit>}]{1,}



<UDblQuote>         ::=  \u0022           ; Double Quote Character, "



<String>            ::=  <UDblQuote> <SContent>* <UDblQuote>



<SContent>          ::=  {<PChars>} {<Attributes>} {<CtrlCode>}



<Attributes>        ::=  <StartAttribute> <SContent>*

                        [<StopAttribute>]



<StartAttribute>    ::=  <AttrCtrlChar> <FontAttr>



<AttrCtrlChar>      ::=  \u005C"\"           ; Backslash Character, \



<StopAttribute>     ::=  <AttrCtrlChar> "end" <FontAttr>



<FontAttr>          ::=  {<SimpleAttrs>} {<StandardAttrs>}



<SimpleAttrs>       ::=  "\" {"narrow"} {"wide"} {<UDblQuote>}

                         {"n"} {"r"} {"t"} {"nbr"} {"\"} {"'"}



<StandardAttrs>     ::=  {"normal"} {"bold"} {"italic"}

                         {"emboss"}

                         {"shadow"} {"underline"} {"dblunder"}



<CtrlCode>          ::=  <EscChar> {"n"} {"f"} {"r"} {"p"}

                         {"ospace"} {"enquad"} {"emquad"}

                         {"ensp"} {"emsp"} {"em3sp"} {"em4sp"}

                         {"em6sp"} {"usp"} {"tsp"} {"hsp"}

                         {"msp"} {"!bsp"} {"!nbsp"}

                         {"zsp"} {"ah"} { "hy"} { "df"} {"den"}

                         {"dem"} {"!bh"} {"g"} {"osp"} {"k"}



<EscChar>           ::=  \u005C"\"           ; Backslash Character, \





2.1.1 Definitions
LanguageCodes
The language code must be a valid RFC4646 language code.
EscChar
In order to include some standard characters, such as the "\" back-slash 
character within a string, the character must be prefixed with the escape 
character. Characters that may require a prefixed escape character include the 
following, back slash "\" character, single-quote "'" character, double-quote 
'"' character and the forward slash "/" character. The back slash always 
requires the escape character.
StandardAttrs
The standard font attribute, "normal" was not defined in the UEFI 
Specification; however it has been proposed and is included here. Additional 
attributes defined in the UEFI Specification, such as double underline 
(dblunder), did not have the double-byte encoding for the character mapping, 
however recommendations have been given for these characters (see below).
Token
The token (string identifier) may only contain numbers, upper and lower case 
letters, underscore character, and dash character.
Include
An include line is used to parse another file, also compliant with this 
specification, as if its content were part of the including file.  Tokens 
should not overlap between the files for the same language.
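
The <Identifier> and <RFC4646> productions in section 2.1 translate fairly directly into regular expressions. The sketch below follows the EBNF as written (so it does not accept the dash that the Token definition above mentions) and is deliberately looser than full RFC4646 validation; the names IDENTIFIER, LANG_CODE, is_identifier and is_lang_code are hypothetical.

```python
import re

# <Identifier> ::= <Letter> [{<Letter>} {<Digit>} {<UN>}]*
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# <RFC4646>  ::= <Letter>{2,8} [<ShortExt> <LongExt>*]
# <ShortExt> ::= "-" (letter | digit){1,8}
# <LongExt>  ::= "-" (letter | digit){1,}
LANG_CODE = re.compile(r"[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]+)*)?")


def is_identifier(text):
    """True if `text` matches the <Identifier> production."""
    return IDENTIFIER.fullmatch(text) is not None


def is_lang_code(text):
    """True if `text` matches the draft's simplified <RFC4646> production."""
    return LANG_CODE.fullmatch(text) is not None
```

For example, is_lang_code("en-US") and is_identifier("STR_PROCESSOR_VERSION") both hold, while a one-letter language code or an identifier starting with a digit is rejected.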


Table 1 HII Double-Byte Encoding Map

String                      Double-Byte Encoding  String                                 Double-Byte Encoding
\bold                       0xF620                \endbold                               0xF621
\italic                     0xF622                \enditalic                             0xF623
\underline                  0xF624                \endunderline                          0xF625
\dblunder                   0xF62A                \enddblunder                           0xF62B
\emboss                     0xF626                \endemboss                             0xF627
\shadow                     0xF628                \endshadow                             0xF629
\n (newline)                0x2028                \f (formfeed)                          0x000C
\r (carriage return)        0x000D                \p (paragraph separator)               0x2029
\ospace (ogham space mark)  0x1680                \enquad                                0x2000
\emquad                     0x2001                \ensp (en space)                       0x2002
\emsp                       0x2003                \em3sp (three-per-em space)            0x2004
\em4sp                      0x2005                \em6sp                                 0x2006
\usp (punctuation space)    0x2008                \tsp (thin space)                      0x2009
\hsp (hair space)           0x200A                \msp (medium math space)               0x205F
\!bsp (no-break space)      0x00A0                \!nbsp (narrow no-break space)         0x202F
\zsp (zero width space)     0x200B                \ah (Armenian hyphen)                  0x058A
\hy (hyphen)                0x2010                \df (figure dash)                      0x2012
\den (en dash)              0x2013                \dem (em dash)                         0x2014
\!bh (non-breaking hyphen)  0x2011                \g (Tibetan mark intersyllabic tsheg)  0x0F0B
\osp (Ethiopic wordspace)   0x1361                \k (Khmer sign bariyoosan)             0x17D5
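
A tool that applies Table 1 to the current, unterminated escape syntax needs some rule for names that are prefixes of other names (for example \osp and \ospace). The sketch below assumes a longest-match rule, which this draft does not mandate; ENCODINGS is a hypothetical subset of the table, encode_escapes is a hypothetical name, and the escaped backslash is not handled.

```python
# Hypothetical subset of Table 1; a full implementation would carry every row.
ENCODINGS = {
    "bold": 0xF620,
    "endbold": 0xF621,
    "italic": 0xF622,
    "enditalic": 0xF623,
    "osp": 0x1361,
    "ospace": 0x1680,
}

# Longer names are tried first so "\ospace" is not read as "\osp" + "ace".
NAMES_LONGEST_FIRST = sorted(ENCODINGS, key=len, reverse=True)


def encode_escapes(text):
    """Replace each backslash escape with its double-byte encoding."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\":
            rest = text[i + 1:]
            name = next((n for n in NAMES_LONGEST_FIRST if rest.startswith(n)), None)
            if name is None:
                raise ValueError("unknown escape at offset %d" % i)
            out.append(chr(ENCODINGS[name]))
            i += 1 + len(name)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Longest-match resolves prefix clashes among names already in the table, but it cannot tell whether an author meant \osp followed by the letters "ace", which is the limitation raised earlier in this thread.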



3 HII String Packs

Unicode files used for creating HII String Packs have the following format:
<StringFileFormat>  ::=  <CommentLine>*
                         <LanguageDefs>
                         <Content>+

The following EBNF describes content that is specific to the Unicode files used 
for generating HII String Packs.
<Content>           ::=  {<CommentLine>}  {<BlankLine>}
                         {<UnicodeLines>} {<ControlRefactor>}
                         {<LanguageDefs>} {<SecurityLines>}
                         {<IncludeLines>}


Additional Definitions used for Unicode files used to create HII String Packs.
<LanguageDefs>      ::=  <CtrlChar> "langdef" <MS> <LangCode> <MS>
                         <LangDesc> <EOL>


<LangDesc>          ::=  <UDblQuote> <Chars> <UDblQuote>



<IncludeLines>      ::=  <CtrlChar> "include" <UniFile> <EOL>



<UniFile>           ::=  <UDblQuote> <UniFilename> <UDblQuote>



<UniFilename>       ::=  <FilenameChars> <MoreFNameChars>* {".uni"}
                         {".UNI"}



<FilenameChars>     ::=  {<Letter>} {<Digit>}



<MoreFNameChars>    ::=  {<Letter>} {<Digit>} {<UN>"_"}



<DefaultCtrlChar>   ::=  <FW>"/"



<EQ>                ::=  \u003D             ; Equal Character, =



<ControlRefactor>   ::=  <CtrlChar> <EQ>"=" <NewCtrlChar> <EOL>



<NewCtrlChar>       ::=  (0x0021 - 0xF6FF)

Note:    Unicode files that are used for generating HII String Packs are the 
only type of Unicode file that allows for refactoring the control character 
(providing backward compatibility), <CtrlChar>.

3.1 Example file:

//
// Cpu I/O Strings
//
// Copyright (c) 2006, Intel Corporation. All rights reserved.<BR>
//
// This program and the accompanying materials are licensed and made
// available under the terms and conditions of the BSD License which
// accompanies this distribution.  The full text of the license may be
// found at:
//    http://opensource.org/licenses/bsd-license.php
//
// THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS,
// WITHOUT WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS
// OR IMPLIED.
//

/=#

#langdef  en-US "English, US"
#langdef  fr-FR "Français"

#string STR_PROCESSOR_VERSION
#language en-US
"NT32 Emulated Processor"
#language fr-FR
"Processeur Émulé par NT32"
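
For readers who want to experiment with the example, here is a minimal sketch of a reader for files laid out exactly like it: one directive or one quoted string per line. It is not the EDK II build tool parser; parse_uni is a hypothetical name, the "/=#" line is simply skipped on the assumption that "#" is the control character, and escapes and #include are ignored.

```python
import re


def parse_uni(text):
    """Return (languages, strings) parsed from the text of a simple UNI file."""
    languages = {}  # language code -> description
    strings = {}    # token -> {language code -> string}
    token = lang = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//") or line == "/=#":
            continue
        if line.startswith("#langdef"):
            match = re.fullmatch(r'#langdef\s+(\S+)\s+"(.*)"', line)
            if match is None:
                raise ValueError("malformed #langdef line: %r" % line)
            languages[match.group(1)] = match.group(2)
        elif line.startswith("#string"):
            token = line.split()[1]
            strings[token] = {}
            lang = None
        elif line.startswith("#language"):
            lang = line.split()[1]
        elif line.startswith('"') and line.endswith('"') and token and lang:
            # Consecutive quoted lines for one language are concatenated.
            strings[token][lang] = strings[token].get(lang, "") + line[1:-1]
        else:
            raise ValueError("unrecognized line: %r" % line)
    return languages, strings
```

Applied to the example above, this yields two language definitions and one token, STR_PROCESSOR_VERSION, with an en-US and an fr-FR string.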



4 Meta-Data UNI Files

In order to support distributions conforming to the UEFI PI Distribution 
Package Specification, Unicode files may be used to contain localization 
content passed along in the XML file for content that cannot be passed using 
ASCII characters.

Literal strings (encapsulated by double quotation marks) in the following EBNF 
represent UCS-2 encoded character strings.

The format of the Unicode files that contain the optional Module and Package 
localization content for distribution is as follows:

<MetaUniFile>  ::= <CommentLine>*

                   <MetaData>+



<MetaData>     ::= {<CommentLine>} {<BlankLine>} {<UnicodeLines>}

Additional Definitions used for Package Meta-Data <Identifier> entry of a 
<Token> used in the Unicode file.

<CtrlChar>     ::= "#"



<ErrorNumber> ::= <HexDigitU>{1,8}



<PcdName>     ::= <CName> <UN>"_" <CName>



<CName>       ::= <Letter>({<Letter>} {<Digit>})*

Refer to Chapter 2.1, Unicode Strings File Format, Extended Backus-Naur Form 
(EBNF) for the definitions of CommentLine, BlankLine and UnicodeLines.

It is also recommended that the comment section at the start of the files 
(described in the following sections of this chapter) use content consistent 
with content described for EDK II meta-data file headers, including a start tag 
line, "// @file", and include an abstract, description, copyright and license 
information.

4.1 Module Meta-Data

If a Module's INF file contains a MODULE_UNI_FILE entry in its [Defines] 
section, then the Unicode file specified may contain localization extensions 
for information found in the Module's Abstract, Description, Copyright and 
Licenses part of the @file header described in the "EDK II Module 
Information (INF) File Specification".

The following <Identifier> entries are reserved for extending the Module's 
Abstract and Description content.

"STR_MODULE_ABSTRACT"

"STR_MODULE_DESCRIPTION"

If a Module's INF file contains a Unicode file entry in its 
[UserExtensions.TianoCore."ExtraFiles"] section, then that Unicode file may 
contain a localized version of a name for the module as well as other content. 
This file is used to hold content that is not required by UEFI PI Distribution 
Package, but may be useful for User Interface tools.

The following <Identifier> may be used to extend the name of the module.

"STR_PROPERTIES_MODULE_NAME"

Other content may be provided in this file as the file itself will be carried 
along with the Module in a UEFI PI Distribution Package.

4.2 Package Meta-Data

If a Package's DEC file contains a PACKAGE_UNI_FILE entry in its [Defines] 
section, then the Unicode file specified may contain localization extensions 
for information found in the Package's Abstract, Description, Copyright and 
Licenses part of the @file header described in the "EDK II Package 
Declaration (DEC) File Specification". It may also contain content relevant to 
PCDs declared in the package.

The following <Identifier> entries are reserved for extending the Package's 
Abstract and Description content.

"STR_PACKAGE_ABSTRACT"

"STR_PACKAGE_DESCRIPTION"

The following <Identifier> is reserved for extending the localization of a 
Token Space GUID's error messages that are referenced by an error number. The C 
Name is the Token Space GUID's C Name declared in the DEC file's [Guids] 
section.

"STR_" <CName> "_ERR_" <ErrorNumber>

The following <Identifier> entries are reserved for extending the localization 
of a PCD's @HELP and @PROMPT content.

"STR_" <PcdName> "_HELP"

"STR_" <PcdName> "_PROMPT"
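
The reserved identifier patterns above can be assembled mechanically from the C names in the DEC file. The sketch below follows the <CName>, <PcdName> and <ErrorNumber> productions as written in this draft; pcd_string_ids and error_string_id are hypothetical helper names, and the example names in the usage note are illustrative only.

```python
import re

# <CName> ::= <Letter>({<Letter>} {<Digit>})*
C_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def _check_cname(name):
    if C_NAME.fullmatch(name) is None:
        raise ValueError("not a valid <CName>: %r" % name)


def pcd_string_ids(token_space_cname, pcd_cname):
    """Return the (prompt, help) identifiers reserved for one PCD."""
    _check_cname(token_space_cname)
    _check_cname(pcd_cname)
    pcd_name = token_space_cname + "_" + pcd_cname  # <PcdName>
    return "STR_" + pcd_name + "_PROMPT", "STR_" + pcd_name + "_HELP"


def error_string_id(token_space_cname, error_number):
    """Return the identifier reserved for one Token Space GUID error message."""
    _check_cname(token_space_cname)
    if not 0 <= error_number <= 0xFFFFFFFF:
        raise ValueError("<ErrorNumber> must fit in 1 to 8 hex digits")
    return "STR_%s_ERR_%X" % (token_space_cname, error_number)
```

For instance, error_string_id("gTokenSpaceGuid", 0x80000001) gives "STR_gTokenSpaceGuid_ERR_80000001", which can then be used as the token of a #string entry.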

If a Package's DEC file contains a Unicode file entry in its 
[UserExtensions.TianoCore."ExtraFiles"] section, then that Unicode file may 
contain a localized version of a name for the package as well as other content. 
This file is used to hold content that is not required by UEFI PI Distribution 
Package, but may be useful for User Interface tools.

The following <Identifier> may be used to extend the name of the package.

"STR_PROPERTIES_PACKAGE_NAME"

Other content may be provided in this file as the file itself will be carried 
along with the Package in a UEFI PI Distribution Package.
