Jordan,

Here is the link to the EDK II spec that states that UNI files must be UTF-16LE:
http://sourceforge.net/projects/edk2/files/Specifications/UNI_File_Spec_v1_2_Errata_A.pdf/download

I think there are good reasons to verify that the entire UNI file contains
only UCS-2 characters. Editors are usually in a single Unicode mode, so it
would be difficult, when using an editor in an extended mode, to know
whether only the string contents are UCS-2. Also, if we decide to add
helper tools that convert to UTF-16LE to follow the current EDK II UNI
specification, for releases or for compatibility with other tools, then
what would that helper tool do with a non-UCS-2 character? It is just
easier for the whole file to use the same character set.

Best regards,

Mike

-----Original Message-----
From: Justen, Jordan L
Sent: Tuesday, May 05, 2015 10:53 AM
To: Kinney, Michael D; edk2-devel@lists.sourceforge.net
Cc: Liu, Yingke D
Subject: RE: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files

On 2015-05-05 09:07:11, Kinney, Michael D wrote:
> Jordan,
>
> If we are going to add support for more UNI file formats, there are
> also EDK II specifications that must be updated.

I don't know about that. I at least looked under BaseTools/UserManuals,
and didn't find anything obvious to update.

I'm not sure who owns updating other specs, but they can probably make
the changes to the spec when new features appear in the code base. (The
old UTF-16 support should still work as spec'd.)

Wouldn't this be more of a concern for future UDK releases?

> I am not sure I agree with only checking that the string value has
> supported Unicode characters. If the Name or Language elements have
> unsupported Unicode characters, then that will cause problems too. I
> think I would prefer the entire file, including all comment
> lines/blocks, to only contain the supported Unicode characters.

If you are referring to comments in the UTF-8 version of the file, why
constrain them? Obviously there are technical reasons for constraining
string values, but the same need not apply to comments.
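For readers following along, Mike's whole-file argument boils down to a single
check. A minimal sketch in modern Python 3 (the helper name is hypothetical,
and this is not the BaseTools code, which is Python 2):

```python
def is_strict_ucs2(data: bytes, encoding: str = "utf-16-le") -> bool:
    """Return True if the file decodes cleanly under the given encoding
    and every resulting code point fits in UCS-2 (no supplementary-plane
    characters, which would need UTF-16 surrogate pairs)."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return all(ord(ch) <= 0xFFFF for ch in text)

# U+00E9 fits in UCS-2; U+1F600 needs a surrogate pair and does not.
ok = is_strict_ucs2("abc \u00e9".encode("utf-16-le"))
bad = is_strict_ucs2("abc \U0001F600".encode("utf-16-le"))
```

Applied to the whole file, this catches stray characters in names, languages,
and comments alike, which is exactly the "one character set for the entire
file" position above.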
> I like the addition of OpenUniFile() so we have one place to update

... open place, except apparently UPT has duplicated most of this code. :)

-Jordan

> if we decide to add support for more file formats. Please look at
> the code fragments below that I think can simplify the logic and
> make it more readable/maintainable by taking advantage of more of
> the codecs module functions and constants. These code fragments are
> not based on the current trunk, so there may be some unexpected
> differences.
>
> The codecs module has some constants that can improve the
> readability of this logic. The following code fragment detects the
> BOM marker and determines the encoding.
>
>     #
>     # Read file
>     #
>     try:
>         FileIn = open(LongFilePath(File.Path), mode='rb').read()
>     except:
>         EdkLogger.Error("build", FILE_OPEN_FAILURE, ExtraData=File)
>
>     #
>     # Detect Byte Order Mark at beginning of file. Default to UTF-8
>     #
>     Encoding = 'utf-8'
>     if FileIn.startswith(codecs.BOM_UTF16_BE):
>         Encoding = 'utf-16be'
>         FileIn = FileIn.lstrip(codecs.BOM_UTF16_BE)
>     elif FileIn.startswith(codecs.BOM_UTF16_LE):
>         Encoding = 'utf-16le'
>         FileIn = FileIn.lstrip(codecs.BOM_UTF16_LE)
>     elif FileIn.startswith(codecs.BOM_UTF8):
>         Encoding = 'utf-8-sig'
>         FileIn = FileIn.lstrip(codecs.BOM_UTF8)
>
> The following code fragment uses the codecs module and the encoding
> detected above to verify that all the characters in a UNI file are
> legal UCS-2 characters. If an invalid character is detected, then
> additional logic is run in an except clause to determine the line
> number of the invalid character.
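A standalone Python 3 rendering of that BOM-sniffing fragment, for reference
(function name is hypothetical; note it slices the BOM off rather than using
lstrip(), since bytes.lstrip() strips any run of the BOM's byte values, not
just the BOM itself):

```python
import codecs

def detect_encoding(data: bytes):
    """Pick an encoding from a leading BOM; default to UTF-8 when none."""
    for bom, encoding in ((codecs.BOM_UTF16_BE, "utf-16-be"),
                          (codecs.BOM_UTF16_LE, "utf-16-le"),
                          (codecs.BOM_UTF8, "utf-8")):
        if data.startswith(bom):
            # Slice, don't lstrip(): lstrip(b"\xff\xfe") would also eat
            # real leading 0xFF/0xFE content bytes after the BOM.
            return encoding, data[len(bom):]
    return "utf-8", data

encoding, body = detect_encoding(codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le"))
```

The three BOMs checked here have no common prefix, so the test order does not
matter for this set (it would if UTF-32 BOMs were added, since the UTF-32LE
BOM starts with the UTF-16LE BOM bytes).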
>     #
>     # Convert to unicode
>     #
>     try:
>         FileIn = codecs.decode(FileIn, Encoding)
>         Verify = codecs.encode(FileIn, 'utf-16')
>         Verify = codecs.decode(Verify, 'utf-16')
>     except:
>         FileIn = codecs.open(LongFilePath(File.Path), encoding=Encoding, mode='r')
>         LineNumber = 0
>         while True:
>             LineNumber = LineNumber + 1
>             try:
>                 Line = FileIn.readline()
>                 if Line == '':
>                     EdkLogger.error('Unicode File Parser', PARSER_ERROR,
>                         '%s contains invalid UCS-2 characters.' % (File.Path))
>                 Line = codecs.encode(Line, 'utf-16')
>                 Line = codecs.decode(Line, 'utf-16')
>             except:
>                 EdkLogger.error('Unicode File Parser', PARSER_ERROR,
>                     '%s contains invalid UCS-2 character at line %d.' % (File.Path, LineNumber))
>
> Best regards,
>
> Mike
>
> -----Original Message-----
> From: Justen, Jordan L
> Sent: Tuesday, May 05, 2015 12:09 AM
> To: edk2-devel@lists.sourceforge.net
> Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D
> Subject: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files
>
> Since UEFI only supports UTF-16LE strings internally, this simply
> allows another Unicode encoding for the source file.
>
> The strings are still converted to UTF-16LE data for use in EDK II
> source code.
>
> When .uni files contain UTF-16 data, it is impossible for Unicode code
> points to be larger than 0xFFFF. To support .uni files that contain
> UTF-8 data, we also need to deal with the possibility that the
> UTF-8 file contains Unicode code points larger than 16 bits. Since
> UEFI only supports 16-bit string data, we make UniClassObject generate
> an error if a larger code point is seen in a UTF-8 string value.
>
> We only check string value data, so it is possible to use larger code
> points in comments.
>
> v2:
> * Drop .utf8 extension.
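The two-pass idea above (cheap whole-file check, slow per-line pass only on
failure) reduces to locating the first offending line. A Python 3 sketch of
just that localization step (the function name is hypothetical, and error
reporting via EdkLogger is replaced by a return value):

```python
def find_bad_line(data: bytes, encoding: str = "utf-8"):
    """Return the 1-based number of the first line containing a code point
    above 0xFFFF, or None when the whole file is UCS-2-clean."""
    text = data.decode(encoding)
    for number, line in enumerate(text.splitlines(), start=1):
        # Code points above 0xFFFF cannot be stored in a 16-bit CHAR16
        # without surrogate pairs, so a strict UCS-2 file rejects them.
        if any(ord(ch) > 0xFFFF for ch in line):
            return number
    return None

# U+1F600 (a 4-byte UTF-8 sequence) on the second line trips the check.
bad_line = find_bad_line("#string STR_OK\n#string STR_BAD \U0001F600\n".encode("utf-8"))
```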
>   Use .uni file for UTF-8 data (mdkinney)
> * Merge in 'BaseTools/UniClassObject: Verify string data is 16-bit'
>   commit
>
> Cc: Yingke D Liu <yingke.d....@intel.com>
> Cc: Michael D Kinney <michael.d.kin...@intel.com>
> Contributed-under: TianoCore Contribution Agreement 1.0
> Signed-off-by: Jordan Justen <jordan.l.jus...@intel.com>
> ---
>  BaseTools/Source/Python/AutoGen/UniClassObject.py | 38 +++++++++++++++++++++--
>  1 file changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> index aa54f4f..41448ab 100644
> --- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
> +++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> @@ -209,7 +209,7 @@ class UniFileClassObject(object):
>          Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
>          if len(Lang) != 3:
>              try:
> -                FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
> +                FileIn = self.OpenUniFile(LongFilePath(File.Path))
>              except UnicodeError, X:
>                  EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
>              except:
> @@ -253,6 +253,38 @@ class UniFileClassObject(object):
>          self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
>          return True
>
> +    def OpenUniFile(self, FileName):
> +        Encoding = 'utf-8'
> +        UniFile = open(FileName, 'rb')
> +
> +        #
> +        # Seek to end of file to determine its size
> +        #
> +        UniFile.seek(0, 2)
> +        FileSize = UniFile.tell()
> +
> +        if FileSize >= 2:
> +            #
> +            # Seek to start of the file to read the UTF-16 BOM
> +            #
> +            UniFile.seek(0, 0)
> +            Bom = UniFile.read(2)
> +            UniFile.seek(0, 0)
> +
> +            if Bom == '\xff\xfe':
> +                Encoding = 'utf-16'
> +
> +        Info = codecs.lookup(Encoding)
> +        return codecs.StreamReaderWriter(UniFile, Info.streamreader, Info.streamwriter)
> +
> +    def Verify16bitCodePoints(self, String):
> +        for cp in String:
> +            if ord(cp) > 0xffff:
> +                tmpl = 'The string {} defined in file {} ' + \
> +                       'contains a character with a code point above 0xFFFF.'
> +                error = tmpl.format(repr(String), self.File)
> +                EdkLogger.error('Unicode File Parser', FORMAT_INVALID, error)
> +
>      #
>      # Get String name and value
>      #
> @@ -274,6 +306,7 @@ class UniFileClassObject(object):
>                  Language = LanguageList[IndexI].split()[0]
>                  Value = LanguageList[IndexI][LanguageList[IndexI].find(u'\"') + len(u'\"') : LanguageList[IndexI].rfind(u'\"')] #.replace(u'\r\n', u'')
>                  Language = GetLanguageCode(Language, self.IsCompatibleMode, self.File)
> +                self.Verify16bitCodePoints(Value)
>                  self.AddStringToList(Name, Language, Value)
>
>      #
> @@ -305,7 +338,7 @@ class UniFileClassObject(object):
>          EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
>
>      try:
> -        FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
> +        FileIn = self.OpenUniFile(LongFilePath(File.Path))
>      except UnicodeError, X:
>          EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
>      except:
> @@ -426,6 +459,7 @@ class UniFileClassObject(object):
>              MatchString = re.match('[A-Z0-9_]+', Name, re.UNICODE)
>              if MatchString == None or MatchString.end(0) != len(Name):
>                  EdkLogger.error('Unicode File Parser', FORMAT_INVALID, 'The string token name %s defined in UNI file %s contains the invalid lower case character.' % (Name, self.File))
> +            self.Verify16bitCodePoints(Value)
>              self.AddStringToList(Name, Language, Value)
>              continue
>
> --
> 2.1.4
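The Verify16bitCodePoints() check in the patch above amounts to the following
standalone Python 3 sketch (names are illustrative, and a ValueError stands in
for the real code's EdkLogger.error call):

```python
def verify_16bit_code_points(file_name: str, value: str) -> None:
    """Raise ValueError if the string value holds a code point that a
    16-bit UEFI CHAR16 string cannot represent without surrogate pairs."""
    for ch in value:
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "%s contains code point U+%04X above 0xFFFF in %r"
                % (file_name, ord(ch), value))

verify_16bit_code_points("Strings.uni", "Caf\u00e9")  # BMP-only value: accepted
try:
    verify_16bit_code_points("Strings.uni", "Hi \U0001F600")
    raised = False
except ValueError:
    raised = True
```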
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/edk2-devel