Jordan,

The functionality of the patch set looks good. I just have a few notes inline
below about places that need better code comments.
Reviewed-by: Michael Kinney <michael.d.kin...@intel.com>

We also discussed the need for an extra tool that can scan a workspace or a
package for UNI files and convert them to UTF-16LE. This tool may be required
when releases are generated, to guarantee compatibility with other tools that
operate on UNI files and require them to be UTF-16LE. Do you think this
helper tool should be implemented in BaseTools? We may want the same helper
tool to convert between any of the supported encodings, so it can also be
used to convert current UTF-16LE files to UTF-8. (A rough sketch of what I
have in mind is at the end of this mail.)

Thanks,

Mike

-----Original Message-----
From: Justen, Jordan L
Sent: Monday, June 01, 2015 12:32 AM
To: edk2-devel@lists.sourceforge.net
Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D; Laszlo Ersek
Subject: [PATCH v3 4/8] BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni files

Supplementary Plane characters can exist in UTF-16 files, but they are not
valid UCS-2 characters.

For example, refer to this python interpreter code:
>>> import codecs
>>> codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'

Therefore the UCS-4 0x00010300 character is encoded as two 16-bit numbers
(0xd800 0xdf00) in a little endian UTF-16 file.

For more information, see:
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

This means that our current BaseTools code could be allowing unsupported
UTF-16 characters to be used. To fix this, we decode the file using python's
utf-16 decode support. Then we verify that each character's code point is
0xffff or less.

v3: Based on Mike Kinney's feedback, we now read the whole file and verify
up-front that it contains valid UCS-2 characters. Thanks also to Laszlo Ersek
for pointing out the Supplementary Plane characters.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <jordan.l.jus...@intel.com>
Cc: Yingke D Liu <yingke.d....@intel.com>
Cc: Michael D Kinney <michael.d.kin...@intel.com>
Cc: Laszlo Ersek <ler...@redhat.com>
---
 BaseTools/Source/Python/AutoGen/UniClassObject.py | 84 ++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index aa54f4f..66fdbf0 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -19,6 +19,7 @@ import Common.LongFilePathOs as os, codecs, re
 import distutils.util
 import Common.EdkLogger as EdkLogger
+import StringIO
 from Common.BuildToolError import *
 from Common.String import GetLineNo
 from Common.Misc import PathClass
@@ -147,6 +148,33 @@ def GetLanguageCode(LangName, IsCompatibleMode, File):
     EdkLogger.error("Unicode File Parser", FORMAT_INVALID, "Invalid RFC 4646 language code : %s" % LangName, File)

+## Ucs2Codec
+#
+# This is only a partial codec implementation. It only supports
+# encoding, and is primarily used to check that all the characters are
+# valid for UCS-2.
+#
+class Ucs2Codec(codecs.Codec):
+    def __init__(self):
+        self.__utf16 = codecs.lookup('utf-16')
+
+    def encode(self, input, errors='strict'):
+        for Char in input:
+            if ord(Char) > 0xffff:
+                raise ValueError("Code Point too large to encode in UCS-2")
+        return self.__utf16.encode(input)
+
+TheUcs2Codec = Ucs2Codec()

This is creating a global object in this module for the UCS-2 codec. It needs
a comment describing it.
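For reference, the behavior that comment should capture is roughly the
following. This is a quick, untested sketch; the import path and a Python
build where u'\U00010300' is a single code point are assumptions on my part:

import codecs
import AutoGen.UniClassObject   # registering the 'ucs-2' codec is a side effect of this import

# Ordinary BMP text encodes just as 'utf-16' would.
codecs.encode(u'hello', 'ucs-2')

# A Supplementary Plane character is rejected up front instead of being
# silently encoded as a surrogate pair.
try:
    codecs.encode(u'\U00010300', 'ucs-2')
except ValueError:
    print 'not a valid UCS-2 character'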
+def Ucs2Search(name):
+    if name == 'ucs-2':
+        return codecs.CodecInfo(
+            name=name,
+            encode=TheUcs2Codec.encode,
+            decode=TheUcs2Codec.decode)
+    else:
+        return None
+codecs.register(Ucs2Search)

This is registering the new UCS-2 codec with the codecs module when this
module is imported. It also needs a comment describing it.

+
 ## StringDefClassObject
 #
 # A structure for language definition
 #
@@ -209,7 +237,7 @@ class UniFileClassObject(object):
         Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
         if len(Lang) != 3:
             try:
-                FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
+                FileIn = self.OpenUniFile(LongFilePath(File.Path))
             except UnicodeError, X:
                 EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
             except:
@@ -253,6 +281,58 @@ class UniFileClassObject(object):
            self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
         return True

+    def OpenUniFile(self, FileName):
+        #
+        # Read file
+        #
+        try:
+            UniFile = open(FileName, mode='rb')
+            FileIn = UniFile.read()
+            UniFile.close()
+        except:
+            EdkLogger.Error("build", FILE_OPEN_FAILURE, ExtraData=File)
+
+        #
+        # We currently only support UTF-16
+        #
+        Encoding = 'utf-16'

This comment is confusing, because the patch allows UNI files to be in
different encodings. Is this really referring to the encoding used internally
to manage the content read from an external file?

+
+        self.VerifyUcs2Data(FileIn, FileName, Encoding)
+
+        UniFile = StringIO.StringIO(FileIn)
+        Info = codecs.lookup(Encoding)
+        (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+        return codecs.StreamReaderWriter(UniFile, Reader, Writer)
+
+    def VerifyUcs2Data(self, FileIn, FileName, Encoding):
+        Ucs2Info = codecs.lookup('ucs-2')
+        #
+        # Convert to unicode
+        #
+        try:
+            FileDecoded = codecs.decode(FileIn, Encoding)
+            Ucs2Info.encode(FileDecoded)
+        except:
+            UniFile = StringIO.StringIO(FileIn)
+            Info = codecs.lookup(Encoding)
+            (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+            File = codecs.StreamReaderWriter(UniFile, Reader, Writer)
+            LineNumber = 0
+            ErrMsg = lambda Encoding, LineNumber: \
+                     '%s contains invalid %s characters on line %d.' % \
+                     (FileName, Encoding, LineNumber)
+            while True:
+                LineNumber = LineNumber + 1
+                try:
+                    Line = File.readline()
+                    if Line == '':
+                        EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+                                        ErrMsg(Encoding, LineNumber))
+                    Ucs2Info.encode(Line)
+                except:
+                    EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+                                    ErrMsg('UCS-2', LineNumber))
+
     #
     # Get String name and value
     #
@@ -305,7 +385,7 @@ class UniFileClassObject(object):
             EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
         try:
-            FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
+            FileIn = self.OpenUniFile(LongFilePath(File.Path))
         except UnicodeError, X:
             EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
         except:
-- 
2.1.4
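P.S. Here is the rough sketch of the conversion helper I mentioned above. The
function name, the command line options, and where it would live under
BaseTools are all placeholders to make the idea concrete, and it glosses over
details such as BOM handling, endianness, and error reporting:

import argparse
import os

def ConvertUniFiles(Root, FromEncoding, ToEncoding):
    # Walk a workspace or package directory and rewrite every .uni file
    # from FromEncoding to ToEncoding.
    for DirPath, Dirs, FileNames in os.walk(Root):
        for Name in FileNames:
            if not Name.lower().endswith('.uni'):
                continue
            Path = os.path.join(DirPath, Name)
            UniFile = open(Path, 'rb')
            Data = UniFile.read()
            UniFile.close()
            Text = Data.decode(FromEncoding)
            UniFile = open(Path, 'wb')
            UniFile.write(Text.encode(ToEncoding))
            UniFile.close()
            print 'Converted %s' % Path

if __name__ == '__main__':
    Parser = argparse.ArgumentParser(
        description='Convert the encoding of UNI files under a directory')
    Parser.add_argument('root', help='workspace or package directory to scan')
    Parser.add_argument('from_encoding', help="e.g. 'utf-8'")
    Parser.add_argument('to_encoding', help="e.g. 'utf-16' (writes a BOM)")
    Args = Parser.parse_args()
    ConvertUniFiles(Args.root, Args.from_encoding, Args.to_encoding)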