Jordan,

The functionality of the patch set looks good. I just have a few notes inline
below about places that need better code comments.
Reviewed-by: Michael Kinney <michael.d.kin...@intel.com>

We also discussed the need for an extra tool that can scan a workspace or a
package for UNI files and convert them to UTF-16LE. This tool may be required
when releases are generated, to guarantee compatibility with other tools that
operate on UNI files and require them to be UTF-16LE. Do you think this
helper tool should be implemented in BaseTools? We may want the same helper
tool to convert between any of the supported encodings, so it can also be
used to convert current UTF-16LE files to UTF-8. (A rough sketch of what I
have in mind is at the end of this mail.)

Thanks,

Mike

-----Original Message-----
From: Justen, Jordan L
Sent: Monday, June 01, 2015 12:32 AM
To: edk2-devel@lists.sourceforge.net
Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D; Laszlo Ersek
Subject: [PATCH v3 4/8] BaseTools/UniClassObject: Verify valid UCS-2 chars in UTF-16 .uni files

Supplementary Plane characters can exist in UTF-16 files, but they are not
valid UCS-2 characters.

For example, refer to this python interpreter code:
>>> import codecs
>>> codecs.encode(u'\U00010300', 'utf-16')
'\xff\xfe\x00\xd8\x00\xdf'

Therefore the UCS-4 0x00010300 character is encoded as two 16-bit numbers
(0xd800 0xdf00) in a little endian UTF-16 file.

For more information, see:
http://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

This means that our current BaseTools code could be allowing unsupported
UTF-16 characters to be used. To fix this, we decode the file using python's
utf-16 decode support. Then we verify that each character's code point is
0xffff or less.

v3: Based on Mike Kinney's feedback, we now read the whole file and verify
up-front that it contains valid UCS-2 characters. Thanks also to Laszlo Ersek
for pointing out the Supplementary Plane characters.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Jordan Justen <jordan.l.jus...@intel.com>
Cc: Yingke D Liu <yingke.d....@intel.com>
Cc: Michael D Kinney <michael.d.kin...@intel.com>
Cc: Laszlo Ersek <ler...@redhat.com>
---
 BaseTools/Source/Python/AutoGen/UniClassObject.py | 84 ++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
index aa54f4f..66fdbf0 100644
--- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
+++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
@@ -19,6 +19,7 @@ import Common.LongFilePathOs as os, codecs, re
 import distutils.util
 import Common.EdkLogger as EdkLogger
+import StringIO
 from Common.BuildToolError import *
 from Common.String import GetLineNo
 from Common.Misc import PathClass
@@ -147,6 +148,33 @@ def GetLanguageCode(LangName, IsCompatibleMode, File):
     EdkLogger.error("Unicode File Parser", FORMAT_INVALID, "Invalid RFC 4646 language code : %s" % LangName, File)

+## Ucs2Codec
+#
+# This is only a partial codec implementation. It only supports
+# encoding, and is primarily used to check that all the characters are
+# valid for UCS-2.
+#
+class Ucs2Codec(codecs.Codec):
+    def __init__(self):
+        self.__utf16 = codecs.lookup('utf-16')
+
+    def encode(self, input, errors='strict'):
+        for Char in input:
+            if ord(Char) > 0xffff:
+                raise ValueError("Code Point too large to encode in UCS-2")
+        return self.__utf16.encode(input)
+
+TheUcs2Codec = Ucs2Codec()

This is creating a global object in this module for the UCS-2 codec. It needs
a comment describing it.
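For reference, the behavior that comment should capture is roughly the
following. This is a quick, untested sketch; the import path and a Python
build where u'\U00010300' is a single code point are assumptions on my part:

import codecs
import AutoGen.UniClassObject   # registering the 'ucs-2' codec is a side effect of this import

# Ordinary BMP text encodes just as 'utf-16' would.
codecs.encode(u'hello', 'ucs-2')

# A Supplementary Plane character is rejected up front instead of being
# silently encoded as a surrogate pair.
try:
    codecs.encode(u'\U00010300', 'ucs-2')
except ValueError:
    print 'not a valid UCS-2 character'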
+def Ucs2Search(name):
+    if name == 'ucs-2':
+        return codecs.CodecInfo(
+            name=name,
+            encode=TheUcs2Codec.encode,
+            decode=TheUcs2Codec.decode)
+    else:
+        return None
+codecs.register(Ucs2Search)

This is registering the new UCS-2 codec with the codecs module when this
module is imported. It also needs a comment describing it.

+
 ## StringDefClassObject
 #
 # A structure for language definition
 #
@@ -209,7 +237,7 @@ class UniFileClassObject(object):
         Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
         if len(Lang) != 3:
             try:
-                FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
+                FileIn = self.OpenUniFile(LongFilePath(File.Path))
             except UnicodeError, X:
                 EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
             except:
@@ -253,6 +281,58 @@ class UniFileClassObject(object):
            self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
         return True

+    def OpenUniFile(self, FileName):
+        #
+        # Read file
+        #
+        try:
+            UniFile = open(FileName, mode='rb')
+            FileIn = UniFile.read()
+            UniFile.close()
+        except:
+            EdkLogger.Error("build", FILE_OPEN_FAILURE, ExtraData=File)
+
+        #
+        # We currently only support UTF-16
+        #
+        Encoding = 'utf-16'

This comment is confusing, because the patch allows UNI files to be in
different encodings. Is this really referring to the encoding used internally
to manage the content read from an external file?

+
+        self.VerifyUcs2Data(FileIn, FileName, Encoding)
+
+        UniFile = StringIO.StringIO(FileIn)
+        Info = codecs.lookup(Encoding)
+        (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+        return codecs.StreamReaderWriter(UniFile, Reader, Writer)
+
+    def VerifyUcs2Data(self, FileIn, FileName, Encoding):
+        Ucs2Info = codecs.lookup('ucs-2')
+        #
+        # Convert to unicode
+        #
+        try:
+            FileDecoded = codecs.decode(FileIn, Encoding)
+            Ucs2Info.encode(FileDecoded)
+        except:
+            UniFile = StringIO.StringIO(FileIn)
+            Info = codecs.lookup(Encoding)
+            (Reader, Writer) = (Info.streamreader, Info.streamwriter)
+            File = codecs.StreamReaderWriter(UniFile, Reader, Writer)
+            LineNumber = 0
+            ErrMsg = lambda Encoding, LineNumber: \
+                     '%s contains invalid %s characters on line %d.' % \
+                     (FileName, Encoding, LineNumber)
+            while True:
+                LineNumber = LineNumber + 1
+                try:
+                    Line = File.readline()
+                    if Line == '':
+                        EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+                                        ErrMsg(Encoding, LineNumber))
+                    Ucs2Info.encode(Line)
+                except:
+                    EdkLogger.error('Unicode File Parser', PARSER_ERROR,
+                                    ErrMsg('UCS-2', LineNumber))
+
     #
     # Get String name and value
     #
@@ -305,7 +385,7 @@ class UniFileClassObject(object):
             EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
         try:
-            FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
+            FileIn = self.OpenUniFile(LongFilePath(File.Path))
         except UnicodeError, X:
             EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
         except:
-- 
2.1.4
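P.S. Here is the rough sketch of the conversion helper I mentioned above. The
function name, the command line options, and where it would live under
BaseTools are all placeholders to make the idea concrete, and it glosses over
details such as BOM handling, endianness, and error reporting:

import argparse
import os

def ConvertUniFiles(Root, FromEncoding, ToEncoding):
    # Walk a workspace or package directory and rewrite every .uni file
    # from FromEncoding to ToEncoding.
    for DirPath, Dirs, FileNames in os.walk(Root):
        for Name in FileNames:
            if not Name.lower().endswith('.uni'):
                continue
            Path = os.path.join(DirPath, Name)
            UniFile = open(Path, 'rb')
            Data = UniFile.read()
            UniFile.close()
            Text = Data.decode(FromEncoding)
            UniFile = open(Path, 'wb')
            UniFile.write(Text.encode(ToEncoding))
            UniFile.close()
            print 'Converted %s' % Path

if __name__ == '__main__':
    Parser = argparse.ArgumentParser(
        description='Convert the encoding of UNI files under a directory')
    Parser.add_argument('root', help='workspace or package directory to scan')
    Parser.add_argument('from_encoding', help="e.g. 'utf-8'")
    Parser.add_argument('to_encoding', help="e.g. 'utf-16' (writes a BOM)")
    Args = Parser.parse_args()
    ConvertUniFiles(Args.root, Args.from_encoding, Args.to_encoding)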