Jordan,

Here is the link to the EDK II spec that states that UNI files must be UTF-16LE:
http://sourceforge.net/projects/edk2/files/Specifications/UNI_File_Spec_v1_2_Errata_A.pdf/download

I think there are good reasons to verify that the entire UNI file contains
only UCS-2 characters. Editors are usually in a single Unicode mode, so it
would be difficult, when using an editor in an extended mode, to know
whether only the string contents are UCS-2. Also, if we decide to add
helper tools that convert to UTF-16LE to follow the current EDK II UNI
specification, for releases or for compatibility with other tools, then
what would that helper tool do with a non-UCS-2 character? It is just
easier for the whole file to use the same character set.

Best regards,

Mike

-----Original Message-----
From: Justen, Jordan L
Sent: Tuesday, May 05, 2015 10:53 AM
To: Kinney, Michael D; edk2-devel@lists.sourceforge.net
Cc: Liu, Yingke D
Subject: RE: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files

On 2015-05-05 09:07:11, Kinney, Michael D wrote:
> Jordan,
>
> If we are going to add support for more UNI file formats, there are
> also EDK II specifications that must be updated.

I don't know about that. I at least looked under BaseTools/UserManuals,
and didn't find anything obvious to update.

I'm not sure who owns updating other specs, but they can probably make
the changes to the spec when new features appear in the code base. (The
old UTF-16 support should still work as spec'd.)

Wouldn't this be more of a concern for future UDK releases?

> I am not sure I agree with only checking that the string value has
> supported Unicode characters. If the Name or Language elements have
> unsupported Unicode characters, then that will cause problems too. I
> think I would prefer the entire file, including all comment
> lines/blocks, to only contain the supported Unicode characters.

If you are referring to comments in the UTF-8 version of the file, why
constrain them? Obviously there are technical reasons for constraining
string values, but the same need not apply to comments.
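For readers following along, Mike's whole-file argument boils down to a single
check. A minimal sketch in modern Python 3 (the helper name is hypothetical,
and this is not the BaseTools code, which is Python 2):

```python
def is_strict_ucs2(data: bytes, encoding: str = "utf-16-le") -> bool:
    """Return True if the file decodes cleanly under the given encoding
    and every resulting code point fits in UCS-2 (no supplementary-plane
    characters, which would need UTF-16 surrogate pairs)."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return all(ord(ch) <= 0xFFFF for ch in text)

# U+00E9 fits in UCS-2; U+1F600 needs a surrogate pair and does not.
ok = is_strict_ucs2("abc \u00e9".encode("utf-16-le"))
bad = is_strict_ucs2("abc \U0001F600".encode("utf-16-le"))
```

Applied to the whole file, this catches stray characters in names, languages,
and comments alike, which is exactly the "one character set for the entire
file" position above.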
> I like the addition of OpenUniFile() so we have one place to update

... open place, except apparently UPT has duplicated most of this code. :)

-Jordan

> if we decide to add support for more file formats. Please look at
> the code fragments below that I think can simplify the logic and
> make it more readable/maintainable by taking advantage of more of
> the codecs module functions and constants. These code fragments are
> not based on the current trunk, so there may be some unexpected
> differences.
>
> The codecs module has some constants that can improve the
> readability of this logic. The following code fragment detects the
> BOM marker and determines the encoding.
>
>     #
>     # Read file
>     #
>     try:
>         FileIn = open(LongFilePath(File.Path), mode='rb').read()
>     except:
>         EdkLogger.Error("build", FILE_OPEN_FAILURE, ExtraData=File)
>
>     #
>     # Detect Byte Order Mark at beginning of file. Default to UTF-8
>     #
>     Encoding = 'utf-8'
>     if FileIn.startswith(codecs.BOM_UTF16_BE):
>         Encoding = 'utf-16be'
>         FileIn = FileIn.lstrip(codecs.BOM_UTF16_BE)
>     elif FileIn.startswith(codecs.BOM_UTF16_LE):
>         Encoding = 'utf-16le'
>         FileIn = FileIn.lstrip(codecs.BOM_UTF16_LE)
>     elif FileIn.startswith(codecs.BOM_UTF8):
>         Encoding = 'utf-8-sig'
>         FileIn = FileIn.lstrip(codecs.BOM_UTF8)
>
> The following code fragment uses the codecs module and the encoding
> detected above to verify that all the characters in a UNI file are
> legal UCS-2 characters. If an invalid character is detected, then
> additional logic is run in an except clause to determine the line
> number of the invalid character.
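A standalone Python 3 rendering of that BOM-sniffing fragment, for reference
(function name is hypothetical; note it slices the BOM off rather than using
lstrip(), since bytes.lstrip() strips any run of the BOM's byte values, not
just the BOM itself):

```python
import codecs

def detect_encoding(data: bytes):
    """Pick an encoding from a leading BOM; default to UTF-8 when none."""
    for bom, encoding in ((codecs.BOM_UTF16_BE, "utf-16-be"),
                          (codecs.BOM_UTF16_LE, "utf-16-le"),
                          (codecs.BOM_UTF8, "utf-8")):
        if data.startswith(bom):
            # Slice, don't lstrip(): lstrip(b"\xff\xfe") would also eat
            # real leading 0xFF/0xFE content bytes after the BOM.
            return encoding, data[len(bom):]
    return "utf-8", data

encoding, body = detect_encoding(codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le"))
```

The three BOMs checked here have no common prefix, so the test order does not
matter for this set (it would if UTF-32 BOMs were added, since the UTF-32LE
BOM starts with the UTF-16LE BOM bytes).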
>     #
>     # Convert to unicode
>     #
>     try:
>         FileIn = codecs.decode(FileIn, Encoding)
>         Verify = codecs.encode(FileIn, 'utf-16')
>         Verify = codecs.decode(Verify, 'utf-16')
>     except:
>         FileIn = codecs.open(LongFilePath(File.Path), encoding=Encoding, mode='r')
>         LineNumber = 0
>         while True:
>             LineNumber = LineNumber + 1
>             try:
>                 Line = FileIn.readline()
>                 if Line == '':
>                     EdkLogger.error('Unicode File Parser', PARSER_ERROR,
>                         '%s contains invalid UCS-2 characters.' % (File.Path))
>                 Line = codecs.encode(Line, 'utf-16')
>                 Line = codecs.decode(Line, 'utf-16')
>             except:
>                 EdkLogger.error('Unicode File Parser', PARSER_ERROR,
>                     '%s contains invalid UCS-2 character at line %d.' % (File.Path, LineNumber))
>
> Best regards,
>
> Mike
>
> -----Original Message-----
> From: Justen, Jordan L
> Sent: Tuesday, May 05, 2015 12:09 AM
> To: edk2-devel@lists.sourceforge.net
> Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D
> Subject: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files
>
> Since UEFI only supports UTF-16LE strings internally, this simply
> allows another Unicode encoding for the source file.
>
> The strings are still converted to UTF-16LE data for use in EDK II
> source code.
>
> When .uni files contain UTF-16 data, it is impossible for Unicode code
> points to be larger than 0xFFFF. To support .uni files that contain
> UTF-8 data, we also need to deal with the possibility that the
> UTF-8 file contains Unicode code points larger than 16 bits. Since
> UEFI only supports 16-bit string data, we make UniClassObject generate
> an error if a larger code point is seen in a UTF-8 string value.
>
> We only check string value data, so it is possible to use larger code
> points in comments.
>
> v2:
> * Drop .utf8 extension.
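The two-pass idea above (cheap whole-file check, slow per-line pass only on
failure) reduces to locating the first offending line. A Python 3 sketch of
just that localization step (the function name is hypothetical, and error
reporting via EdkLogger is replaced by a return value):

```python
def find_bad_line(data: bytes, encoding: str = "utf-8"):
    """Return the 1-based number of the first line containing a code point
    above 0xFFFF, or None when the whole file is UCS-2-clean."""
    text = data.decode(encoding)
    for number, line in enumerate(text.splitlines(), start=1):
        # Code points above 0xFFFF cannot be stored in a 16-bit CHAR16
        # without surrogate pairs, so a strict UCS-2 file rejects them.
        if any(ord(ch) > 0xFFFF for ch in line):
            return number
    return None

# U+1F600 (a 4-byte UTF-8 sequence) on the second line trips the check.
bad_line = find_bad_line("#string STR_OK\n#string STR_BAD \U0001F600\n".encode("utf-8"))
```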
>   Use .uni file for UTF-8 data (mdkinney)
> * Merge in 'BaseTools/UniClassObject: Verify string data is 16-bit'
>   commit
>
> Cc: Yingke D Liu <yingke.d....@intel.com>
> Cc: Michael D Kinney <michael.d.kin...@intel.com>
> Contributed-under: TianoCore Contribution Agreement 1.0
> Signed-off-by: Jordan Justen <jordan.l.jus...@intel.com>
> ---
>  BaseTools/Source/Python/AutoGen/UniClassObject.py | 38 +++++++++++++++++++++--
>  1 file changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> index aa54f4f..41448ab 100644
> --- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
> +++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> @@ -209,7 +209,7 @@ class UniFileClassObject(object):
>          Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
>          if len(Lang) != 3:
>              try:
> -                FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
> +                FileIn = self.OpenUniFile(LongFilePath(File.Path))
>              except UnicodeError, X:
>                  EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
>              except:
> @@ -253,6 +253,38 @@ class UniFileClassObject(object):
>          self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
>          return True
>
> +    def OpenUniFile(self, FileName):
> +        Encoding = 'utf-8'
> +        UniFile = open(FileName, 'rb')
> +
> +        #
> +        # Seek to end of file to determine its size
> +        #
> +        UniFile.seek(0, 2)
> +        FileSize = UniFile.tell()
> +
> +        if FileSize >= 2:
> +            #
> +            # Seek to start of the file to read the UTF-16 BOM
> +            #
> +            UniFile.seek(0, 0)
> +            Bom = UniFile.read(2)
> +            UniFile.seek(0, 0)
> +
> +            if Bom == '\xff\xfe':
> +                Encoding = 'utf-16'
> +
> +        Info = codecs.lookup(Encoding)
> +        return codecs.StreamReaderWriter(UniFile, Info.streamreader, Info.streamwriter)
> +
> +    def Verify16bitCodePoints(self, String):
> +        for cp in String:
> +            if ord(cp) > 0xffff:
> +                tmpl = 'The string {} defined in file {} ' + \
> +                       'contains a character with a code point above 0xFFFF.'
> +                error = tmpl.format(repr(String), self.File)
> +                EdkLogger.error('Unicode File Parser', FORMAT_INVALID, error)
> +
>      #
>      # Get String name and value
>      #
> @@ -274,6 +306,7 @@ class UniFileClassObject(object):
>                  Language = LanguageList[IndexI].split()[0]
>                  Value = LanguageList[IndexI][LanguageList[IndexI].find(u'\"') + len(u'\"') : LanguageList[IndexI].rfind(u'\"')] #.replace(u'\r\n', u'')
>                  Language = GetLanguageCode(Language, self.IsCompatibleMode, self.File)
> +                self.Verify16bitCodePoints(Value)
>                  self.AddStringToList(Name, Language, Value)
>
>      #
> @@ -305,7 +338,7 @@ class UniFileClassObject(object):
>          EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
>
>      try:
> -        FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
> +        FileIn = self.OpenUniFile(LongFilePath(File.Path))
>      except UnicodeError, X:
>          EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
>      except:
> @@ -426,6 +459,7 @@ class UniFileClassObject(object):
>              MatchString = re.match('[A-Z0-9_]+', Name, re.UNICODE)
>              if MatchString == None or MatchString.end(0) != len(Name):
>                  EdkLogger.error('Unicode File Parser', FORMAT_INVALID, 'The string token name %s defined in UNI file %s contains the invalid lower case character.' % (Name, self.File))
> +            self.Verify16bitCodePoints(Value)
>              self.AddStringToList(Name, Language, Value)
>              continue
>
> --
> 2.1.4
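The Verify16bitCodePoints() check in the patch above amounts to the following
standalone Python 3 sketch (names are illustrative, and a ValueError stands in
for the real code's EdkLogger.error call):

```python
def verify_16bit_code_points(file_name: str, value: str) -> None:
    """Raise ValueError if the string value holds a code point that a
    16-bit UEFI CHAR16 string cannot represent without surrogate pairs."""
    for ch in value:
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "%s contains code point U+%04X above 0xFFFF in %r"
                % (file_name, ord(ch), value))

verify_16bit_code_points("Strings.uni", "Caf\u00e9")  # BMP-only value: accepted
try:
    verify_16bit_code_points("Strings.uni", "Hi \U0001F600")
    raised = False
except ValueError:
    raised = True
```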
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/edk2-devel