Jordan,

My main point about the spec is that both the code and the spec need to be 
updated if this concept is accepted, not that one has to be done before the other.

Mike

-----Original Message-----
From: Justen, Jordan L 
Sent: Tuesday, May 05, 2015 11:52 AM
To: Kinney, Michael D; edk2-devel@lists.sourceforge.net
Cc: Liu, Yingke D
Subject: RE: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files

On 2015-05-05 11:32:57, Kinney, Michael D wrote:
> Jordan,
> 
> Here is the link to the EDK II spec that states that UNI files must
> be UTF-16LE:
> 
> http://sourceforge.net/projects/edk2/files/Specifications/UNI_File_Spec_v1_2_Errata_A.pdf/download

I don't know where the source code is for this document, so I don't
see how I can update it.

I think adding this support to EDK II will still mean that BaseTools
supports the current version of the spec, since it can still handle
UTF-16 files. The last patch, which converts the file in OvmfPkg to
UTF-8, may be invalid under the spec.

Anyway, I think the more natural way to proceed is to add UTF-8
.uni support to EDK II and then to document the support in the spec.
Do you disagree?

I don't want to waste any more time on this if we can't add it to EDK
II because it is not in the spec. Likewise, I don't see how we can add
it to the spec if it is not supported by EDK II.

> I think there are good reasons to verify that the entire UNI file
> only contains UCS-2 characters. Editors are usually in a single
> Unicode mode, so it would be difficult when using an editor in an
> extended mode to know if only the string contents are UCS-2. Also,
> if we decide to add helper tools to support conversion to UTF-16LE
> to follow the current EDK II UNI specification for releases or
> compatibility with other tools, then what would that helper tool do
> with a non-UCS-2 character? It is just easier for the whole file to
> use the same character set.
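> 
> A minimal sketch of what such a whole-file check might look like (a
> hypothetical helper, not part of the patch; it flags surrogates as
> well as code points above 0xFFFF so it behaves the same on narrow and
> wide Python builds):
> 
>     def IsUcs2Text(Text):
>         # UCS-2 excludes the surrogate range U+D800..U+DFFF and
>         # everything above U+FFFF. On a narrow Python build an
>         # astral character shows up as a surrogate pair, so the
>         # surrogate test catches it there too.
>         for Char in Text:
>             CodePoint = ord(Char)
>             if CodePoint > 0xFFFF or 0xD800 <= CodePoint <= 0xDFFF:
>                 return False
>         return True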

Ok. That sounds like a good reason.

-Jordan

> -----Original Message-----
> From: Justen, Jordan L 
> Sent: Tuesday, May 05, 2015 10:53 AM
> To: Kinney, Michael D; edk2-devel@lists.sourceforge.net
> Cc: Liu, Yingke D
> Subject: RE: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files
> 
> On 2015-05-05 09:07:11, Kinney, Michael D wrote:
> > Jordan,
> > 
> > If we are going to add support for more UNI file formats, there are
> > also EDK II specifications that must be updated.
> 
> I don't know about that. I at least looked under
> BaseTools/UserManuals, and didn't find anything obvious to update.
> 
> I'm not sure who owns updating other specs, but they can probably make
> the changes to the spec when new features appear in a code base. (The
> old UTF-16 support should still work as spec'd.) Wouldn't this be more
> of a concern for future UDK releases?
> 
> > I am not sure I agree with only checking that the string value has
> > supported Unicode characters. If the Name or Language elements have
> > unsupported Unicode characters, then that will cause problems too. I
> > think I would prefer the entire file, including all comment
> > lines/blocks, to contain only the supported Unicode characters.
> 
> If you are referring to comments in the UTF-8 version of the file, why
> constrain them? Obviously there are technical reasons for constraining
> string values, but the same need not apply to comments.
> 
> > I like the addition of OpenUniFile() so we have one place to update
> 
> ... one place, except apparently UPT has duplicated most of this
> code. :)
> 
> -Jordan
> 
> > if we decide to add support for more file formats. Please look at
> > the code fragments below that I think can simplify the logic and
> > make it more readable/maintainable by taking advantage of more of
> > the codecs module functions and constants. These code fragments are
> > not based on the current trunk, so there may be some unexpected
> > differences.
> > 
> > The codecs module has some constants that can improve the
> > readability of this logic. The following code fragment detects the
> > BOM at the beginning of the file and determines the encoding.
> > 
> >             #
> >             # Read file
> >             #
> >             try:
> >                 FileIn = open (LongFilePath(File.Path), mode='rb').read()
> >             except:
> >                 EdkLogger.error("build", FILE_OPEN_FAILURE, ExtraData=File)
> > 
> >             #
> >             # Detect Byte Order Mark at beginning of file.  Default to UTF-8
> >             #
> >             Encoding = 'utf-8'
> >             # Note: slicing removes just the BOM prefix; str.lstrip()
> >             # would treat the BOM bytes as a character set and could
> >             # also strip leading content bytes.
> >             if FileIn.startswith (codecs.BOM_UTF16_BE):
> >                 Encoding = 'utf-16be'
> >                 FileIn = FileIn[len(codecs.BOM_UTF16_BE):]
> >             elif FileIn.startswith (codecs.BOM_UTF16_LE):
> >                 Encoding = 'utf-16le'
> >                 FileIn = FileIn[len(codecs.BOM_UTF16_LE):]
> >             elif FileIn.startswith (codecs.BOM_UTF8):
> >                 Encoding = 'utf-8-sig'
> >                 FileIn = FileIn[len(codecs.BOM_UTF8):]
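> > 
> > For reference, the BOM constants used above have well-known values;
> > a quick, illustrative sanity check:
> > 
> >             # Illustrative check of the constants the fragment
> >             # above relies on (byte strings on Python 2):
> >             import codecs
> >             assert codecs.BOM_UTF16_LE == b'\xff\xfe'
> >             assert codecs.BOM_UTF16_BE == b'\xfe\xff'
> >             assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'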
> > 
> > The following code fragment uses the codecs module and the encoding 
> > detected above to verify that all the characters in a UNI file are legal 
> > UCS-2 characters.  If an invalid character is detected, then additional 
> > logic is run in an except clause to determine the line number of the 
> > invalid character.  
> > 
> >             #
> >             # Convert to unicode
> >             #
> >             try:
> >                 FileIn = codecs.decode (FileIn, Encoding)
> >                 Verify = codecs.encode (FileIn, 'utf-16')
> >                 Verify = codecs.decode (Verify, 'utf-16')
> >             except:
> >                 FileIn = codecs.open (LongFilePath(File.Path), encoding=Encoding, mode='r')
> >                 LineNumber = 0
> >                 while True:
> >                     LineNumber = LineNumber + 1
> >                     try:
> >                         Line = FileIn.readline()
> >                         if Line == '':
> >                             EdkLogger.error('Unicode File Parser', PARSER_ERROR, '%s contains invalid UCS-2 characters.' % (File.Path))
> >                         Line = codecs.encode (Line, 'utf-16')
> >                         Line = codecs.decode (Line, 'utf-16')
> >                     except:
> >                         EdkLogger.error('Unicode File Parser', PARSER_ERROR, '%s contains invalid UCS-2 character at line %d.' % (File.Path, LineNumber))
> > 
> > Best regards,
> > 
> > Mike
> >         
> > -----Original Message-----
> > From: Justen, Jordan L 
> > Sent: Tuesday, May 05, 2015 12:09 AM
> > To: edk2-devel@lists.sourceforge.net
> > Cc: Justen, Jordan L; Liu, Yingke D; Kinney, Michael D
> > Subject: [PATCH v2 1/7] BaseTools: Support UTF-8 string data in .uni files
> > 
> > Since UEFI only supports UTF-16LE strings internally, this simply
> > allows for another Unicode source file encoding.
> > 
> > The strings are still converted to UTF-16LE data for use in EDK II
> > source code.
> > 
> > When .uni files contain UTF-16 data, it is impossible for unicode code
> > points to be larger than 0xFFFF. To support .uni files that contain
> > UTF-8 data, we also need to deal with the possibility that the
> > UTF-8 file contains unicode code points larger than 16 bits. Since
> > UEFI only supports 16-bit string data, we make UniClassObject generate
> > an error if a larger code point is seen in a UTF-8 string value.
> > 
> > We only check string value data, so it is possible to use larger code
> > points in comments.
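> > 
> > For illustration (a hypothetical .uni snippet, not part of this
> > series), the rule works out like this:
> > 
> >   // A comment may contain 😀 (U+1F600) without triggering an error.
> >   #string STR_EXAMPLE  #language en-US  "ok"
> >   // The same character inside the quoted string value would be
> >   // rejected by the new check.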
> > 
> > v2:
> >  * Drop .utf8 extension. Use .uni file for UTF-8 data (mdkinney)
> >  * Merge in 'BaseTools/UniClassObject: Verify string data is 16-bit'
> >    commit
> > 
> > Cc: Yingke D Liu <yingke.d....@intel.com>
> > Cc: Michael D Kinney <michael.d.kin...@intel.com>
> > Contributed-under: TianoCore Contribution Agreement 1.0
> > Signed-off-by: Jordan Justen <jordan.l.jus...@intel.com>
> > ---
> >  BaseTools/Source/Python/AutoGen/UniClassObject.py | 38 +++++++++++++++++++++--
> >  1 file changed, 36 insertions(+), 2 deletions(-)
> > 
> > diff --git a/BaseTools/Source/Python/AutoGen/UniClassObject.py b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > index aa54f4f..41448ab 100644
> > --- a/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > +++ b/BaseTools/Source/Python/AutoGen/UniClassObject.py
> > @@ -209,7 +209,7 @@ class UniFileClassObject(object):
> >          Lang = distutils.util.split_quoted((Line.split(u"//")[0]))
> >          if len(Lang) != 3:
> >              try:
> > -                FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16').read()
> > +                FileIn = self.OpenUniFile(LongFilePath(File.Path))
> >              except UnicodeError, X:
> >                  EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File);
> >              except:
> > @@ -253,6 +253,38 @@ class UniFileClassObject(object):
> >                      self.OrderedStringDict[LangName][Item.StringName] = len(self.OrderedStringList[LangName]) - 1
> >          return True
> >  
> > +    def OpenUniFile(self, FileName):
> > +        Encoding = 'utf-8'
> > +        UniFile = open(FileName, 'rb')
> > +
> > +        #
> > +        # Seek to end of file to determine its size
> > +        #
> > +        UniFile.seek(0, 2)
> > +        FileSize = UniFile.tell()
> > +
> > +        if FileSize >= 2:
> > +            #
> > +            # Seek to start of the file to read the UTF-16 BOM
> > +            #
> > +            UniFile.seek(0, 0)
> > +            Bom = UniFile.read(2)
> > +            UniFile.seek(0, 0)
> > +
> > +            if Bom == '\xff\xfe':
> > +                Encoding = 'utf-16'
> > +
> > +        Info = codecs.lookup(Encoding)
> > +        return codecs.StreamReaderWriter(UniFile, Info.streamreader, Info.streamwriter)
> > +
> > +    def Verify16bitCodePoints(self, String):
> > +        for cp in String:
> > +            if ord(cp) > 0xffff:
> > +                tmpl = 'The string {} defined in file {} ' + \
> > +                       'contains a character with a code point above 0xFFFF.'
> > +                error = tmpl.format(repr(String), self.File)
> > +                EdkLogger.error('Unicode File Parser', FORMAT_INVALID, error)
> > +
> >      #
> >      # Get String name and value
> >      #
> > @@ -274,6 +306,7 @@ class UniFileClassObject(object):
> >                  Language = LanguageList[IndexI].split()[0]
> >                  Value = LanguageList[IndexI][LanguageList[IndexI].find(u'\"') + len(u'\"') : LanguageList[IndexI].rfind(u'\"')] #.replace(u'\r\n', u'')
> >                  Language = GetLanguageCode(Language, self.IsCompatibleMode, self.File)
> > +                self.Verify16bitCodePoints(Value)
> >                  self.AddStringToList(Name, Language, Value)
> >  
> >      #
> > @@ -305,7 +338,7 @@ class UniFileClassObject(object):
> >              EdkLogger.error("Unicode File Parser", FILE_NOT_FOUND, ExtraData=File.Path)
> >  
> >          try:
> > -            FileIn = codecs.open(LongFilePath(File.Path), mode='rb', encoding='utf-16')
> > +            FileIn = self.OpenUniFile(LongFilePath(File.Path))
> >          except UnicodeError, X:
> >              EdkLogger.error("build", FILE_READ_FAILURE, "File read failure: %s" % str(X), ExtraData=File.Path);
> >          except:
> > @@ -426,6 +459,7 @@ class UniFileClassObject(object):
> >                      MatchString = re.match('[A-Z0-9_]+', Name, re.UNICODE)
> >                      if MatchString == None or MatchString.end(0) != len(Name):
> >                          EdkLogger.error('Unicode File Parser', FORMAT_INVALID, 'The string token name %s defined in UNI file %s contains the invalid lower case character.' %(Name, self.File))
> > +                self.Verify16bitCodePoints(Value)
> >                  self.AddStringToList(Name, Language, Value)
> >                  continue
> > 
> > -- 
> > 2.1.4
> > 