On 05/04/2015 02:11 PM, Laszlo Ersek wrote:
> (I couldn't find a better point to insert my two cents in the thread, so
> I'll just follow up here.)
>
> On 05/04/15 20:06, Jordan Justen wrote:
>> On 2015-05-04 10:48:05, Andrew Fish wrote:
>>> On May 4, 2015, at 10:33 AM, Jordan Justen <jordan.l.jus...@intel.com> wrote:
>>> On 2015-05-04 08:57:29, Kinney, Michael D wrote:
>>>
>>> Jordan,
>>>
>>> Some source control systems provide support for a file type of
>>> UTF-16LE, so the use of 'binary' should be avoided. What source
>>> control systems require the use of 'binary'?
>>>
>>> Svn seems to require it so the data doesn't get munged.
>>>
>>> Git seems to auto-detect UTF-16 .uni files as binary.
>
> UTF-16 files are *by definition* binary, because they are byte order
> dependent. UTF-8 in comparison is byte order agnostic.
>
>>> What diff utilities are having issues with UTF-16LE files? Can you
>>> provide some examples?
>>>
>>> I don't think it is possible to create a .uni patch file for EDK II
>>> that actually works.
>>>
>>> Normally, for .uni files we just see something about the diff being
>>> omitted on a binary file.
>>>
>>> With git, I know there are ways to force a diff to be produced, but I
>>> don't believe it would actually work to apply a patch to a UTF-16
>>> file.
>>>
>>> This stackoverflow article seems to imply you can make git-merge work
>>> with a .gitattributes file setting?
>>> http://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
>
> I had posted a more or less "complete" solution for working with UNI
> files under git, in Feb 2014, and Brian J. Johnson added a bunch of
> valuable info to it (see all messages in the thread):
>
> http://thread.gmane.org/gmane.comp.bios.tianocore.devel/6351
>
> This approach solves diffing (+ merging, rebasing etc.) patches for UNI
> files.
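(For anyone who wants to try this without digging through those links:
the usual trick, and what the stackoverflow article above describes, is
a textconv diff driver. A minimal sketch, assuming iconv(1) is
available; the "utf16" driver name is only an example:

    # .gitattributes, at the top of the tree
    *.uni diff=utf16

    # .git/config or ~/.gitconfig
    [diff "utf16"]
        textconv = iconv -f utf-16 -t utf-8

Note that textconv only changes what git-diff and git-log display; as
Laszlo explains below, git-format-patch still emits binary deltas.)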
> What it doesn't solve, unfortunately, are the following issues:

FWIW, I vote for adding UTF-8 input file support. It should make
working with string files on Linux a lot simpler.

> (a) UTF-8 is a standard *external* text representation, while UTF-16 /
> UCS-2 is at best a standard *internal* text representation (ie. maybe
> for objects living inside C programs).
>
> On most Linux computers, locales will have been set up for handling
> UTF-8, allowing terminal emulators and various text-oriented utilities
> to work with (incl. "display") UTF-8 files out of the box.
>
> (In fact I had to write a small script called "uni-grep" because
> otherwise I can't git-grep the UNI files of the edk2 tree for patterns.)
>
> Additionally, UTF-16 capable editors are presumably a rare species in
> comparison to UTF-8 capable ones. If we consider that the edk2 tree
> mostly contains English translations (which fit in ASCII), then
> choosing UTF-16 excludes "old" (ie. 8-bit-only) editors for no good
> practical reason.
>
> IMO, choosing UTF-16LE for external text encoding is an inconvenient
> Windows-ism that breaks standard (=POSIX) text utilities, and
> non-standard yet widespread tools, in most Linux distros.

All good points. I agree.

> (b) The git hacks mentioned thus far do not cover git-format-patch. Git
> will always think that UTF-16LE encoded files are binary (because they
> are), and therefore it will format patches for them as binary deltas
> (encoded with base64 or something similar). The hacks referenced above
> exist on a higher (more abstract) level only.
>
> Binary delta patches are unreviewable on a mailing list.
>
> For patches that touch UTF-8 text files, git-format-patch generates
> plaintext emails, with correct headers such as:
>
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
>> Is there a concern with supporting UTF-8?
>>
>> It seems like in general UTF-8 is better supported, and requires no
>> configuration tweaks.
>>
>> I think the situation is that UTF-8 has become the most commonly used
>> format, and therefore it is the most likely format to work well with
>> tools.
>>
>> For EDK II's needs, I can't see a downside to supporting UTF-8, and it
>> did not require a huge amount of effort.
>
> Seconded.
>
> I'd support the idea of going UTF-8-only. The files can be converted all
> at once (it would take a 10-20 line shell script approximately, and
> conversion errors could be caught), in one big commit, or else we could
> move forward gradually (same as with the nasm conversion).

No opinion on whether the open source files should be converted at
once, or over time, or at all. But closed-source vendors have their own
UTF-16 files, so we shouldn't remove support for UTF-16.
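That 10-20 line estimate sounds right. A sketch of what the one-shot
conversion could look like -- untested, and it assumes iconv(1) plus
the premise that every .uni file in the tree is valid UTF-16:

    #!/bin/sh
    # Convert every .uni file in the tree from UTF-16 to UTF-8, in place.
    # iconv exits non-zero on malformed input, so a bad file stops the
    # run instead of slipping through silently.
    # (BOM handling varies between iconv implementations; check the output.)
    find . -name '*.uni' -print | while IFS= read -r f; do
        iconv -f utf-16 -t utf-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f" ||
            { echo "conversion failed: $f" >&2; exit 1; }
    done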
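And since "uni-grep" came up above: for anyone who wants the same
convenience, a minimal stand-in is only a few lines. (A sketch, assuming
GNU grep for the --label option; Laszlo's actual script may well be
smarter than this.)

    #!/bin/sh
    # uni-grep PATTERN FILE...
    # grep UTF-16 files by converting them to UTF-8 on the fly.
    pattern=$1; shift
    for f in "$@"; do
        iconv -f utf-16 -t utf-8 "$f" | grep -H --label="$f" -e "$pattern"
    done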
> I do think such files should be distinguished with a separate filename
> suffix.

Yes. Otherwise developers will get confused why some ".uni" files work
with their tools, and some do not.

Nice work, Jordan!
-- 
Brian J. Johnson
--------------------------------------------------------------------
My statements are my own, are not authorized by SGI, and do not
necessarily represent SGI's positions.