On 05/04/2015 02:11 PM, Laszlo Ersek wrote:
> (I couldn't find a better point to insert my two cents in the thread, so
> I'll just follow up here.)
>
> On 05/04/15 20:06, Jordan Justen wrote:
>> On 2015-05-04 10:48:05, Andrew Fish wrote:
>>> On May 4, 2015, at 10:33 AM, Jordan Justen <jordan.l.jus...@intel.com> wrote:
>>> On 2015-05-04 08:57:29, Kinney, Michael D wrote:
>>>
>>> Jordan,
>>>
>>> Some source control systems provide support for a file type of
>>> UTF-16LE, so the use of 'binary' should be avoided. What source
>>> control systems require the use of 'binary'?
>>>
>>> Svn seems to require it so the data doesn't get munged.
>>>
>>> Git seems to auto-detect UTF-16 .uni files as binary.
>
> UTF-16 files are *by definition* binary, because they are byte order
> dependent. UTF-8 in comparison is byte order agnostic.
>
>>> What diff utilities are having issues with UTF-16LE files? Can you
>>> provide some examples?
>>>
>>> I don't think it is possible to create a .uni patch file for EDK II
>>> that actually works.
>>>
>>> Normally, for .uni files we just see something about the diff being
>>> omitted on a binary file.
>>>
>>> With git, I know there are ways to force a diff to be produced, but I
>>> don't believe it would actually work to apply a patch to a UTF-16
>>> file.
>>>
>>> This stackoverflow article seems to imply you can make git-merge work
>>> with a .gitattributes file setting?
>>> http://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
>
> I had posted a more or less "complete" solution for working with UNI
> files under git, in Feb 2014, and Brian J. Johnson added a bunch of
> valuable info to it (see all messages in the thread):
>
> http://thread.gmane.org/gmane.comp.bios.tianocore.devel/6351
>
> This approach solves diffing (+ merging, rebasing etc.) patches for UNI
> files.
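(For anyone who wants to try this without digging through those links:
the usual trick, and what the stackoverflow article above describes, is
a textconv diff driver. A minimal sketch, assuming iconv(1) is
available; the "utf16" driver name is only an example:

    # .gitattributes, at the top of the tree
    *.uni diff=utf16

    # .git/config or ~/.gitconfig
    [diff "utf16"]
        textconv = iconv -f utf-16 -t utf-8

Note that textconv only changes what git-diff and git-log display; as
Laszlo explains below, git-format-patch still emits binary deltas.)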
> What it doesn't solve, unfortunately, are the following issues:

FWIW, I vote for adding UTF-8 input file support. It should make
working with string files on Linux a lot simpler.

> (a) UTF-8 is a standard *external* text representation, while UTF-16 /
> UCS-2 is at best a standard *internal* text representation (ie. maybe
> for objects living inside C programs).
>
> On most Linux computers, locales will have been set up for handling
> UTF-8, allowing terminal emulators and various text-oriented utilities
> to work with (incl. "display") UTF-8 files out of the box.
>
> (In fact I had to write a small script called "uni-grep" because
> otherwise I can't git-grep the UNI files of the edk2 tree for patterns.)
>
> Additionally, UTF-16 capable editors are presumably a rare species in
> comparison to UTF-8 capable ones. If we consider that the edk2 tree
> mostly contains English translations (which fit in ASCII), then
> choosing UTF-16 excludes "old" (ie. 8-bit-only) editors for no good
> practical reason.
>
> IMO, choosing UTF-16LE for external text encoding is an inconvenient
> Windows-ism that breaks standard (=POSIX) text utilities, and
> non-standard yet widespread tools, in most Linux distros.

All good points. I agree.

> (b) The git hacks mentioned thus far do not cover git-format-patch. Git
> will always think that UTF-16LE encoded files are binary (because they
> are), and therefore it will format patches for them as binary deltas
> (encoded with base64 or something similar). The hacks referenced above
> exist on a higher (more abstract) level only.
>
> Binary delta patches are unreviewable on a mailing list.
>
> For patches that touch UTF-8 text files, git-format-patch generates
> plaintext emails, with correct headers such as:
>
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
>> Is there a concern with supporting UTF-8?
>>
>> It seems like in general UTF-8 is better supported, and requires no
>> configuration tweaks.
>>
>> I think the situation is that UTF-8 has become the most commonly used
>> format, and therefore it is the most likely format to work well with
>> tools.
>>
>> For EDK II's needs, I can't see a downside to supporting UTF-8, and it
>> did not require a huge amount of effort.
>
> Seconded.
>
> I'd support the idea of going UTF-8-only. The files can be converted all
> at once (it would take a 10-20 line shell script approximately, and
> conversion errors could be caught), in one big commit, or else we could
> move forward gradually (same as with the nasm conversion).

No opinion on whether the open source files should be converted at
once, or over time, or at all. But closed-source vendors have their own
UTF-16 files, so we shouldn't remove support for UTF-16.
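That 10-20 line estimate sounds right. A sketch of what the one-shot
conversion could look like -- untested, and it assumes iconv(1) plus
the premise that every .uni file in the tree is valid UTF-16:

    #!/bin/sh
    # Convert every .uni file in the tree from UTF-16 to UTF-8, in place.
    # iconv exits non-zero on malformed input, so a bad file stops the
    # run instead of slipping through silently.
    # (BOM handling varies between iconv implementations; check the output.)
    find . -name '*.uni' -print | while IFS= read -r f; do
        iconv -f utf-16 -t utf-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f" ||
            { echo "conversion failed: $f" >&2; exit 1; }
    done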
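And since "uni-grep" came up above: for anyone who wants the same
convenience, a minimal stand-in is only a few lines. (A sketch, assuming
GNU grep for the --label option; Laszlo's actual script may well be
smarter than this.)

    #!/bin/sh
    # uni-grep PATTERN FILE...
    # grep UTF-16 files by converting them to UTF-8 on the fly.
    pattern=$1; shift
    for f in "$@"; do
        iconv -f utf-16 -t utf-8 "$f" | grep -H --label="$f" -e "$pattern"
    done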
> I do think such files should be distinguished with a separate filename
> suffix.

Yes. Otherwise developers will get confused why some ".uni" files work
with their tools, and some do not.

Nice work, Jordan!
-- 
Brian J. Johnson
--------------------------------------------------------------------
My statements are my own, are not authorized by SGI, and do not
necessarily represent SGI's positions.