(I couldn't find a better point to insert my two cents in the thread, so I'll just follow up here.)
On 05/04/15 20:06, Jordan Justen wrote:
> On 2015-05-04 10:48:05, Andrew Fish wrote:
>> On May 4, 2015, at 10:33 AM, Jordan Justen <jordan.l.jus...@intel.com>
>> wrote:
>> On 2015-05-04 08:57:29, Kinney, Michael D wrote:
>>
>> Jordan,
>>
>> Some source control systems provide support a file type of UTF-16LE,
>> so the use of 'binary' should be avoided. What source control
>> systems require the use of 'binary'?
>>
>> Svn seems to require it so the data doesn't get munged.
>>
>> Git seems to auto-detect UTF-16 .uni files as binary.

UTF-16 files are *by definition* binary, because they are byte order
dependent. UTF-8 in comparison is byte order agnostic.

>>
>> What diff utilities are having issues with UTF-16LE files? Can you
>> provide some examples?
>>
>> I don't think it is possible to create a .uni patch file for EDK II
>> that actually works.
>>
>> Normally, for .uni files we just see something about the diff being
>> omitted on a binary file.
>>
>> With git, I know there are ways to force a diff to be produced, but I
>> don't believe it would actually work to apply a patch to a UTF-16
>> file.
>>
>> This
>> http://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text
>> stackoverflow
>> article seems to imply you can can make git-merge work with a
>> .gitattributes file setting?

I had posted a more or less "complete" solution for working with UNI
files under git, in Feb 2014, and Brian J. Johnson added a bunch of
valuable info to it (see all messages in the thread):

http://thread.gmane.org/gmane.comp.bios.tianocore.devel/6351

This approach solves diffing (+ merging, rebasing etc) patches for UNI
files. What it doesn't solve, unfortunately, are the following issues:

(a) UTF-8 is a standard *external* text representation, while UTF-16 /
UCS-2 is at best a standard *internal* text representation (ie. maybe
for objects living inside C programs).
On most Linux computers, locales will have been set up for handling
UTF-8, allowing terminal emulators and various text-oriented utilities
to work with (incl. "display") UTF-8 files out of the box. (In fact I
had to write a small script called "uni-grep" because otherwise I can't
git-grep the UNI files of the edk2 tree for patterns.)

Additionally, UTF-16 capable editors are presumably a rare species in
comparison to UTF-8 capable ones. If we consider that the edk2 tree
mostly contains English translations (which fit in ASCII), then choosing
UTF-16 excludes "old" (ie. 8-bit-only) editors for no good practical
reason.

IMO, choosing UTF-16LE for external text encoding is an inconvenient
Windows-ism that breaks standard (=POSIX) text utilities, and
non-standard yet widespread tools, in most Linux distros.

(b) The git hacks mentioned thus far do not cover git-format-patch. Git
will always think that UTF-16LE encoded files are binary (because they
are), and therefore it will format patches for them as binary deltas
(encoded with base64 or something similar). The hacks referenced above
exist on a higher (more abstract) level only. Binary delta patches are
unreviewable on a mailing list.

For patches that touch UTF-8 text files, git-format-patch generates
plaintext emails, with correct headers such as:

  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit

> Is there a concern with supporting UTF-8?
>
> It seems like in general UTF-8 is better supported, and requires no
> configuration tweaks.
>
> I think the situation is that UTF-8 has become the most commonly used
> format, and therefore it is the most likely format to work well with
> tools.
>
> For EDK II's needs, I can't see a downside to supporting UTF-8, and it
> did not require a huge amount of effort.

I'd support the idea of going UTF-8-only.
The files can be converted all at once (it would take a 10-20 line shell
script approximately, and conversion errors could be caught), in one big
commit, or else we could move forward gradually (same as with the nasm
conversion). I do think such files should be distinguished with a
separate filename suffix.

>>
>> With utf-8, it seems to 'just work'. (With git and svn.)
>>
>> Also, what are the pros/cons to extending the exiting .uni file
>> extension to support either UTF16-LE or UTF-8 encodings vs. adding a
>> new file extension?
>>
>> I don't know of any pros/cons. We could assume that a file without the
>> UTF-16 BOM is UTF-8.

I wouldn't recommend that:

http://en.wikipedia.org/wiki/Byte_order_mark

> Clause D98 of conformance (section 3.10) of the Unicode standard
> states, "The UTF-16 encoding scheme may or may not begin with a BOM.
> However, when there is no BOM, and in the absence of a higher-level
> protocol, the byte order of the UTF-16 encoding scheme is big-endian."
> Whether or not a higher-level protocol is in force is open to
> interpretation. Files local to a computer for which the native byte
> ordering is little-endian, for example, might be argued to be encoded
> as UTF-16LE implicitly. Therefore the presumption of big-endian is
> widely ignored. When those same files are accessible on the Internet,
> on the other hand, no such presumption can be made.

As I said, just my two cents. :)

Thanks
Laszlo

>>
>> It would certainly mess up my unit tests. :)
>>
>> -Jordan
>
> ------------------------------------------------------------------------------
> One dashboard for servers and applications across Physical-Virtual-Cloud
> Widest out-of-the-box monitoring support with 50+ applications
> Performance metrics, stats and reports that give you Actionable Insights
> Deep dive visibility with transaction tracing using APM Insight.
> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/edk2-devel
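
For illustration only, the all-at-once conversion mentioned above (UTF-16
.uni files to UTF-8, with conversion errors caught) might look roughly
like the sketch below. It is written in Python rather than shell. The
function names and the ".uni8" suffix are hypothetical, picked only to
show the "separate filename suffix" idea; nothing here is an existing
edk2 tool. The sketch also assumes that every existing .uni file starts
with a UTF-16 BOM, and it refuses to guess when one is missing.

```python
import codecs
from pathlib import Path


def decode_uni(raw: bytes) -> str:
    """Decode UTF-16 bytes, insisting on a BOM instead of guessing."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and takes the byte order from it.
        return raw.decode("utf-16")
    raise ValueError("no UTF-16 BOM found; refusing to guess the encoding")


def convert_file(src: Path, dst: Path) -> None:
    """Write a UTF-8 copy of one UTF-16 file; line endings are left alone."""
    text = decode_uni(src.read_bytes())
    dst.write_bytes(text.encode("utf-8"))


def convert_tree(root: Path, new_suffix: str = ".uni8") -> list:
    """Convert every *.uni file under root; return (path, error) pairs.

    new_suffix is a hypothetical name for the separate filename suffix.
    """
    failures = []
    # Materialize the file list first, since new files are created below.
    for src in sorted(root.rglob("*.uni")):
        try:
            convert_file(src, src.with_suffix(new_suffix))
        except ValueError as err:  # UnicodeDecodeError is a ValueError too
            failures.append((src, err))
    return failures
```

Calling convert_tree(Path("path/to/tree")) and inspecting the returned
list would show which files failed to convert. The original files are
left in place, so the result can be reviewed before anything is removed.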