Re: [PATCH v1] convert: add support for 'encoding' attribute
On 03 Jan 2018, at 20:15, Junio C Hamanowrote: > Torsten Bögershausen writes: > >> May be. >> Originally utf8.c was about encoding and all kind of UTF-8 related stuff. >> Especially it didn't know anything about strbuf. >> I don't know why strbuf.h and other functions had been added here, >> >> I once moved them into strbuf.c without any problems, but never send out >> a patch, because of possible merge conflicts in ongoing patches. >> >> In any case, if it is about strbuf, I would try to put it into strbuf.c > > Please don't. > > A code that happens to use strbuf as a container and about > manipulating the contents is quite different from a code about > strbuf. The latter is to enhance and extend how the strbuf as a > container behaves. An operation about character encoding for a > string that happens to be stored in strbuf is more about the > encoding, and much much less about strbuf. > > convert.c is about massaging contents coming from the outside world > into a shape stored in Git and the other way around, and there are > multiple ways the contents are massaged. EOL convention may be > adjusted, characters may be reencoded, end-user defined conversion > may be applied. Some of these operations may use helpers specific > for the task from other more library-ish files, like checking if a > string looks like encoded in UTF-8 from utf8.[ch]. Agreed. I did that in v2. See these patches: https://public-inbox.org/git/2017122915.39680-3-lars.schnei...@autodesk.com/ https://public-inbox.org/git/2017122915.39680-4-lars.schnei...@autodesk.com/ - Lars
Re: [PATCH v1] convert: add support for 'encoding' attribute
Torsten Bögershausenwrites: > May be. > Originally utf8.c was about encoding and all kind of UTF-8 related stuff. > Especially it didn't know anything about strbuf. > I don't know why strbuf.h and other functions had been added here, > > I once moved them into strbuf.c without any problems, but never send out > a patch, because of possible merge conflicts in ongoing patches. > > In any case, if it is about strbuf, I would try to put it into strbuf.c Please don't. A code that happens to use strbuf as a container and about manipulating the contents is quite different from a code about strbuf. The latter is to enhance and extend how the strbuf as a container behaves. An operation about character encoding for a string that happens to be stored in strbuf is more about the encoding, and much much less about strbuf. convert.c is about massaging contents coming from the outside world into a shape stored in Git and the other way around, and there are multiple ways the contents are massaged. EOL convention may be adjusted, characters may be reencoded, end-user defined conversion may be applied. Some of these operations may use helpers specific for the task from other more library-ish files, like checking if a string looks like encoded in UTF-8 from utf8.[ch].
Re: [PATCH v1] convert: add support for 'encoding' attribute
> On 29 Dec 2017, at 13:59, Torsten Bögershausenwrote: > > On 2017-12-28 17:14, Lars Schneider wrote:> >>> On 17 Dec 2017, at 18:14, Torsten Bögershausen wrote: >>> >>> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote: From: Lars Schneider >> +`encoding` +^^ + +By default Git assumes UTF-8 encoding for text files. The `encoding` >>> >>> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8. >> >> I am not sure I understand what you want to tell me here. >> Do you think UTF-8 encoding is not correct in the sentence above? > > Git itself was designed to handle source code in ASCII. > Text files which are encoded in ISO-8859-1, UTF-8 or any > super-set of ASCII are handled as well, and what exactly to do > with bytes >0x80 is outside the responsibility of Git. > E.g. the interpretation and rendering on the screen may be > dependent on the locale or being guessed. > All in all, saying that Git expects UTF-8 may be "overdriven". > Any kind of file that uses an '\n' as an end of line > and has no NUL bytes '\0' works as a text file in Git. > (End of BlaBla ;-) > > We can probably replace: > "By default Git assumes UTF-8 encoding for text files" > > with something like > "Git handles UTF-8 encoded files as text files, but UTF-16 encoded > files are handled as binary files" Well, UTF-32 are handled as binary file, too :-) How about this? Git recognizes files encoded with ASCII or one of its supersets (e.g. UTF-8 or ISO-8859-1) as text files. All other... > >>> +attribute sets the encoding to be used in the working directory. +If the path is added to the index, then Git encodes the content to +UTF-8. On checkout the content is encoded back to the original +encoding. Consequently, you can use all built-in Git text processing +tools (e.g. git diff, line ending conversions, etc.) with your +non-UTF-8 encoded file. + +Please note that re-encoding content can cause errors and requires +resources. Use the `encoding` attribute only if you cannot store +a file in UTF-8 encoding and if you want Git to be able to process +the text. + + +*.txt text encoding=UTF-16 + >>> >>> I think that encoding (or worktree-encoding as said elsewhere) must imply >>> text. >>> (That is not done in convert.c, but assuming binary or even auto >>> does not make much sense to me) >> >> "text" only means "LF". "-text" means CRLF which would be perfectly fine >> for UTF-16. Therefore I don't think we need to imply text. >> Does this make sense? > Yes and now. > > "text" means convert CRLF at "git add" (or commit) into LF, > And potentially LF into CRLF at checkout, > depending on the EOL attribute (if set), core.autocrlf and/or core.eol. > > "-text" means don't touch CRLF or LF at all. Files with CRLF are commited > with CRLF. > Running a Unix like "diff" tool will often show "^M" to indicate that there > is a CR before the LF, which may be distracting or confusing. > > If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system, > it makes sense to convert the line endings into LF before storing them > into the index (at least this is my experience). > > (Not specifying "text" or "-text" at all will trigger the auto-handling, > which is not needed at all). > So if we want to be sure that line endings are CRLF in the worktree we > should ask the user to specify things like this: > > *.ext worktree-encoding=UTF-16 text eol=CRLF > > It may be enough to give this example in the documentation. > and I would rather be over-careful here, to avoid future problems. Agreed. I'll add that example! + +All `iconv` encodings with a stable round-trip conversion to and from +UTF-8 are supported. You can see a full list with the following command: + + +iconv --list + + `eol` ^ diff --git a/convert.c b/convert.c index 20d7ab67bd..ee19c17104 100644 --- a/convert.c +++ b/convert.c @@ -7,6 +7,7 @@ #include "sigchain.h" #include "pkt-line.h" #include "sub-process.h" +#include "utf8.h" /* * convert.c - convert a file when checking it out and checking it in. @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, } >>> >>> I would avoid to use these #ifdefs here. >>> All functions can be in strbuf.c (and may have #ifdefs there), but not here. >> >> I'll try that. But wouldn't it make more sense to move the functions to >> utf.c? > > May be. > Originally utf8.c was about encoding and all kind of UTF-8 related stuff. > Especially it didn't know anything about strbuf. > I don't know why strbuf.h and other functions
Re: [PATCH v1] convert: add support for 'encoding' attribute
On 2017-12-28 17:14, Lars Schneider wrote:> >> On 17 Dec 2017, at 18:14, Torsten Bögershausenwrote: >> >> On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote: >>> From: Lars Schneider >>> > >>> +`encoding` >>> +^^ >>> + >>> +By default Git assumes UTF-8 encoding for text files. The `encoding` >> >> This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8. > > I am not sure I understand what you want to tell me here. > Do you think UTF-8 encoding is not correct in the sentence above? Git itself was designed to handle source code in ASCII. Text files which are encoded in ISO-8859-1, UTF-8 or any super-set of ASCII are handled as well, and what exactly to do with bytes >0x80 is outside the responsibility of Git. E.g. the interpretation and rendering on the screen may be dependent on the locale or being guessed. All in all, saying that Git expects UTF-8 may be "overdriven". Any kind of file that uses an '\n' as an end of line and has no NUL bytes '\0' works as a text file in Git. (End of BlaBla ;-) We can probably replace: "By default Git assumes UTF-8 encoding for text files" with something like "Git handles UTF-8 encoded files as text files, but UTF-16 encoded files are handled as binary files" >> >>> +attribute sets the encoding to be used in the working directory. >>> +If the path is added to the index, then Git encodes the content to >>> +UTF-8. On checkout the content is encoded back to the original >>> +encoding. Consequently, you can use all built-in Git text processing >>> +tools (e.g. git diff, line ending conversions, etc.) with your >>> +non-UTF-8 encoded file. >>> + >>> +Please note that re-encoding content can cause errors and requires >>> +resources. Use the `encoding` attribute only if you cannot store >>> +a file in UTF-8 encoding and if you want Git to be able to process >>> +the text. >>> + >>> + >>> +*.txt text encoding=UTF-16 >>> + >> >> I think that encoding (or worktree-encoding as said elsewhere) must imply >> text. >> (That is not done in convert.c, but assuming binary or even auto >> does not make much sense to me) > > "text" only means "LF". "-text" means CRLF which would be perfectly fine > for UTF-16. Therefore I don't think we need to imply text. > Does this make sense? Yes and now. "text" means convert CRLF at "git add" (or commit) into LF, And potentially LF into CRLF at checkout, depending on the EOL attribute (if set), core.autocrlf and/or core.eol. "-text" means don't touch CRLF or LF at all. Files with CRLF are commited with CRLF. Running a Unix like "diff" tool will often show "^M" to indicate that there is a CR before the LF, which may be distracting or confusing. If someone ever wants to handle the UTF-16 files on a Linux/Mac/Unix system, it makes sense to convert the line endings into LF before storing them into the index (at least this is my experience). (Not specifying "text" or "-text" at all will trigger the auto-handling, which is not needed at all). So if we want to be sure that line endings are CRLF in the worktree we should ask the user to specify things like this: *.ext worktree-encoding=UTF-16 text eol=CRLF It may be enough to give this example in the documentation. and I would rather be over-careful here, to avoid future problems. > >> >> >>> + >>> +All `iconv` encodings with a stable round-trip conversion to and from >>> +UTF-8 are supported. You can see a full list with the following command: >>> + >>> + >>> +iconv --list >>> + >>> + >>> `eol` >>> ^ >>> >>> diff --git a/convert.c b/convert.c >>> index 20d7ab67bd..ee19c17104 100644 >>> --- a/convert.c >>> +++ b/convert.c >>> @@ -7,6 +7,7 @@ >>> #include "sigchain.h" >>> #include "pkt-line.h" >>> #include "sub-process.h" >>> +#include "utf8.h" >>> >>> /* >>> * convert.c - convert a file when checking it out and checking it in. >>> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct >>> text_stat *stats, >>> >>> } >> >> I would avoid to use these #ifdefs here. >> All functions can be in strbuf.c (and may have #ifdefs there), but not here. > > I'll try that. But wouldn't it make more sense to move the functions to utf.c? May be. Originally utf8.c was about encoding and all kind of UTF-8 related stuff. Especially it didn't know anything about strbuf. I don't know why strbuf.h and other functions had been added here, I once moved them into strbuf.c without any problems, but never send out a patch, because of possible merge conflicts in ongoing patches. In any case, if it is about strbuf, I would try to put it into strbuf.c >> >>> >>> +#ifdef NO_ICONV >>> +#ifndef _ICONV_T >>> +/* The type is just a placeholder and not actually used. */ >>> +typedef void* iconv_t; >>> +#endif >>> +#endif >>> + >>> +static struct encoding { >>> + const char *name; >>> + iconv_t
Re: [PATCH v1] convert: add support for 'encoding' attribute
> On 17 Dec 2017, at 18:14, Torsten Bögershausenwrote: > > On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote: >> From: Lars Schneider >> >> +`encoding` >> +^^ >> + >> +By default Git assumes UTF-8 encoding for text files. The `encoding` > > This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8. I am not sure I understand what you want to tell me here. Do you think UTF-8 encoding is not correct in the sentence above? > >> +attribute sets the encoding to be used in the working directory. >> +If the path is added to the index, then Git encodes the content to >> +UTF-8. On checkout the content is encoded back to the original >> +encoding. Consequently, you can use all built-in Git text processing >> +tools (e.g. git diff, line ending conversions, etc.) with your >> +non-UTF-8 encoded file. >> + >> +Please note that re-encoding content can cause errors and requires >> +resources. Use the `encoding` attribute only if you cannot store >> +a file in UTF-8 encoding and if you want Git to be able to process >> +the text. >> + >> + >> +*.txt text encoding=UTF-16 >> + > > I think that encoding (or worktree-encoding as said elsewhere) must imply > text. > (That is not done in convert.c, but assuming binary or even auto > does not make much sense to me) "text" only means "LF". "-text" means CRLF which would be perfectly fine for UTF-16. Therefore I don't think we need to imply text. Does this make sense? > > >> + >> +All `iconv` encodings with a stable round-trip conversion to and from >> +UTF-8 are supported. You can see a full list with the following command: >> + >> + >> +iconv --list >> + >> + >> `eol` >> ^ >> >> diff --git a/convert.c b/convert.c >> index 20d7ab67bd..ee19c17104 100644 >> --- a/convert.c >> +++ b/convert.c >> @@ -7,6 +7,7 @@ >> #include "sigchain.h" >> #include "pkt-line.h" >> #include "sub-process.h" >> +#include "utf8.h" >> >> /* >> * convert.c - convert a file when checking it out and checking it in. >> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct >> text_stat *stats, >> >> } > > I would avoid to use these #ifdefs here. > All functions can be in strbuf.c (and may have #ifdefs there), but not here. I'll try that. But wouldn't it make more sense to move the functions to utf.c? > >> >> +#ifdef NO_ICONV >> +#ifndef _ICONV_T >> +/* The type is just a placeholder and not actually used. */ >> +typedef void* iconv_t; >> +#endif >> +#endif >> + >> +static struct encoding { >> +const char *name; >> +iconv_t to_git; /* conversion to Git's canonical encoding (UTF-8) >> */ >> +iconv_t to_worktree; /* conversion to user-defined encoding */ >> +struct encoding *next; >> +} *encoding, **encoding_tail; > > This seems like an optimazation, assuning that iconv_open() eats a lot of > ressources. > I don't think this is the case. (Undere MacOS we run iconv_open() together > with > every opendir(), and optimizing this out did not show any measurable > improvements) True, but then I would need to free() the memory in a lot of places. Therefore I thought it is easier to keep the object. OK for you? >> +static const char *default_encoding = "utf-8"; > > The most portable form is "UTF-8" (correct me if that is wrong) It shouldn't matter. But I've changed it to uppercase to be on the safe side. >> +static iconv_t invalid_conversion = (iconv_t)-1; >> + >> +static int encode_to_git(const char *path, const char *src, size_t src_len, >> + struct strbuf *buf, struct encoding *enc) >> +{ >> +#ifndef NO_ICONV >> +char *dst, *re_src; >> +int dst_len, re_src_len; >> + >> +/* >> + * No encoding is specified or there is nothing to encode. >> + * Tell the caller that the content was not modified. >> + */ >> +if (!enc || (src && !src_len)) >> +return 0; >> + >> +/* >> + * Looks like we got called from "would_convert_to_git()". >> + * This means Git wants to know if it would encode (= modify!) >> + * the content. Let's answer with "yes", since an encoding was >> + * specified. >> + */ >> +if (!buf && !src) >> +return 1; >> + >> +if (enc->to_git == invalid_conversion) { >> +enc->to_git = iconv_open(default_encoding, encoding->name); >> +if (enc->to_git == invalid_conversion) >> +warning(_("unsupported encoding %s"), encoding->name); >> +} > > /* There are 2 different types of reaction: > Either users know what that a warning means: You asked for > problems, > and do the right thing. Other may may ignore the warning > - in both cases an error is useful */ Agreed! >> +if (enc->to_worktree == invalid_conversion) >> +
Re: [PATCH v1] convert: add support for 'encoding' attribute
On Mon, Dec 18, 2017 at 08:12:49AM -0500, Jeff King wrote: > On Mon, Dec 18, 2017 at 11:13:34AM +0100, Torsten Bögershausen wrote: > > > Just to confirm my missing knowledge here: > > Does this mean, that git-gui and gitk can decode/reencode > > the content of a file/blob, when the .gitattributes say so ? > > That's my impression, yes. > > > If yes, would it make sense to enhance the "git diff" instead ? > > "git diff --encoding" will pick up the commited encoding from > > .attributes, convert it into UTF-8, and run the diff ? > > You can do something like this already: > > git config diff.utf16.textconv 'iconv -f utf16 -t utf8' > echo file diff=utf16 >.gitattributes > > I have no objection to teaching it that "encoding" means roughly the > same thing, which would have two advantages: > > - we could do it in-process, which would be much faster > > - we could skip the extra config step, which is a blocker for >server-side diffs on sites like GitHub > > I don't think you'd really need "--encoding". This could just be > triggered by the normal "--textconv" rules (i.e., just treat this "as > if" the textconv above was configured when we see an encoding > attribute). I can probably come up with a patch the next weeks or so. > > > We actually could enhance the "git diff" output with a single > > line saying > > "Git index-encoding=cp1251" > > or so, which can be picked up by "git apply". > > That extends it beyond what textconv can do (we assume that textconv > patches are _just_ for human consumption, and can't be applied). And it > would mean you'd potentially want to trigger it in more places. On the > other hand, I don't know that we're guaranteed that a round-trip > between encodings will produce a byte-wise identical result. The nice > thing about piggy-backing on textconv is that it's already dealt with > that problem. > > -Peff
Re: [PATCH v1] convert: add support for 'encoding' attribute
Am 18.12.2017 um 11:13 schrieb Torsten Bögershausen: Just to confirm my missing knowledge here: Does this mean, that git-gui and gitk can decode/reencode the content of a file/blob, when the .gitattributes say so ? No. I think they parse the output of git-diff et.al., split it per file, and then consult .gitattributes to interpret the byte sequence according to the encoding. -- Hannes
Re: [PATCH v1] convert: add support for 'encoding' attribute
Torsten Bögershausenwrites: > Just to confirm my missing knowledge here: > Does this mean, that git-gui and gitk can decode/reencode > the content of a file/blob, when the .gitattributes say so ? These programs, when told that a file is in an encoding, read bytes from that file and interpret that byte sequence in the given encoding, and blits the corresponding glyphs on the screen---giving UTF-8 out is not the primary feature of them, so I am not sure decode/reencode is a good term to use here.
Re: [PATCH v1] convert: add support for 'encoding' attribute
On Mon, Dec 18, 2017 at 11:13:34AM +0100, Torsten Bögershausen wrote: > Just to confirm my missing knowledge here: > Does this mean, that git-gui and gitk can decode/reencode > the content of a file/blob, when the .gitattributes say so ? That's my impression, yes. > If yes, would it make sense to enhance the "git diff" instead ? > "git diff --encoding" will pick up the commited encoding from > .attributes, convert it into UTF-8, and run the diff ? You can do something like this already: git config diff.utf16.textconv 'iconv -f utf16 -t utf8' echo file diff=utf16 >.gitattributes I have no objection to teaching it that "encoding" means roughly the same thing, which would have two advantages: - we could do it in-process, which would be much faster - we could skip the extra config step, which is a blocker for server-side diffs on sites like GitHub I don't think you'd really need "--encoding". This could just be triggered by the normal "--textconv" rules (i.e., just treat this "as if" the textconv above was configured when we see an encoding attribute). > We actually could enhance the "git diff" output with a single > line saying > "Git index-encoding=cp1251" > or so, which can be picked up by "git apply". That extends it beyond what textconv can do (we assume that textconv patches are _just_ for human consumption, and can't be applied). And it would mean you'd potentially want to trigger it in more places. On the other hand, I don't know that we're guaranteed that a round-trip between encodings will produce a byte-wise identical result. The nice thing about piggy-backing on textconv is that it's already dealt with that problem. -Peff
Re: [PATCH v1] convert: add support for 'encoding' attribute
On Mon, Dec 18, 2017 at 11:54:32AM +0100, Lars Schneider wrote: > > warning: failed to encode 'file' from utf-8 to utf16 > > > > At least it figured out that it couldn't convert the content. It's > > slightly troubling that it would try in the first place, though; are > > there encoding pairs where we might accidentally generate nonsense? > > At this point we interpret utf-16 content as utf-8 and try to convert > it to utf-16. That of course fails because utf-16 content is no valid > utf-8. How could we stop trying that? How could Git possibly know what > kind of encoding is used (apart from our new hint in gitattributes)? Yeah, sorry if I wasn't clear: I don't really have an answer to those questions either. So this is probably the best we can do. I was mostly just trying to think through the worst case, and what could go wrong. > > It may make sense to die() during "git add ." (since we're actually > > changing the index entry, and we don't want to put nonsense into a > > tree). But I'm not sure it's the best thing for operations which just > > want to read the content. For them, perhaps it would be more appropriate > > to issue a warning and return the untouched content. > > Absolutely! Thanks for spotting this. I will try to run die() only on > "git add" in v2. Great, thanks! -Peff
Re: [PATCH v1] convert: add support for 'encoding' attribute
> On 15 Dec 2017, at 10:58, Jeff Kingwrote: > > On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote: > >> From: Lars Schneider >> >> Git and its tools (e.g. git diff) expect all text files in UTF-8 >> encoding. Git will happily accept content in all other encodings, too, >> but it might not be able to process the text (e.g. viewing diffs or >> changing line endings). >> >> Add an attribute to tell Git what encoding the user has defined for a >> given file. If the content is added to the index, then Git converts the >> content to a canonical UTF-8 representation. On checkout Git will >> reverse the conversion. > > This made me wonder what happens when you have a file that _was_ utf16, > and then you later convert it to utf8 and add a .gitattributes entry. > > I tried this with your patch: > > git init repo > cd repo > > echo foo | iconv -t utf16 >file > git add file > git commit -m utf16 > > echo 'file encoding=utf16' >.gitattributes > touch file ;# make it stat-dirty > git commit -m convert > > git checkout HEAD^ > > That works OK, because we try to read the attributes from the > destination tree for a checkout. If you do this: > > echo 'file encoding=utf16' >.git/info/attributes > git checkout HEAD^ > > then we get: > > warning: failed to encode 'file' from utf-8 to utf16 > > At least it figured out that it couldn't convert the content. It's > slightly troubling that it would try in the first place, though; are > there encoding pairs where we might accidentally generate nonsense? At this point we interpret utf-16 content as utf-8 and try to convert it to utf-16. That of course fails because utf-16 content is no valid utf-8. How could we stop trying that? How could Git possibly know what kind of encoding is used (apart from our new hint in gitattributes)? I asked myself the question about nonsense encoding pairs, too. That was the reason I added the "X -> UTF-8 -> Y && X == Y" check on Git add. However, with ".git/info/attributes" you could circumvent that check and probably generate nonsense. > It _should_ be uncommon, I think, to have a repo-level attribute set > like that. It's very common for us to use whatever happens to be in the > checked-out .gitattributes for some attributes (e.g., when doing a diff > of an older commit), but I think for checkout-related ones it's not. So > I think it may generally work in practice. And certainly the line-ending > code would share any similar problems, but at least there the result is > less confusing than mojibake. > > Playing around, I also managed to do this: > > echo 'file encoding=utf16' >.gitattributes > echo foo >file > > # I did these with an old version of git that didn't > # support the new attribute, so they blindly added the utf8 content. > git add . > git commit -m convert > > git.compile checkout HEAD^ > > which yielded: > > fatal: encoding 'file' from utf16 to utf-8 and back is not the same > > Now obviously my situation is somewhat nonsense. I was trying to force > the in-repo representation to utf8, but ended up with a mismatched > working tree file. But what's somewhat troubling is that I couldn't > checkout _away_ from that state due to the die() in convert_to_git(). > Which is in turn just there as part of refresh_index(). > > And indeed, other commands hit the same problem: > > $ git.compile diff > fatal: encoding 'file' from utf16 to utf-8 and back is not the same > > $ git.compile checkout -f > fatal: encoding 'file' from utf16 to utf-8 and back is not the same > > It may make sense to die() during "git add ." (since we're actually > changing the index entry, and we don't want to put nonsense into a > tree). But I'm not sure it's the best thing for operations which just > want to read the content. For them, perhaps it would be more appropriate > to issue a warning and return the untouched content. Absolutely! Thanks for spotting this. I will try to run die() only on "git add" in v2. - Lars
Re: [PATCH v1] convert: add support for 'encoding' attribute
On Mon, Dec 11, 2017 at 09:47:24PM +0100, Johannes Sixt wrote: > Am 11.12.2017 um 16:50 schrieb lars.schnei...@autodesk.com: > >From: Lars Schneider> > > >Git and its tools (e.g. git diff) expect all text files in UTF-8 > >encoding. Git will happily accept content in all other encodings, too, > >but it might not be able to process the text (e.g. viewing diffs or > >changing line endings). > > > >Add an attribute to tell Git what encoding the user has defined for a > >given file. If the content is added to the index, then Git converts the > >content to a canonical UTF-8 representation. On checkout Git will > >reverse the conversion. > > > >Reviewed-by: Patrick Lühne > >Signed-off-by: Lars Schneider > >--- > > > >Hi, > > > >here is a WIP patch to add text encoding support for files encoded with > >something other than UTF-8 [RFC]. > > > >The 'encoding' attribute is already used to view blobs in gitk. That > >could be a problem as the content is stored in Git with the defined > >encoding. This patch would interpret the content as UTF-8 encoded and > > This will be a major drawback for me because my code base stores text files > that are not UTF-8 encoded. And I do use the existing 'encoding' attribute > to view the text in git-gui and gitk. Repurposing this attribute name is not > an option, IMO. Just to confirm my missing knowledge here: Does this mean, that git-gui and gitk can decode/reencode the content of a file/blob, when the .gitattributes say so ? If yes, would it make sense to enhance the "git diff" instead ? "git diff --encoding" will pick up the commited encoding from .attributes, convert it into UTF-8, and run the diff ? We actually could enhance the "git diff" output with a single line saying "Git index-encoding=cp1251" or so, which can be picked up by "git apply". The advantage would be that we could continue to commit in UTF-16 as before, and avoid the glitches with .gitattributes, that Peff pointed out. Does this make sense ? > > >it would try to reencode it to the defined encoding on checkout > Plus, > >many repos define the attribute very broad (e.g. "* > encoding=cp1251"). Is this a user mistake ? > >These folks would see errors like these with my patch: > > error: failed to encode 'foo.bar' from utf-8 to cp1251 > > -- Hannes
Re: [PATCH v1] convert: add support for 'encoding' attribute
On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote: > From: Lars Schneider> > Git and its tools (e.g. git diff) expect all text files in UTF-8 > encoding. Git will happily accept content in all other encodings, too, > but it might not be able to process the text (e.g. viewing diffs or > changing line endings). > > Add an attribute to tell Git what encoding the user has defined for a > given file. If the content is added to the index, then Git converts the > content to a canonical UTF-8 representation. On checkout Git will > reverse the conversion. > > Reviewed-by: Patrick Lühne > Signed-off-by: Lars Schneider > --- > > Hi, > > here is a WIP patch to add text encoding support for files encoded with > something other than UTF-8 [RFC]. > > The 'encoding' attribute is already used to view blobs in gitk. That > could be a problem as the content is stored in Git with the defined > encoding. This patch would interpret the content as UTF-8 encoded and > it would try to reencode it to the defined encoding on checkout. > Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251"). > These folks would see errors like these with my patch: > error: failed to encode 'foo.bar' from utf-8 to cp1251 > > A quick search on GitHub reveals 2,722 repositories that use the > 'encoding' attribute [1]. Using the GitHub API [2], I identified the > following encodings in all publicly accessible repositories: > > ANSI<-- garbage (ignore by my implementation) > cp1251 > cp866 > cp950 > iso8859-1 > koi8-r > shiftjis<-- garbage (ignore by my implementation) > UTF-8 <-- unnecessary (ignore by my implementation) > utf8<-- garbage (ignore by my implementation) > > TODOs: > - The iconv docs mention that "errno == EINVAL" is harmless but > we don't handle that case in utf8.c:reencode_string_iconv() > - Git does not compile with NO_ICONV=1 right now because of > compat/precompose_utf8.c. I will send a patch to fix that. Minor: Does Git not compile under MacOS when setting NO_ICONV=1 ? This is possible not introduced by the patch. > > Questions: > - Should I use warning() or error() in the patch? > (currently I use the warning) Errors, see below. > - Do you agree with the approach in general? Yes, some comments inline. > > Thanks, > Lars > > > [RFC] > http://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com > [1] > https://github.com/search?p=1=encoding+filename%3Agitattributes=Code=%E2%9C%93 > [2] curl --user larsxschneider:secret -H 'Accept: > application/vnd.github.v3.text-match+json' > 'https://api.github.com/search/code?q=encoding+in:file+filename:gitattributes' > | jq -r '.items[].text_matches[].fragment' | sed 's/.*encoding=//' | sort | > uniq > [3] > https://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html#iconv-Examples > > > > > Notes: > Base Ref: master > Web-Diff: https://github.com/larsxschneider/git/commit/afc9e88a4d > Checkout: git fetch https://github.com/larsxschneider/git encoding-v1 && > git checkout afc9e88a4d > > Documentation/gitattributes.txt | 27 ++ > convert.c | 192 > +++- > t/t0028-encoding.sh | 60 + > 3 files changed, 278 insertions(+), 1 deletion(-) > create mode 100755 t/t0028-encoding.sh > > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt > index 30687de81a..84de2fa49c 100644 > --- a/Documentation/gitattributes.txt > +++ b/Documentation/gitattributes.txt > @@ -146,6 +146,33 @@ Unspecified:: > Any other value causes Git to act as if `text` has been left > unspecified. > > +`encoding` > +^^ > + > +By default Git assumes UTF-8 encoding for text files. The `encoding` This is probably "ASCII" and it's supersets like ISO-8859-1 or UTF-8. > +attribute sets the encoding to be used in the working directory. > +If the path is added to the index, then Git encodes the content to > +UTF-8. On checkout the content is encoded back to the original > +encoding. Consequently, you can use all built-in Git text processing > +tools (e.g. git diff, line ending conversions, etc.) with your > +non-UTF-8 encoded file. > + > +Please note that re-encoding content can cause errors and requires > +resources. Use the `encoding` attribute only if you cannot store > +a file in UTF-8 encoding and if you want Git to be able to process > +the text. > + > + > +*.txttext encoding=UTF-16 > + I think that encoding (or worktree-encoding as said elsewhere) must imply text. (That is not done in convert.c, but assuming binary or even auto does not make much sense to me) > + > +All `iconv` encodings with a stable round-trip conversion to and from > +UTF-8
Re: [PATCH v1] convert: add support for 'encoding' attribute
On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schnei...@autodesk.com wrote: > From: Lars Schneider> > Git and its tools (e.g. git diff) expect all text files in UTF-8 > encoding. Git will happily accept content in all other encodings, too, > but it might not be able to process the text (e.g. viewing diffs or > changing line endings). > > Add an attribute to tell Git what encoding the user has defined for a > given file. If the content is added to the index, then Git converts the > content to a canonical UTF-8 representation. On checkout Git will > reverse the conversion. This made me wonder what happens when you have a file that _was_ utf16, and then you later convert it to utf8 and add a .gitattributes entry. I tried this with your patch: git init repo cd repo echo foo | iconv -t utf16 >file git add file git commit -m utf16 echo 'file encoding=utf16' >.gitattributes touch file ;# make it stat-dirty git commit -m convert git checkout HEAD^ That works OK, because we try to read the attributes from the destination tree for a checkout. If you do this: echo 'file encoding=utf16' >.git/info/attributes git checkout HEAD^ then we get: warning: failed to encode 'file' from utf-8 to utf16 At least it figured out that it couldn't convert the content. It's slightly troubling that it would try in the first place, though; are there encoding pairs where we might accidentally generate nonsense? It _should_ be uncommon, I think, to have a repo-level attribute set like that. It's very common for us to use whatever happens to be in the checked-out .gitattributes for some attributes (e.g., when doing a diff of an older commit), but I think for checkout-related ones it's not. So I think it may generally work in practice. And certainly the line-ending code would share any similar problems, but at least there the result is less confusing than mojibake. Playing around, I also managed to do this: echo 'file encoding=utf16' >.gitattributes echo foo >file # I did these with an old version of git that didn't # support the new attribute, so they blindly added the utf8 content. git add . git commit -m convert git.compile checkout HEAD^ which yielded: fatal: encoding 'file' from utf16 to utf-8 and back is not the same Now obviously my situation is somewhat nonsense. I was trying to force the in-repo representation to utf8, but ended up with a mismatched working tree file. But what's somewhat troubling is that I couldn't checkout _away_ from that state due to the die() in convert_to_git(). Which is in turn just there as part of refresh_index(). And indeed, other commands hit the same problem: $ git.compile diff fatal: encoding 'file' from utf16 to utf-8 and back is not the same $ git.compile checkout -f fatal: encoding 'file' from utf16 to utf-8 and back is not the same It may make sense to die() during "git add ." (since we're actually changing the index entry, and we don't want to put nonsense into a tree). But I'm not sure it's the best thing for operations which just want to read the content. For them, perhaps it would be more appropriate to issue a warning and return the untouched content. -Peff
Re: [PATCH v1] convert: add support for 'encoding' attribute
Lars Schneiderwrites: > That way you could control the encoding for a text file specific > for each path similar to the "mode bits". That also means you could > change the encoding of a file while the blob content stays the same. That is exactly why I said that I doubt it makes sense. When you have the same blob object (that is in UTF-8) appear at two places in a tree (or two different subtrees inside a single tree) and the two tree entries that point at the blob tells contradicting story, you would have a checkout of the same contents in two different encodings. If you have the same blob object appear in two adjacent commits at the same path, with one commit's tree recording its encoding differently from the other's, you would switch encodings when you switch branches. I doubt these are "features", but sources of confusion. Be it a feature, or a misfeature, it is shared with the existing solution which is .gitattributes embedded in the tree, so you are not making anything better or worse by moving the information to tree entry. What I actually expected to hear when somebody talks about "ideal" (which may not even be achievable given existing user base) was to make blob object declare its desired external representation. That would remove the two possible confusion cited above, because once you have a blob, you have everything needed to externalize it. In any case, I do not think we'd go there anyway, so ...
Re: [PATCH v1] convert: add support for 'encoding' attribute
> On 13 Dec 2017, at 19:11, Junio C Hamanowrote: > > Lars Schneider writes: > >> ... In a perfect world I think I would store >> the encoding of a file in the tree object. I didn't pursue that solution >> as this would change the Git data model which would open a can of worms >> for a problem that not that many people have (almost everyone is on >> UTF-8 anyways). > > Having that "encoding" trailt recorded in the tree that contains the > blob would mean that the same blob can be recorded with one > "encoding" trait in a tree, and in a different tree it can be > recorded with a different "encoding" trait. I doubt it really makes > sense. The "blob object" would store the text data encoded in a canonical UTF-8 form. The "tree object" would store the encoding. On checkin Git would convert the text from the stored encoding to UTF-8 and on checkout it would do the reverse. That way you could control the encoding for a text file specific for each path similar to the "mode bits". That also means you could change the encoding of a file while the blob content stays the same. Changing the encoding of a file with the .gitattributes approach can be difficult if you have configured the attribute with a very broad pattern (e.g. *.foo). You would either need to rename the file or limit the scope of your pattern. - Lars
Re: [PATCH v1] convert: add support for 'encoding' attribute
Lars Schneiderwrites: > ... In a perfect world I think I would store > the encoding of a file in the tree object. I didn't pursue that solution > as this would change the Git data model which would open a can of worms > for a problem that not that many people have (almost everyone is on > UTF-8 anyways). Having that "encoding" trailt recorded in the tree that contains the blob would mean that the same blob can be recorded with one "encoding" trait in a tree, and in a different tree it can be recorded with a different "encoding" trait. I doubt it really makes sense.
Re: [PATCH v1] convert: add support for 'encoding' attribute
> On 12 Dec 2017, at 20:31, Junio C Hamanowrote: > > Lars Schneider writes: > >> Our favorite is "treat-encoding-as". Do you consider this better >> or worse than "checkout-encoding"? > > I am afraid that "treat as" is not sufficiently specific and would > invite a misinterpretation, e.g. "You record the bytes I throw at > you as-is in the object store, but treat them appropriately as > contents that are encoded in cp1252 when presenting". > > what is missing is at what stage in the overall user experience does > that "treating" happens. That causes such a misinterpretation. > > So from that point of view, "checkout-" or" working-tree-" would be > a better phrase to accompany "encoding" to clarify what this attr is > for than "treat-as". Alright. I'll go with "checkout-encoding" then. > Having said all that, this "feature" would need a moderate amount of > clear description in the documentation, and between the perfect and > a suboptimal name, the amount of explanation required would not be > all that different, I suspect. We need to say that those who wish > to use this attribute are buying into recording their contents in > UTF-8 and when the contents are externalized to the working tree, > they are converted to the encoding the value of this attribute > specifies and vice versa. Absolutely! The docs will come in v2. I just wanted to send out a v1 to make people aware of what I am working on. I think storing the encoding in gitattributes might not be the perfect solution but probably a pragmatic one (because it works and the changes to Git are rather small). In a perfect world I think I would store the encoding of a file in the tree object. I didn't pursue that solution as this would change the Git data model which would open a can of worms for a problem that not that many people have (almost everyone is on UTF-8 anyways). - Lars
Re: [PATCH v1] convert: add support for 'encoding' attribute
Lars Schneiderwrites: > Our favorite is "treat-encoding-as". Do you consider this better > or worse than "checkout-encoding"? I am afraid that "treat as" is not sufficiently specific and would invite a misinterpretation, e.g. "You record the bytes I throw at you as-is in the object store, but treat them appropriately as contents that are encoded in cp1252 when presenting". what is missing is at what stage in the overall user experience does that "treating" happens. That causes such a misinterpretation. So from that point of view, "checkout-" or" working-tree-" would be a better phrase to accompany "encoding" to clarify what this attr is for than "treat-as". Having said all that, this "feature" would need a moderate amount of clear description in the documentation, and between the perfect and a suboptimal name, the amount of explanation required would not be all that different, I suspect. We need to say that those who wish to use this attribute are buying into recording their contents in UTF-8 and when the contents are externalized to the working tree, they are converted to the encoding the value of this attribute specifies and vice versa.
Re: [PATCH v1] convert: add support for 'encoding' attribute
> On 12 Dec 2017, at 00:58, Eric Sunshinewrote: > > On Mon, Dec 11, 2017 at 6:47 PM, Lars Schneider > wrote: >> On 11 Dec 2017, at 19:39, Eric Sunshine wrote: >>> On Mon, Dec 11, 2017 at 10:50 AM, wrote: From: Lars Schneider Git and its tools (e.g. git diff) expect all text files in UTF-8 encoding. Git will happily accept content in all other encodings, too, but it might not be able to process the text (e.g. viewing diffs or changing line endings). Add an attribute to tell Git what encoding the user has defined for a given file. If the content is added to the index, then Git converts the content to a canonical UTF-8 representation. On checkout Git will reverse the conversion. Reviewed-by: Patrick Lühne Signed-off-by: Lars Schneider --- +static int encode_to_git(const char *path, const char *src, size_t src_len, +struct strbuf *buf, struct encoding *enc) +{ + if (enc->to_git == invalid_conversion) { + enc->to_git = iconv_open(default_encoding, encoding->name); + if (enc->to_git == invalid_conversion) + warning(_("unsupported encoding %s"), encoding->name); + } + + if (enc->to_worktree == invalid_conversion) + enc->to_worktree = iconv_open(encoding->name, default_encoding); >>> >>> Do you need to be calling iconv_close() somewhere on the result of the >>> iconv_open() calls? [Answering myself after reading the rest of the >>> patch: You're caching these opened 'iconv' descriptors, so you don't >>> plan on closing them.] >> >> Should this information go into the commit message to avoid confusing >> future readers? I think, yes. > > Maybe. However, the code which does the actual caching is so distant > from these iconv_open() invocations that it might be more helpful to > have an in-code comment here saying that the "missing" iconv_close() > invocations is intentional. Agreed. I will add that in v2. Thanks, Lars
Re: [PATCH v1] convert: add support for 'encoding' attribute
> On 12 Dec 2017, at 08:15, Johannes Sixtwrote: > > Am 12.12.2017 um 01:59 schrieb Junio C Hamano: >> Stepping back a bit, what does this thing do you are introducing? >> And what does the other thing do that J6t is using, that would get >> confused with this new one? >> What does the other one do? "Declare that the contents of this path >> is in this encoding"? As opposed to the new one, which tells Git to >> "run iconv from and to this encoding when checking out and checking >> in"? >> If so, any phrase that depends heavily on the word "encode" would >> not help differenciating the two uses. The phrase needs to be >> something that contrasts the new one, which actively modifies things >> (what is on the filesystem is not what is stored in the object >> store), with the old one, which does not (passed as a declaration to >> a viewer what encoding the contents already use and does not change >> anything). > > Well explained! > >> ... perhaps "smudge-encoding" would work (we declare that the >> result of smudge operations are left in this encoding, so the >> opposite operation "clean" will do the reverse---and we say this >> without explicitly saying that the other end of the conversion is >> always UTF-8)? Or "checkout-encoding" (the same explanation; we do >> not say the opposite operation "checkin/add" will do the reverse). > > I would favor "checkout-encoding" over "smudge-encoding" only because > "checkout" is better known than "smudge", I would think. I do not have better > suggestions. Thanks for your thoughts! "checkout-encoding" would work for me. However, it might convey that Git "does change the files of the user in some way" (which is true from Git's perspective but not from the user perspective). Patrick and I brainstormed a few more possible alternatives: apply-encoding blob-encoding checkout-encoding diff-encoding encoding-hint external-encoding handle-as internal-encoding keep-encoding present-as preserve-encoding source-encoding text-conversion text-encoding treat-as treat-encoding-as Our favorite is "treat-encoding-as". Do you consider this better or worse than "checkout-encoding"? Thanks, Lars PS: Naming things is hard ;-)
Re: [PATCH v1] convert: add support for 'encoding' attribute
Am 12.12.2017 um 01:59 schrieb Junio C Hamano: Stepping back a bit, what does this thing do you are introducing? And what does the other thing do that J6t is using, that would get confused with this new one? What does the other one do? "Declare that the contents of this path is in this encoding"? As opposed to the new one, which tells Git to "run iconv from and to this encoding when checking out and checking in"? If so, any phrase that depends heavily on the word "encode" would not help differenciating the two uses. The phrase needs to be something that contrasts the new one, which actively modifies things (what is on the filesystem is not what is stored in the object store), with the old one, which does not (passed as a declaration to a viewer what encoding the contents already use and does not change anything). Well explained! ... perhaps "smudge-encoding" would work (we declare that the result of smudge operations are left in this encoding, so the opposite operation "clean" will do the reverse---and we say this without explicitly saying that the other end of the conversion is always UTF-8)? Or "checkout-encoding" (the same explanation; we do not say the opposite operation "checkin/add" will do the reverse). I would favor "checkout-encoding" over "smudge-encoding" only because "checkout" is better known than "smudge", I would think. I do not have better suggestions. -- Hannes
Re: [PATCH v1] convert: add support for 'encoding' attribute
Am 12.12.2017 um 00:42 schrieb Lars Schneider: BTW: I am curios, can you share what encoding you use? My main use case is UTF-16 and I was surprised that I haven't found a single public repo on github.com with "encoding=utf-16" Shift-JIS and CP1252. These are used for Windows resource files (*.rc). -- Hannes
Re: [PATCH v1] convert: add support for 'encoding' attribute
Lars Schneiderwrites: > I contemplated: > - "enc" or "encode" because "eol" and "ident" use abbreviations, too > (enc could be confused with encryption. plus, a user might ask > what is the difference between "enc" and "encoding" attribute :-) > - "wte", "wtenc", or "worktree-encoding" to emphasize that this is > the encoding used in the worktree > (I fear that users think that is git-worktree, the command, related) In the context of Git, the word "worktree" does have a specific meaning that is different from working tree. Stepping back a bit, what does this thing do you are introducing? And what does the other thing do that J6t is using, that would get confused with this new one? What does the other one do? "Declare that the contents of this path is in this encoding"? As opposed to the new one, which tells Git to "run iconv from and to this encoding when checking out and checking in"? If so, any phrase that depends heavily on the word "encode" would not help differenciating the two uses. The phrase needs to be something that contrasts the new one, which actively modifies things (what is on the filesystem is not what is stored in the object store), with the old one, which does not (passed as a declaration to a viewer what encoding the contents already use and does not change anything). Do people who will use this feature familiar with the concept of smudge/clean? If you want to avoid "working-tree" (or "worktree", which definitely you would want to avoid) because you fear confused users, perhaps "smudge-encoding" would work (we declare that the result of smudge operations are left in this encoding, so the opposite operation "clean" will do the reverse---and we say this without explicitly saying that the other end of the conversion is always UTF-8)? Or "checkout-encoding" (the same explanation; we do not say the opposite operation "checkin/add" will do the reverse). I personally do not think "working-tree-encoding" is too horrible, but I do agree that some users may be confused. So I dunno.
Re: [PATCH v1] convert: add support for 'encoding' attribute
On Mon, Dec 11, 2017 at 6:47 PM, Lars Schneiderwrote: > On 11 Dec 2017, at 19:39, Eric Sunshine wrote: >> On Mon, Dec 11, 2017 at 10:50 AM, wrote: >>> From: Lars Schneider >>> >>> Git and its tools (e.g. git diff) expect all text files in UTF-8 >>> encoding. Git will happily accept content in all other encodings, too, >>> but it might not be able to process the text (e.g. viewing diffs or >>> changing line endings). >>> >>> Add an attribute to tell Git what encoding the user has defined for a >>> given file. If the content is added to the index, then Git converts the >>> content to a canonical UTF-8 representation. On checkout Git will >>> reverse the conversion. >>> >>> Reviewed-by: Patrick Lühne >>> Signed-off-by: Lars Schneider >>> --- >>> +static int encode_to_git(const char *path, const char *src, size_t src_len, >>> +struct strbuf *buf, struct encoding *enc) >>> +{ >>> + if (enc->to_git == invalid_conversion) { >>> + enc->to_git = iconv_open(default_encoding, encoding->name); >>> + if (enc->to_git == invalid_conversion) >>> + warning(_("unsupported encoding %s"), >>> encoding->name); >>> + } >>> + >>> + if (enc->to_worktree == invalid_conversion) >>> + enc->to_worktree = iconv_open(encoding->name, >>> default_encoding); >> >> Do you need to be calling iconv_close() somewhere on the result of the >> iconv_open() calls? [Answering myself after reading the rest of the >> patch: You're caching these opened 'iconv' descriptors, so you don't >> plan on closing them.] > > Should this information go into the commit message to avoid confusing > future readers? I think, yes. Maybe. However, the code which does the actual caching is so distant from these iconv_open() invocations that it might be more helpful to have an in-code comment here saying that the "missing" iconv_close() invocations is intentional.
Re: [PATCH v1] convert: add support for 'encoding' attribute
On 11 Dec 2017, at 19:39, Eric Sunshinewrote: > On Mon, Dec 11, 2017 at 10:50 AM, wrote: >> From: Lars Schneider >> >> Git and its tools (e.g. git diff) expect all text files in UTF-8 >> encoding. Git will happily accept content in all other encodings, too, >> but it might not be able to process the text (e.g. viewing diffs or >> changing line endings). >> >> Add an attribute to tell Git what encoding the user has defined for a >> given file. If the content is added to the index, then Git converts the >> content to a canonical UTF-8 representation. On checkout Git will >> reverse the conversion. >> >> Reviewed-by: Patrick Lühne >> Signed-off-by: Lars Schneider >> --- >> diff --git a/convert.c b/convert.c >> @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct >> text_stat *stats, >> +static int encode_to_git(const char *path, const char *src, size_t src_len, >> +struct strbuf *buf, struct encoding *enc) >> +{ >> +#ifndef NO_ICONV >> + char *dst, *re_src; >> + int dst_len, re_src_len; >> + >> + /* >> +* No encoding is specified or there is nothing to encode. >> +* Tell the caller that the content was not modified. >> +*/ >> + if (!enc || (src && !src_len)) >> + return 0; >> + >> + /* >> +* Looks like we got called from "would_convert_to_git()". >> +* This means Git wants to know if it would encode (= modify!) >> +* the content. Let's answer with "yes", since an encoding was >> +* specified. >> +*/ >> + if (!buf && !src) >> + return 1; >> + >> + if (enc->to_git == invalid_conversion) { >> + enc->to_git = iconv_open(default_encoding, encoding->name); >> + if (enc->to_git == invalid_conversion) >> + warning(_("unsupported encoding %s"), >> encoding->name); >> + } >> + >> + if (enc->to_worktree == invalid_conversion) >> + enc->to_worktree = iconv_open(encoding->name, >> default_encoding); > > Do you need to be calling iconv_close() somewhere on the result of the > iconv_open() calls? [Answering myself after reading the rest of the > patch: You're caching these opened 'iconv' descriptors, so you don't > plan on closing them.] Should this information go into the commit message to avoid confusing future readers? I think, yes. >> + [...] >> + /* >> +* Encode dst back to ensure no information is lost. This wastes >> +* a few cycles as most conversions are round trip conversion >> +* safe. However, content that has an invalid encoding might not >> +* match its original byte sequence after the UTF-8 conversion >> +* round trip. Let's play safe here and check the round trip >> +* conversion. >> +*/ >> + re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, >> _src_len); >> + if (!re_src || strcmp(src, re_src)) { > > You're using strcmp() as opposed to memcmp() because you expect > 're_src' will unconditionally be UTF-8-encoded, right? Yes. But since you mention it, I think it would be better to use memcmp() here! Plus, it might be a tiny bit faster ;-) Thanks, Lars
Re: [PATCH v1] convert: add support for 'encoding' attribute
On 11 Dec 2017, at 21:47, Johannes Sixtwrote: > Am 11.12.2017 um 16:50 schrieb lars.schnei...@autodesk.com: >> From: Lars Schneider >> Git and its tools (e.g. git diff) expect all text files in UTF-8 >> encoding. Git will happily accept content in all other encodings, too, >> but it might not be able to process the text (e.g. viewing diffs or >> changing line endings). >> Add an attribute to tell Git what encoding the user has defined for a >> given file. If the content is added to the index, then Git converts the >> content to a canonical UTF-8 representation. On checkout Git will >> reverse the conversion. >> Reviewed-by: Patrick Lühne >> Signed-off-by: Lars Schneider >> --- >> Hi, >> here is a WIP patch to add text encoding support for files encoded with >> something other than UTF-8 [RFC]. >> The 'encoding' attribute is already used to view blobs in gitk. That >> could be a problem as the content is stored in Git with the defined >> encoding. This patch would interpret the content as UTF-8 encoded and > > This will be a major drawback for me because my code base stores text files > that are not UTF-8 encoded. And I do use the existing 'encoding' attribute to > view the text in git-gui and gitk. Repurposing this attribute name is not an > option, IMO. I understand your point of view and I kind of expected that that reply. Thanks for the feedback! Question is: Given that "encoding" is not available, how could I name the attribute without confusing the user? I contemplated: - "enc" or "encode" because "eol" and "ident" use abbreviations, too (enc could be confused with encryption. plus, a user might ask what is the difference between "enc" and "encoding" attribute :-) - "wte", "wtenc", or "worktree-encoding" to emphasize that this is the encoding used in the worktree (I fear that users think that is git-worktree, the command, related) I think my favorite is "worktree-encoding". What do you think? Thanks, Lars BTW: I am curios, can you share what encoding you use? My main use case is UTF-16 and I was surprised that I haven't found a single public repo on github.com with "encoding=utf-16"
Re: [PATCH v1] convert: add support for 'encoding' attribute
Am 11.12.2017 um 16:50 schrieb lars.schnei...@autodesk.com: From: Lars SchneiderGit and its tools (e.g. git diff) expect all text files in UTF-8 encoding. Git will happily accept content in all other encodings, too, but it might not be able to process the text (e.g. viewing diffs or changing line endings). Add an attribute to tell Git what encoding the user has defined for a given file. If the content is added to the index, then Git converts the content to a canonical UTF-8 representation. On checkout Git will reverse the conversion. Reviewed-by: Patrick Lühne Signed-off-by: Lars Schneider --- Hi, here is a WIP patch to add text encoding support for files encoded with something other than UTF-8 [RFC]. The 'encoding' attribute is already used to view blobs in gitk. That could be a problem as the content is stored in Git with the defined encoding. This patch would interpret the content as UTF-8 encoded and This will be a major drawback for me because my code base stores text files that are not UTF-8 encoded. And I do use the existing 'encoding' attribute to view the text in git-gui and gitk. Repurposing this attribute name is not an option, IMO. it would try to reencode it to the defined encoding on checkout > Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251"). These folks would see errors like these with my patch: error: failed to encode 'foo.bar' from utf-8 to cp1251 -- Hannes
Re: [PATCH v1] convert: add support for 'encoding' attribute
On Mon, Dec 11, 2017 at 10:50 AM,wrote: > From: Lars Schneider > > Git and its tools (e.g. git diff) expect all text files in UTF-8 > encoding. Git will happily accept content in all other encodings, too, > but it might not be able to process the text (e.g. viewing diffs or > changing line endings). > > Add an attribute to tell Git what encoding the user has defined for a > given file. If the content is added to the index, then Git converts the > content to a canonical UTF-8 representation. On checkout Git will > reverse the conversion. > > Reviewed-by: Patrick Lühne > Signed-off-by: Lars Schneider > --- > diff --git a/convert.c b/convert.c > @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct > text_stat *stats, > +static int encode_to_git(const char *path, const char *src, size_t src_len, > +struct strbuf *buf, struct encoding *enc) > +{ > +#ifndef NO_ICONV > + char *dst, *re_src; > + int dst_len, re_src_len; > + > + /* > +* No encoding is specified or there is nothing to encode. > +* Tell the caller that the content was not modified. > +*/ > + if (!enc || (src && !src_len)) > + return 0; > + > + /* > +* Looks like we got called from "would_convert_to_git()". > +* This means Git wants to know if it would encode (= modify!) > +* the content. Let's answer with "yes", since an encoding was > +* specified. > +*/ > + if (!buf && !src) > + return 1; > + > + if (enc->to_git == invalid_conversion) { > + enc->to_git = iconv_open(default_encoding, encoding->name); > + if (enc->to_git == invalid_conversion) > + warning(_("unsupported encoding %s"), encoding->name); > + } > + > + if (enc->to_worktree == invalid_conversion) > + enc->to_worktree = iconv_open(encoding->name, > default_encoding); Do you need to be calling iconv_close() somewhere on the result of the iconv_open() calls? [Answering myself after reading the rest of the patch: You're caching these opened 'iconv' descriptors, so you don't plan on closing them.] > + [...] > + /* > +* Encode dst back to ensure no information is lost. This wastes > +* a few cycles as most conversions are round trip conversion > +* safe. However, content that has an invalid encoding might not > +* match its original byte sequence after the UTF-8 conversion > +* round trip. Let's play safe here and check the round trip > +* conversion. > +*/ > + re_src = reencode_string_iconv(dst, dst_len, enc->to_worktree, > _src_len); > + if (!re_src || strcmp(src, re_src)) { You're using strcmp() as opposed to memcmp() because you expect 're_src' will unconditionally be UTF-8-encoded, right? > + die(_("encoding '%s' from %s to %s and back is not the same"), > + path, enc->name, default_encoding); > + } > + free(re_src); > + > + strbuf_attach(buf, dst, dst_len, dst_len + 1); > + return 1; > +#else > + warning(_("cannot encode '%s' from %s to %s because " > + "your Git was not compiled with encoding support"), > + path, enc->name, default_encoding); > + return 0; > +#endif > +}
[PATCH v1] convert: add support for 'encoding' attribute
From: Lars SchneiderGit and its tools (e.g. git diff) expect all text files in UTF-8 encoding. Git will happily accept content in all other encodings, too, but it might not be able to process the text (e.g. viewing diffs or changing line endings). Add an attribute to tell Git what encoding the user has defined for a given file. If the content is added to the index, then Git converts the content to a canonical UTF-8 representation. On checkout Git will reverse the conversion. Reviewed-by: Patrick Lühne Signed-off-by: Lars Schneider --- Hi, here is a WIP patch to add text encoding support for files encoded with something other than UTF-8 [RFC]. The 'encoding' attribute is already used to view blobs in gitk. That could be a problem as the content is stored in Git with the defined encoding. This patch would interpret the content as UTF-8 encoded and it would try to reencode it to the defined encoding on checkout. Plus, many repos define the attribute very broad (e.g. "* encoding=cp1251"). These folks would see errors like these with my patch: error: failed to encode 'foo.bar' from utf-8 to cp1251 A quick search on GitHub reveals 2,722 repositories that use the 'encoding' attribute [1]. Using the GitHub API [2], I identified the following encodings in all publicly accessible repositories: ANSI<-- garbage (ignore by my implementation) cp1251 cp866 cp950 iso8859-1 koi8-r shiftjis<-- garbage (ignore by my implementation) UTF-8 <-- unnecessary (ignore by my implementation) utf8<-- garbage (ignore by my implementation) TODOs: - The iconv docs mention that "errno == EINVAL" is harmless but we don't handle that case in utf8.c:reencode_string_iconv() - Git does not compile with NO_ICONV=1 right now because of compat/precompose_utf8.c. I will send a patch to fix that. Questions: - Should I use warning() or error() in the patch? (currently I use the warning) - Do you agree with the approach in general? Thanks, Lars [RFC] http://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com [1] https://github.com/search?p=1=encoding+filename%3Agitattributes=Code=%E2%9C%93 [2] curl --user larsxschneider:secret -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/code?q=encoding+in:file+filename:gitattributes' | jq -r '.items[].text_matches[].fragment' | sed 's/.*encoding=//' | sort | uniq [3] https://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html#iconv-Examples Notes: Base Ref: master Web-Diff: https://github.com/larsxschneider/git/commit/afc9e88a4d Checkout: git fetch https://github.com/larsxschneider/git encoding-v1 && git checkout afc9e88a4d Documentation/gitattributes.txt | 27 ++ convert.c | 192 +++- t/t0028-encoding.sh | 60 + 3 files changed, 278 insertions(+), 1 deletion(-) create mode 100755 t/t0028-encoding.sh diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index 30687de81a..84de2fa49c 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -146,6 +146,33 @@ Unspecified:: Any other value causes Git to act as if `text` has been left unspecified. +`encoding` +^^ + +By default Git assumes UTF-8 encoding for text files. The `encoding` +attribute sets the encoding to be used in the working directory. +If the path is added to the index, then Git encodes the content to +UTF-8. On checkout the content is encoded back to the original +encoding. Consequently, you can use all built-in Git text processing +tools (e.g. git diff, line ending conversions, etc.) with your +non-UTF-8 encoded file. + +Please note that re-encoding content can cause errors and requires +resources. Use the `encoding` attribute only if you cannot store +a file in UTF-8 encoding and if you want Git to be able to process +the text. + + +*.txt text encoding=UTF-16 + + +All `iconv` encodings with a stable round-trip conversion to and from +UTF-8 are supported. You can see a full list with the following command: + + +iconv --list + + `eol` ^ diff --git a/convert.c b/convert.c index 20d7ab67bd..ee19c17104 100644 --- a/convert.c +++ b/convert.c @@ -7,6 +7,7 @@ #include "sigchain.h" #include "pkt-line.h" #include "sub-process.h" +#include "utf8.h" /* * convert.c - convert a file when checking it out and checking it in. @@ -256,6 +257,149 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, } +#ifdef NO_ICONV +#ifndef _ICONV_T +/* The type is just a placeholder and not actually used. */ +typedef void* iconv_t; +#endif +#endif + +static struct encoding { + const char *name; +