On 28.04.14 21:35, Jeff King wrote:
> On Mon, Apr 28, 2014 at 12:17:28PM -0700, Junio C Hamano wrote:
>>>   3. Convert index filenames to their precomposed form when
>>>      we read the index from disk. This would be efficient,
>>>      but we would have to be careful not to write the
>>>      precomposed forms back out to disk.
>> I think this may be the right approach, especially if you are going
>> to do this only when core.precomposeunicode is set.
>> The reasoning behind the "we would have to be careful not to write"
>> part is unclear to me, though.  Don't decomposing filesystems
>> perform the mangling from the precomposed form without even being
>> asked to do so, just like a case insensitive filesystem will
>> overwrite an existing "makefile" on a request to write to
>> "Makefile"?
> Sorry, I meant "do not write the precomposed forms back out to the
> on-disk index". And by extension, do not update cache-tree and write
> them out to git trees.
> IOW, it is not enough to just set cache_entry->name to the normalized
> form. You'd need to store both.
> Since such entries are in the minority, and because cache_entry is
> already a variable-length struct, I think you could get away with
> sticking it after the "name" field, and then comparing like:
>   const char *ce_normalized_name(struct cache_entry *ce, size_t *len)
>   {
>       const char *ret;
>       /* Normal, fast path */
>       if (!(ce->ce_flags & CE_NORMALIZED_NAME)) {
>               *len = ce_namelen(ce);
>               return ce->name;
>       }
>       /* Slow path for normalized names */
>       ret = ce->name + ce_namelen(ce) + 1;
>       *len = strlen(ret);
>       return ret;
>   }
> The strlen is probably OK since such paths are presumably in the
> minority (even for UTF-8 paths, we can avoid storing the extra copy if
> they do not need any normalization). Or we could get fancy and encode
> the length in front, but I am not sure it is worth the complexity.
> Anyway, the tricky part is then making sure that all cache_entry name
> comparisons use ce_normalized_name instead of ce->name.
> -Peff
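
To make the layout described above concrete, here is a minimal standalone
sketch. It is not Git code: struct demo_entry, DEMO_NORMALIZED_NAME,
demo_entry_new() and demo_normalized_name() are hypothetical names made up
for illustration, and the real struct cache_entry has many more fields.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag and struct; NOT git's real struct cache_entry. */
#define DEMO_NORMALIZED_NAME 0x1u

struct demo_entry {
	unsigned int flags;
	size_t namelen;   /* length of the on-disk name, excluding the NUL */
	char name[];      /* "name\0", optionally followed by "normalized\0" */
};

/* Allocate an entry; pass normalized == NULL when no extra copy is needed. */
static struct demo_entry *demo_entry_new(const char *name, const char *normalized)
{
	size_t namelen = strlen(name);
	size_t extra = normalized ? strlen(normalized) + 1 : 0;
	struct demo_entry *e = malloc(sizeof(*e) + namelen + 1 + extra);

	if (!e)
		return NULL;
	e->flags = normalized ? DEMO_NORMALIZED_NAME : 0;
	e->namelen = namelen;
	memcpy(e->name, name, namelen + 1);
	if (normalized)
		memcpy(e->name + namelen + 1, normalized, extra);
	return e;
}

/* Name to use for comparisons; the on-disk name in e->name stays untouched. */
static const char *demo_normalized_name(const struct demo_entry *e, size_t *len)
{
	const char *ret;

	/* Fast path: no separate normalized copy was stored. */
	if (!(e->flags & DEMO_NORMALIZED_NAME)) {
		*len = e->namelen;
		return e->name;
	}
	/* Slow path: the normalized copy sits right after the name's NUL. */
	ret = e->name + e->namelen + 1;
	*len = strlen(ret);
	return ret;
}
```

Whatever writes the entry back out would keep using e->name and e->namelen,
so the precomposed copy never reaches the on-disk index or the trees.
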
To my knowledge, repos with decomposed unicode should be rare in practice.
I can only speak for European (or Latin-based) and Cyrillic languages myself:

- It is difficult (but not impossible) to enter decomposed unicode on the 
- Some programs under Mac OS X do not handle decomposed code points well,
  an "ä" may be displayed as "¨a" for example.
- Pushing and pulling to Windows or Linux is possible, but the same problems 
  the keyboard is not prepared to enter the decomposed form, and the display 
  may be wrong.
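
For reference, a small sketch of what the two forms of "ä" look like as
UTF-8 bytes, and why a plain byte-wise comparison treats them as different
names. The identifiers (precomposed, decomposed, same_bytes) are only for
illustration and are not taken from Git.

```c
#include <string.h>

/* "ä" as the single precomposed code point U+00E4, encoded in UTF-8 */
static const char precomposed[] = "\xc3\xa4";

/* "ä" as "a" (U+0061) followed by the combining diaeresis U+0308, in UTF-8 */
static const char decomposed[] = "a\xcc\x88";

/* Byte-wise comparison, as a plain strcmp() on path names would do it. */
static int same_bytes(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

Here same_bytes(precomposed, decomposed) returns 0: the precomposed form is
2 bytes long and the decomposed form is 3, although both are meant to be
displayed as the same character.
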

The only possible use case for decomposed unicode I am aware of is when you use 
because bzr does not do the precomposition (and neither hg to my knowledge).

So for me the test case could make sense, even if I think that nobody (TM)
uses an old Git version under Mac OS X which is not able to handle
precomposed unicode.

Unless I have missed something.
