Re: [RFC] Non-normalizing Unicode Composition Awareness

Thomas Åkesson Mon, 16 Apr 2012 20:25:28 -0700

Hi,
A bit of a status update on the wiki article:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness


I received some comments from Daniel, which I have tried to address. Thanks.

I have written a bash script which demonstrates the concept of "Alternative 1" 
with regard to how the local_relpath column is handled by checkout/update. 

From the wiki:
---
This alternative can be simulated using the attached script 
localrelpath2nfd.sh. This provides a Working Copy equivalent to what a checkout 
should produce if this alternative were implemented in Subversion itself:

svn co ...
svn stat #Shows any problematic items as missing and unversioned
localrelpath2nfd.sh
svn stat #Should be clean apart from misperception that some items are switched
---

This script can be used to investigate how other subcommands are affected and 
determine what needs to be done. It is possible to make commits, but updates to 
normalisation-dependent nodes will fail because the script runs outside the 
update code. 
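
To illustrate the idea only (this is not the attached script, and it does not 
touch wc.db), here is a minimal Python sketch of keeping a repository path and 
its file system form side by side. The paths and the names to_disk_path and 
path_map are made up for the example, and standard NFD is used as an 
approximation of what HFS+ hands back:

```python
import unicodedata


def to_disk_path(repos_relpath: str) -> str:
    """Approximate the form an NFD-normalizing file system would hand back."""
    return unicodedata.normalize("NFD", repos_relpath)


# Hypothetical repository paths; the first uses a precomposed (NFC) "Å".
repos_paths = ["docs/\u00c5kesson.txt", "docs/readme.txt"]

# Repository path -> file system path. Neither side is rewritten in place,
# so the repository form is preserved (non-normalizing).
path_map = {p: to_disk_path(p) for p in repos_paths}

for repos_path, disk_path in path_map.items():
    if repos_path == disk_path:
        status = "identical"
    else:
        status = "differs by composition only"
    print(f"{repos_path!r} -> {disk_path!r}: {status}")
```

A plain string comparison of the two forms of the first path fails, which is 
consistent with the first "svn stat" above reporting such items as both missing 
and unversioned.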

I intend to use this script to take the design to the next level of detail. 
First, I would like some feedback from people with in-depth knowledge of the WC 
and preferably get some idea of what the community thinks about the approach. 

/Thomas Å. 


On 26 mar 2012, at 04:14, Thomas Åkesson wrote:

> Hi,
> Sorry about the delay, had a release to sort out...
> 
> I have moved the proposal into the wiki:
> http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness
> 
> The comments from Julian and Markus have been implemented and I have added 
> more information to the "Client Changes" section as well as more structure 
> and TODO-notes. 
> 
> I would really appreciate it if someone with more insight into WC-NG could 
> provide input on some of the TODO items (or things that have been completely 
> overlooked).
> 
> Thanks,
> Thomas Å.
> 
> 
> On 21 feb 2012, at 09:55, Daniel Shahaf wrote:
> 
>> I've granted you write access to the wiki.
>> 
>> Thomas Åkesson wrote on Tue, Feb 14, 2012 at 12:36:23 +0100:
>>> Thanks Julian and Markus for providing feedback. 
>>> 
>>> I am not commenting below because all the feedback is very good and I will 
>>> try to address it as best I can in the next iteration. Describing the 
>>> behaviour changes to the WC is the most challenging since I lack that kind 
>>> of detailed knowledge. I will instead try to draft the structure of that 
>>> section to make it easier for someone with that level of detail to assist.
>>> 
>>> Regarding use cases, what can I say... it was towards the end of a long 
>>> stretch.
>>> 
>>> I think it would help with the upcoming iterations if I could move this 
>>> "document" into the wiki. If you find that this first draft shows promise, 
>>> please consider granting edit access in the wiki. My user name is "Thomas 
>>> Åkesson", which exercises the Unicode awareness of MoinMoin...
>>> 
>>> /Thomas Å.
>>> 
>>> 
>>> On 14 feb 2012, at 11:25, Julian Foad wrote:
>>> 
>>>> Hi Thomas.  It's fantastic that you're taking the trouble to write up this 
>>>> proposal.  That's just what we need.  Just a few initial comments below...
>>>> 
>>>> Thomas Åkesson wrote:
>>>> 
>>>>> Context
>>>>> ===
>>>>> 
>>>>> [...] A unicode string (e.g. a file name) can be represented
>>>>> in 2 normalized forms (NFC/NFD) or mixed, i.e. multiple such
>>>>> characters where some are composed and others decomposed (rare).
>>>> 
>>>> 
>>>> What's "rare"?  We have to assume that input is in mixed composition in 
>>>> any system that doesn't explicitly normalize it, which (I think) includes 
>>>> most operating systems.  While it may be rare for any single string to 
>>>> contain characters in both compositions, it is very common to be 
>>>> processing a string that *might* have characters in both compositions -- 
>>>> in other words, that is not guaranteed to be normalized.  I think it would 
>>>> be clearer to drop the "(rare)" and just say "... normalized forms 
>>>> (NFC/NFD) or mixed (not normalized).".
>>>> 
>>>> 
>>>>> A minority of file systems (currently Mac OS X HFS+ only) will
>>>>> normalize the paths. In the case of HFS+, the path will be
>>>>> normalized into NFD and it will even be given back that way when
>>>>> listing the filesystem. 
>>>> 
>>>> 
>>>> Drop the word "even"?  The statement is not surprising.
>>>> 
>>>> 
>>>> [...]
>>>> 
>>>>> Similarities to case-sensitivity
>>>>> ===
>>>>> 
>>>>> - If two Unicode strings differ only by letter case/composition,
>>>> 
>>>> Drop "/composition" -- it's the subject of the following sentence.
>>>> 
>>>>> on some computer systems they refer to the same file, while on
>>>>> other systems they refer to different files.  The same applies
>>>>> if two Unicode strings differ only by composition.
>>>> 
>>>> 
>>>>> [...]
>>>> 
>>>>> Client Changes
>>>>> ===
>>>>> 
>>>>> [...] An abstraction between the repository path and the file
>>>>> system path can be achieved by ensuring that there is a column
>>>>> in wc.db that contains the file system path in exactly the same
>>>>> form that the file system gives back. APIs in wc need to be
>>>>> extended to ensure that all interaction with the file system is
>>>>> performed with the file system path.
>>>> 
>>>> [...]
>>>> 
>>>> This part seems to be the heart of the whole proposal.  You describe the 
>>>> data that we need, but the behaviour will also need to be described in 
>>>> detail.  Presumably much of the behaviour is boring and obvious (when we 
>>>> check out a new path and create it on disk, we store the disk path), but 
>>>> I'm sure there will be some less obvious parts (do we need to find out 
>>>> what the disk path of an 'excluded' node would be, even though we're not 
>>>> actually creating it on disk, for example).
>>>> 
>>>> 
>>>>> Use Cases
>>>>> ===
>>>>> 
>>>>> This change will only affect use cases which rely on creating
>>>>> paths that look like duplicates but use different Unicode
>>>>> composition. It is highly unlikely anyone is relying on this.
>>>> 
>>>> 
>>>> Uh... it sounds like you are saying there are no interesting use cases for 
>>>> this proposal!  No, on the contrary, this proposal also affects checking 
>>>> out and using a WC on Mac HFS+ where the repository paths were created on 
>>>> another system and are not in NFD, and it allows that case to work.  
>>>> That's the more interesting use case, is it not?  It's definitely worth 
>>>> writing out the interesting case in full, including steps like checkout 
>>>> (or update) that brings in a non-NFD path, create a new file on the Mac, 
>>>> and commit.
>>>> 
>>>> - Julian
>>>> 
>>> 
>
