Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Subversion Wiki" for 
change notification.

The "NonNormalizingUnicodeCompositionAwareness" page has been changed by Thomas 
Åkesson:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness?action=diff&rev1=10&rev2=11

  
  There could be a performance impact. [Need more data] However, the 'add' 
operation is not one of the most frequent ones, in a typical installation.
  {{{#!wiki note
- The major impact would not stem from collision avoidance on `add` but 
normalization during directory search, which affects most other operations. For 
the server, it is probably better to store names twice (original for display 
and normalized for indexing) rather than normalize on every lookup.}}}
+ The major impact would not stem from collision avoidance on `add` but 
normalization during directory search, which affects most other operations. For 
the server, it is probably better to store names twice (original for display 
and normalized for indexing) rather than normalize on every lookup.
+ 
+ ThomasAkesson: It might be better to store names twice, but I don't see why 
the server needs to do normalization during directory search? That would be a 
client side task in this proposal. 
+ }}}
  
  It is not possible to rely on client behavior. A Subversion server can be 
accessed via mod_dav_svn and by older Subversion clients.
  
@@ -100, +103 @@

  
  It might be more feasible to implement such an abstraction now in wc-ng than 
it was in Subversion <=1.6. 
  
- TODO: This section needs input from someone more familiar with wc-ng database 
design.
  
- === WC Database Columns ===
+ === Alternative Approaches ===
  
- Columns of interest in wc.db:
+ There are different approaches to implementing this abstraction of paths. The 
following have been identified so far, each with its Wiki page:
  
-  * The repository path as stored on server: repos_path (e.g. 
"project/dir/file.txt")
+  * WC Database columns: UnicodeClientColumns
+  * SQLite collation: UnicodeCollation
  
-  * The local path from WC root to node: local_relpath (e.g. "dir/file.txt")
+ The following sections apply to all of the approaches above. 
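
As a rough illustration of the SQLite collation approach, the sketch below registers a composition-aware collation and uses it in a lookup. It is only a sketch under stated assumptions: the one-column `nodes` table is a simplified stand-in for wc.db, and the collation name `unicode_nfc` and the helper `nfc_compare` are hypothetical names, not part of Subversion.

```python
import sqlite3
import unicodedata


def nfc_compare(a, b):
    """Compare two strings by their NFC forms; returns -1, 0 or 1."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    return (a > b) - (a < b)


conn = sqlite3.connect(":memory:")
# Hypothetical collation name; SQLite calls nfc_compare for each comparison.
conn.create_collation("unicode_nfc", nfc_compare)

# Simplified stand-in for the wc.db table (assumption: one column only).
conn.execute("CREATE TABLE nodes (local_relpath TEXT PRIMARY KEY)")
# Stored decomposed, in the form HFS+ would hand the name back.
conn.execute("INSERT INTO nodes VALUES (?)", ("dir/A\u030akesson.txt",))

# Look the entry up with the composed form, as typed on a keyboard.
composed = "dir/\u00c5kesson.txt"
binary_row = conn.execute(
    "SELECT local_relpath FROM nodes WHERE local_relpath = ?",
    (composed,),
).fetchone()
aware_row = conn.execute(
    "SELECT local_relpath FROM nodes "
    "WHERE local_relpath = ? COLLATE unicode_nfc",
    (composed,),
).fetchone()
```

The default binary comparison misses the entry, while the collated query finds it and returns the path exactly as stored. In this sketch the collated lookup cannot use the binary primary-key index, so it scans the table.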
  
-  * The local path from WC root to node parent: parent_relpath (e.g. "dir")
- 
- All three paths are in UTF-8 but NFC/NFD is not currently specified. 
local_relpath/parent_relpath get converted from UTF-8 to whatever locale 
encoding is in use whenever they are used to access the filesystem.
- 
- Takesson: Is this conversion done on the fly every time? I am guessing this 
works because locale encoding is a reversible process , otherwise lookups in 
the database would fail?
- 
- An abstraction between the repository path and the file system path can be 
achieved by ensuring that there is a column in wc.db that contains the file 
system path in exactly the same form that the file system gives back. APIs in 
wc needs to be extended to ensure that all interaction with the file system is 
performed with the file system path.
- 
- 
- ==== Alternative 1: Redefine local_relpath ====
- 
- Redefine the existing column local_relpath to contain the path as stored in 
the file system. Code that currently relies on local_relpath being a substring 
of repos_path needs to be adjusted. E.g. a node might be considered switched 
when this condition is not met.
- 
- It would generally be desirable to use repos_path when referring to entries 
rather than local_relpath.
- 
- This alternative can be simulated using the attached script 
localrelpath2nfd.sh. This provides a Working Copy equivalent to what a checkout 
should produce if this alternative was implemented in Subversion itself:
-  * svn co ...
-  * svn stat #Shows any problematic items
-  * localrelpath2nfd.sh
-  * svn stat #Should be clean apart from misperception that some items are 
switched
- 
- TODO: provide a dump file with suitable test data. 
- 
- ==== Alternative 2: Introduce local_relpath_disk ====
- 
- A new column, local_relpath_disk, is added that contains the path as stored 
in the file system. This column will be used on all systems to interact with 
the file system. Currently, the content of columns local_relpath and  
local_relpath_disk will be identical on all file systems except HFS+.
- 
- I guess this would require parent_relpath_disk as well?  Or would you plan to 
use the local_relpath==parent_relpath row to get local_relpath_disk for 
parent_relpath?
- 
- Takesson: thanks for pointing that out. I will update both alternatives, alt 
1 redefining both and alt 2 "duplicating" both. 
  
  
  === Normalized uniqueness ===
  
- Repository path uniqueness should be checked in normalized form during add 
operations, in order to prevent new "normalized-name collisions" as early as 
possible. It might be acceptable to identify this later during commit, since it 
is a quite rare condition.
+ Repository path uniqueness should be checked in normalized form during add 
operations, in order to prevent new "normalized-name collisions" as early as 
possible. It might be acceptable to identify this later during commit, since 
very few users will encounter this condition. At the latest, it will be 
identified by the server (with the above change). 
  
- When an existing "normalized-name collision" arrives to a Working Copy on 
HFS+ via checkout or update, there will be a uniqueness issue in the column 
local_relpath/local_relpath_disk and a situation somewhat similar to an 
obstruction. This should be communicated in some friendly way, similar to 
conflicts on case-insensitive file systems.
+ When an existing "normalized-name collision" arrives in a Working Copy on 
HFS+ via checkout or update, there will be a uniqueness issue in the column 
local_relpath (queried with collation) or in local_relpath_disk, and a 
situation somewhat similar to an obstruction. This should be communicated in 
some friendly way, similar to conflicts on case-insensitive file systems.
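
A minimal sketch of what a normalized uniqueness check could look like, assuming NFC is used as the comparison form (the proposal does not fix a form). The function name `find_normalized_collisions` is hypothetical and is not a Subversion API.

```python
import unicodedata
from collections import defaultdict


def find_normalized_collisions(paths):
    """Return groups of distinct paths that differ only in composition."""
    groups = defaultdict(list)
    # dict.fromkeys drops exact duplicates while keeping input order.
    for path in dict.fromkeys(paths):
        groups[unicodedata.normalize("NFC", path)].append(path)
    return [names for names in groups.values() if len(names) > 1]


composed = "dir/\u00c5kesson.txt"      # precomposed A WITH RING ABOVE
decomposed = "dir/A\u030akesson.txt"   # 'A' followed by COMBINING RING ABOVE
collisions = find_normalized_collisions(
    [composed, decomposed, "dir/file.txt"]
)
```

The two spellings compare unequal byte for byte but fall into the same group, which is the condition an add (or, at the latest, the server) would need to reject.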
- 
  
  === Pristine Storage ===
  
@@ -155, +127 @@

  
  === Command Line ===
  
- When referring to WC entries using the command line on Mac OSX, the 
tab-completion works unreliably because the keyboard typically produces 
composed characters while files are NFD. The tab completion is a general Mac 
OSX issue which should be addressed by Apple. However, Subversion could be 
helpful when attempting to identify entries referred to via the command line. 
+ When referring to WC entries using the command line on Mac OS X, 
tab-completion works unreliably because the keyboard typically produces 
composed characters while files are NFD. Tab completion is a general Mac OS X 
issue which should be addressed by Apple, specifically the case where the user 
types the beginning of a name that includes a composed character (which 
currently matches nothing on disk). However, Subversion could be helpful when 
attempting to identify entries referred to via the command line. 
  
-  * Subversion must recognize paths that match the file system Unicode path 
(even if it does not match the repository path). Failure to do so makes 
tab-completion unusable.
+ * Subversion must recognize paths that match the file system Unicode path 
(even if it does not match the repository path). Failure to do so makes 
tab-completion unusable, especially on Mac OS X. 
-   * Paths on the command line should be matched against 
local_relpath/local_relpath_disk. 
  
-  * Subversion should as a fallback (optional) recognize paths that match the 
repository Unicode path. Failure to do so might make scripts less portable and 
might require the use of tab-completion in order to reference entries.
+ * Subversion must recognize paths that match the repository path in NFC. 
Failure to do so might make scripts less portable and might require the use of 
tab-completion in order to reference non-NFC entries (since keyboard input is 
typically NFC). E.g. the name of a file added on Mac OS X currently cannot be 
typed on other OSes (or on any OS, actually). 
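
The two requirements above can be sketched as a lookup that tries the exact path first and then falls back to a composition-insensitive comparison. This is an illustration only: `resolve_target` is a hypothetical helper, the entry list stands in for whatever the WC database returns, and NFC is assumed as the comparison form.

```python
import unicodedata


def resolve_target(typed, entries):
    """Return the entry that `typed` refers to, or None.

    An exact match wins; otherwise entries are compared in NFC form.
    """
    if typed in entries:
        return typed
    wanted = unicodedata.normalize("NFC", typed)
    matches = [
        entry for entry in entries
        if unicodedata.normalize("NFC", entry) == wanted
    ]
    # More than one match means a normalized-name collision: do not guess.
    return matches[0] if len(matches) == 1 else None


# Entry names in decomposed form, as HFS+ reports them.
on_disk = ["A\u030akesson.txt", "README"]
# The user types the composed form on the keyboard.
resolved = resolve_target("\u00c5kesson.txt", on_disk)
```

Returning None for an ambiguous match keeps the fallback from silently picking one side of an existing normalized-name collision.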
  
+ 
+ === Hashtables in WC-NG ===
+ 
+ Bert has mentioned expected issues related to hashtables. 
+ 
+ TODO: Please elaborate on when they are used and approximately where in the 
codebase. 
+ 
+ 
- === Subcommand Changes ===
+ === Subcommand Status ===
  
- Specific changes to svn subcommands are outlined below. 
+ Current issues with svn subcommands related to Unicode composition are 
outlined below.
  
- All commands that access files in the Working Copy must do so by getting the 
path from the column local_relpath/local_relpath_disk. 
+ The investigations below were made on svn 1.7.x. 
  
- TODO: Investigate which subcommands currently use local_relpath for other 
purposes than accessing the file. With alternative 1 (above), it will NOT be 
acceptable to use local_relpath for comparison/substring operations with other 
paths, e.g. repos_path.
- 
- 
- ==== Checkout/Update ====
+ ==== Checkout ====
  
+ Completes, but creates a "broken" WC, see Status below. 
- When adding paths to the WC, determine the actual filesystem path and store 
that in local_relpath/local_relpath_disk. This is actually only required on 
OSX. How can this be done? 
-  * Do we get a handle back from the filesystem after creating a file/dir that 
can be queried for the path?
-  * Use platform dependent APIs to establish the expected path.
-  * Alternatively, first look for the exact same path (will find the one on 
most filesystems) then fall back to globbing with Unicode composition aware 
comparison.
  
- TODO: Do we need to process paths that are not actually checked out due to 
the depth setting?
+ ==== Update ====
  
+ Issues are related to the status issues when reporting the WC. Other issues?
  
  ==== Status ====
  
- The status subcommand incorrectly reports externals when manually adjusting 
local_relpath to match the filesystem.
+ The status subcommand reports one unversioned and one missing entry for each 
non-NFD name on Mac OS X. This reflects the general WC issues with HFS+. 
  
- TODO: Clarify if status performs string comparisons between local_relpath and 
some other path.
  
- TODO: how does status show a file whose name changed to a value that 
canonicalizes to the same value as the original name? (is that possible?)
+ ==== Add ====
  
- ==== Add and mkdir ====
+ Works and creates an entry with the same composition as on disk. 
  
  Since this approach does not dictate a Normalized repository storage, the add 
subcommand should not perform any normalization.
  
- The uniqueness test should be Unicode aware to avoid a "normalized-name 
collision". This is not vital but desirable for better usability (has no effect 
on Mac OSX since it is not possible to create such collisions).
  
- TODO: Anything else?
+ ==== mkdir ====
+ 
+ TODO: Test. Suspect this might fail.
  
  
  ==== Commit ====
  
+ Seems to work. 
- No specific changes expected.
- 
- TODO: Confirm.
- 
- ==== Changelist ====
- 
- Changelists should use repos_path to refer to entries, unless already the 
case.
- 
  
  ==== ... ====
  
@@ -224, +191 @@

  
  {{{#!wiki note
  In a URL there are several different parts: the hostname, the <Location> 
(httpd only), the repository relpath(ra_svn) or basename(ra_dav with 
SVNParentPath), and the fspath.  Some of them might also be subject to 
canonicalization issues (eg: repos basename as handled by Mac mod_dav_svn).
+ 
+ ThomasAkesson: Can we accept the limitation of not allowing decomposable 
characters in these parts? They are defined by administrators while paths 
inside repositories are defined by users. 
  }}}
  
  == Use Cases ==
