Re: two equal filenames in one dir

Stefan Sperling Sun, 27 Jan 2013 04:01:23 -0800

On Sun, Jan 27, 2013 at 05:23:05AM -0500, Jiri B wrote:
> On Sun, Jan 27, 2013 at 05:20:14AM -0500, Jiri B wrote:
> > Hello,
> > 
> > I'm confused, how is it possible I have two files with same
> > names in one dir?
> > 
> > $ ls -li
> > total 1245376
> > 3611817 -rw-r--r--  1 jirib  jirib  168392755 Jan 14 23:35 Crostata_Alla_Fruta.mp4
> > 3741698 -rw-r--r--  1 jirib  jirib  165519511 Mar 12  2010 Pizza Margherita-10115892.mp4
> > 3611818 -rw-r--r--  1 jirib  jirib  165519511 Jan 14 23:35 Pizza_Margherita-10115892.mp4
> > 3741699 -rw-r--r--  1 jirib  jirib   68932635 Jul 31 21:02 jablecny kolac-46705666.mp4
> > 3611819 -rw-r--r--  1 jirib  jirib   68932635 Jan 14 23:35 jablecny_kolac-46705666.mp4
> > 
> > $ sysctl kern.version 
> > kern.version=OpenBSD 5.2-current (GENERIC.MP) #20: Mon Jan 21 17:23:23 MST 2013
> >     [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
> 
> IGNORE, I need glasses :D

In this particular case, yes, you need glasses.

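However, it is entirely possible to have two filenames in a directory
that look the same, at least when rendered as text, if you use Unicode.
For example, an accented letter can be encoded either as one precomposed
codepoint or as a base letter followed by a combining accent:
http://en.wikipedia.org/wiki/Precomposed_character#Comparing_precomposed_and_decomposed_characters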
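
As a minimal illustration (a Python sketch, not something from the thread
above; the variable and function names are made up for this example), here
are two strings that render identically but are different codepoint
sequences, and therefore different byte sequences in UTF-8:

```python
import unicodedata

# Two ways to write the same visible character "é".
precomposed = "\u00e9"   # single codepoint: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def describe(s):
    """Return the codepoints of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

# The raw byte sequences differ, so a filesystem that compares names
# byte by byte treats them as two distinct filenames.
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# After normalising both to the same form (NFC here) they compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
```

A filesystem that stores names as plain bytes would accept both as separate
directory entries, and a terminal would most likely draw them the same way.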

IMO this is a design problem in Unicode: some characters have more than
one codepoint representation, i.e. there is no true 1-to-1 mapping between
codepoint sequences and the characters they represent. This may have been
a deliberate choice by the designers of Unicode, but its implications for
software engineers are more serious than one might expect.

To make matters worse, some Unicode-aware filesystems (e.g. Apple's HFS+)
normalise filenames to one particular representation, regardless of which
representation an application actually used to create the filename!
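
A rough sketch of why this bites applications (Python again; the function
store_name_normalised is a hypothetical stand-in for the filesystem, and
standard NFD is used as an assumption, since HFS+ uses its own variant of
decomposition): the name you read back is not the name you wrote.

```python
import unicodedata

def store_name_normalised(name: str) -> str:
    """Hypothetical stand-in for a filesystem that rewrites names.

    Assumption: standard NFD is used here to approximate the behaviour;
    a real filesystem may apply a different, filesystem-specific variant.
    """
    return unicodedata.normalize("NFD", name)

# The application creates a file using the precomposed form...
requested = "caf\u00e9.txt"

# ...but a later directory listing hands back the decomposed form.
stored = store_name_normalised(requested)

# A byte-wise comparison between the two now fails, so the application
# may conclude that its own file is missing and an unknown one appeared.
round_trips = requested.encode("utf-8") == stored.encode("utf-8")
```

This is the kind of mismatch a version control system runs into when it
compares the names it has recorded against what the directory listing says.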

Handling such filenames on HFS+ is a huge problem for e.g. version control
systems whose designers forgot to normalise Unicode filenames at the
application level (which is very easy to forget if you don't know the finer
details of Unicode). Both Subversion and git are affected by this.

git recently introduced a workaround for the issue, but it isn't backwards
compatible with existing repositories (pathnames eventually affect commit
hashes in git), and enabling it globally in a distributed system isn't easy,
so it remains turned off by default. (By "distributed" I mean that the
application doesn't run on a single machine; in that sense svn is also
"distributed" due to its client/server design.)

If you ever design a program that handles UTF-8 data, please consider
whether this problem applies, or someone might get headaches trying to
fix it later. It can be a seriously annoying problem to fix after the fact.
And if you ever design a filesystem please don't follow HFS+'s example,
i.e. don't munge filename data provided by applications.
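
One way an application might guard against this, as a sketch only (Python;
find_normalisation_collisions is a hypothetical helper, and picking NFC as
the comparison form is an assumption, any single form would do as long as
it is used consistently): compare names in normalised form, but leave the
stored names untouched.

```python
import unicodedata
from collections import defaultdict

def find_normalisation_collisions(names):
    """Group names that differ as codepoints but match after NFC.

    Returns a dict mapping the NFC form to the list of distinct
    original spellings, only for forms with more than one spelling.
    """
    groups = defaultdict(list)
    for name in names:
        key = unicodedata.normalize("NFC", name)
        if name not in groups[key]:
            groups[key].append(name)
    return {key: spellings for key, spellings in groups.items()
            if len(spellings) > 1}

# Example input: two spellings of the same visible name plus one plain name.
listing = ["caf\u00e9.txt", "cafe\u0301.txt", "plain.txt"]
collisions = find_normalisation_collisions(listing)
```

Here collisions has a single entry whose value holds both spellings of
the first name, which the application can then warn about or reconcile.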
