I said:
I'd look at some of the more constraining, yet still
common cases, and make sure it worked reasonably
well without requiring magic. My list would be:
ext2, ext3, NFS, and Windows' NTFS (stupid short filenames,

Petr Baudis replied:
I personally don't mind getting it to work on more places, if it doesn't
make git work (measurably) worse on modern Linux systems, the code will
not go to hell, you tell me what needs to be done and preferably give me
the patches. ;-)

Okay, that's great.

The one potential issue I know of (after trying to read from the
firehose^Wlist archives) is that some are worried about poor filesystems
when there are a large number of objects in an object directory.

After doing some calculations, it seems to me that perhaps this
isn't really such a big deal, if there's a top directory such as
the 8-bit (2-hex-char) top directory currently in git-pasky.
Removing the top directory would improve performance for the better
filesystems, but would be an absolute KILLER to poorer systems, so
I'd keep the 2**8 top directory just as it is in git-pasky.
It's a compromise that means people can ease into git, and then
switch when their projects grow to large sizes.
My calculations are below, but I could be mistaken; let me
know if I'm all wet.
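
As a minimal sketch of the layout being discussed (not git-pasky's actual
code), the function below splits a 40-character hex object ID into a
2-character top directory plus the remaining characters. The names
object_path and top_dir_count are hypothetical, and the SHA-1 of a
made-up byte string is used only to obtain a plausible-looking ID.

```python
import hashlib


def object_path(hex_id: str, prefix_chars: int = 2) -> str:
    """Split a hex object ID into '<top dir>/<rest of ID>'."""
    if not 0 < prefix_chars < len(hex_id):
        raise ValueError("prefix must be shorter than the ID")
    # "/" is used as the separator regardless of platform.
    return f"{hex_id[:prefix_chars]}/{hex_id[prefix_chars:]}"


def top_dir_count(prefix_chars: int = 2) -> int:
    """Number of possible top directories for a hex prefix of this length."""
    return 16 ** prefix_chars


if __name__ == "__main__":
    # Illustrative ID only; this is not how git derives object names.
    hex_id = hashlib.sha1(b"hypothetical object contents").hexdigest()
    print(object_path(hex_id, 2), top_dir_count(2))  # 2**8 top dirs
    print(object_path(hex_id, 3), top_dir_count(3))  # 2**12 top dirs
```

Two hex characters give 16**2 = 256 = 2**8 top directories; three give
16**3 = 4096 = 2**12, which is the alternative weighed in the details.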

Does anyone know of any other issues in how git data is stored that
might cause problems for some situations?  Windows' case-insensitive/
case-preserving model for NTFS and vfat32 seems to be enough
(since the case is preserved) so that the format should work,
and you can just demand that
special git files use Unix formats ("/" as dir separator,
Unix end-of-lines).  The current implementation would need
changes to work easily on Windows (dealing with binary opens at least,
and probably rewriting the shell programs for those unwilling to
install Cygwin), but those can be done later if desired
without interfering with the interface formats.

========================= Details =========================

Basically, I'd like "git" to work on:
(1) nearly ANY system on small-to-medium projects,
    even if their filesystems do linear searches in directories,
    over a lengthy time.  Ideally it would also be usable (though poorly)
    on larger projects.
(2) work well on large projects (e.g., kernel) on _common_
    development platforms (ext2, ext3, NTFS, NFS).

It all depends on what you're optimizing for; but humor me
if those were your requirements...

Case 1:
The top (2-char) directory appears likely to make small projects
perform okay, and large projects possible, on stupid filesystems.
The one level extra directory is actually not a bad compromise
to make things "just work" on just about anything for smaller scales.
* git-pasky (a tiny project) has ~2K objects in 2 weeks; at that pace,
4K objects/month for 10 years, you'd have 480K objects.
That's absurd for even tiny projects, and it's unlikely that
a participant in a tiny project would be willing to change
filesystems just to participate.  But then if you
divide it among 256 directories = 1875 files/directory average.
Linear search is undesirable (about 1000 entry checks on
average to find each entry), but it's nowhere near the
2^16 dir entries that made people afraid.
Switching to a 2^12 top directory, you have an average of 117 entries
in each subdir (and 4096 entries at the top), yielding
an average of (117+4096)/2 = 2106 entry checks to find an entry.
* I estimated also for the big end, using the Linux kernel;
I guesstimated 36,000 objects/month for the kernel**. Over 10 years that
accumulates 4,320,000 objects, completely insane for a flat directory
on a stupid filesystem. If it has a one-level 256dir directory, that's
16875 objects/directory.  Now THAT'S painful,
though nowhere near the 2^16 limit most quoted as bad.
* For 10K objects/month, and a top dir of 2**8, you have 1,200,000
objects; each dir has about 4688 entries (average lookup: 2472 entries).
Dividing into 2**12 has 293/directory, average lookup: 2194.

On 2**12 vs. 2**8, it's not clear-cut. 2**8 works best for small
projects, 2**12 for larger.  My guess is that stupid filesystems
will tend to be used primarily only on small projects, so 2**8 might
be the better choice but that's debatable.
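
The Case 1 arithmetic can be reproduced with a short sketch. It assumes
the same simple model used in the estimates above: a linear-search
filesystem, objects spread evenly over the top directories, and an
average cost of half the top-level entries plus half the entries in one
subdirectory. The function name avg_checks is hypothetical.

```python
def avg_checks(total_objects: int, fanout_bits: int) -> float:
    """Average directory entries examined per lookup under linear search."""
    top_dirs = 2 ** fanout_bits
    per_dir = total_objects / top_dirs  # assumes an even spread
    return (top_dirs + per_dir) / 2


if __name__ == "__main__":
    months = 10 * 12
    # Objects/month figures taken from the estimates in this message.
    for per_month in (4_000, 10_000, 36_000):
        total = per_month * months
        print(f"{total:>9} objects: "
              f"2**8 -> {avg_checks(total, 8):8.1f}, "
              f"2**12 -> {avg_checks(total, 12):8.1f}")
```

Under this model the two layouts tie when the object count equals
256 * 4096 = 1,048,576; below that 2**8 needs fewer checks, above it
2**12 does. Real filesystems will differ, so treat this only as a rough
guide.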

Case 2:
Thankfully adequate systems are finally more common, and they're
common enough that for really large projects (kernel) it seems
reasonable to demand such filesystems.
Ext2 & ext3 have had htree for a while now, and it's enabled by
default on at least Fedora Core 3.  If it's off, just do:
 tune2fs -O dir_index /dev/hdFOO; e2fsck -fD /dev/hdFOO
This stuff has been around so long that it should just be
a trivial command for any developer today.
ReiserFS has hashing too.  Windows' NTFS does
tree-balancing (it appears not as good as the hashing htree
system of ext2/ext3, but it should work tolerably since it's no
longer a linear search).  One useful factoid: For good NTFS
performance with git on large projects,
you should disable short name generation on the big directories
(Microsoft recommends this when >300,000 names are in one dir).
NTFS (and VFAT32) allow filenames up to 255 chars, and
filepaths up to 260 chars, so that seems okay.
I was primarily concerned about NTFS, and that seems to have
the necessities.  This info should be in some FAQ or
documentation ("Using git for large projects").

It _seems_ to me that the NFS implementations are likely to
do similar things, but I don't know.  And I've not tested
anything on real systems, which is the real test.
Anyone know more about the limits of the NFS implementations?

More directory levels could be created to make
stupid filesystems happier, but that interferes with smart filesystems.
You could try to make filesystem layout a per-user issue,
but that makes using rsync more complicated.
A link farm could be created, though those are a pain to maintain.
It DOES turn out there are many alternatives if necessary, e.g.,
configurations per object database, or automatically "fixing"
things for a local configuration as data comes in or out,
though if you can avoid that it'd be better.

** Looking at "linux-2.4.0-to-2.6.12-rc2-patchset", I count 28237 patches;
"RCS file:" occurs 188119 times & I'll claim that that approximates the
number of different file objects IF there were no intermediate files.
If on average there are 5 versions of a file before it gets into the
mainline, and 3 commits before the final mainline patch, I get
approximately this many objects in a "real" object db:
  (28237*(3+1) trees)*2 (if #commits==#trees) + (188119*(5+1) file objs)
  = 1,354,610 objects
from 2002/02/05 to 2005/04/04 = about 36,000 objects/month.
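
The footnote's estimate can be checked with a few lines. This is only a
sketch of that arithmetic: the patch and "RCS file:" counts and the 5
versions and 3 commits guesses come from the footnote, while the function
names and the 30.4375-day average month are assumptions made here.

```python
from datetime import date


def estimate_objects(patches: int, file_changes: int,
                     versions_per_file: int = 5,
                     commits_per_patch: int = 3) -> int:
    """Rough object count: trees + commits + file objects."""
    trees = patches * (commits_per_patch + 1)
    commits = trees  # footnote's assumption: #commits == #trees
    file_objs = file_changes * (versions_per_file + 1)
    return trees + commits + file_objs


def months_between(start: date, end: date) -> float:
    """Elapsed time in average-length months (365.25 / 12 days)."""
    return (end - start).days / 30.4375


if __name__ == "__main__":
    total = estimate_objects(28237, 188119)
    months = months_between(date(2002, 2, 5), date(2005, 4, 4))
    print(total, round(total / months))
```

The monthly rate is sensitive to the guessed multipliers, so changing
versions_per_file or commits_per_patch is an easy way to see how far the
36,000 figure could move.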

Am I missing anything?

--- David A. Wheeler