On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
> > On Oct 18, 2019, at 9:10 PM, Richard Yao <r...@gentoo.org> wrote:
> > 
> > 
> > > > On Oct 18, 2019, at 4:49 PM, Michał Górny <mgo...@gentoo.org> wrote:
> > > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny <mgo...@gentoo.org> wrote:
> > > > > > > > Hi, everybody.
> > > > > > > > It is my pleasure to announce that yesterday (EU) evening we
> > > > > > > > switched to a new distfile mirror layout.  Users will be
> > > > > > > > switching to the new layout either as they upgrade Portage to
> > > > > > > > 2.3.77 or -- if they upgraded already -- as their caches expire
> > > > > > > > (24hrs).
> > > > > > > > The new layout is mostly a bow towards mirror admins, for some
> > > > > > > > of whom having 60000+ files in a single directory has been a
> > > > > > > > problem.  However, I suppose some of you also found e.g. the
> > > > > > > > directory index hardly usable due to its size.
> > > > This sounds like a filesystem issue. Do we know which filesystems are 
> > > > suffering?
> > > > ZFS should be fine. I believe ext2/ext3 have problems with this many 
> > > > files. ext4 is probably okay, but don’t quote me on that.
> > > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> > > may apply only to older ntfs versions.  NFS has been mentioned too.
> > 
> > ext2 and vfat are not surprises to me (outside of the idea that anyone 
> > would use them for a mirror). NTFS and NFS are though.
> > > However, just because modern filesystems can handle them efficiently, it
> > > doesn't mean having directories that huge comes with zero cost.
> > While I am okay with the change, what do you mean when you say that having 
> > huge directories does not come with zero cost?
> > 
> > Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
> > this, but the impact should be negligible. Filesystems with O(log n) 
> > directory lookups would see faster directory lookups.
> > 
> > Outside of directory lookups, this could speed up searches and sort 
> > operations when listing everything, with just about any filesystem 
> > benefiting from the improvement.
> > 
> > Listing directories on such filesystems should not benefit from this unless 
> > you are using ls where the default behavior is to sort the directory 
> > contents (which is where the improvement when sorting comes into play). The 
> > need to sort the directory contents by default keeps ls from displaying 
> > anything until it has scanned the entire directory. The asymptotic 
> > complexity of a fast comparison-based sort improves in this situation from 
> > O(n log n) to O(n log(n/b)), where b is the number of subdirectories, 
> > provided that you sort each subdirectory independently. A further speedup 
> > could be obtained by using multiple threads to parallelize the sort 
> > operations.
> I read your original email late at night and I misread the description of how 
> this works.
> 
> At an initial glance, I thought we were doing a prefix approach (with the 
> caveat that buckets are unbalanced). In reality, we are doing a cryptographic 
> hash of the filenames.
> 
> That would keep all buckets balanced, which gives the best directory lookup 
> times on O(log n) lookup filesystems, but I think there is something to be 
> gained from using the less optimal approach of using filename prefixes:
> 
> * some regex searches on distfiles can be accelerated
> * generating a sorted list of all distfiles becomes asymptotically faster
> * it is easy for a user to find all versions of a given distfile
> * no need to calculate a cryptographic hash
> 
> I realize that I am late to propose it, but could we consider a switch to 
> this alternative arrangement?

No, we can't.  Please read either the original discussion on the bug, or
the linked article.  It's explained in detail why this won't work.
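
For anyone skimming the thread, here is a rough Python sketch of the two
bucketing schemes being compared above.  The function names are made up for
illustration, and the choice of BLAKE2b with a two-hex-digit bucket name is
an assumption of this sketch, not something stated in this thread; the bug
has the authoritative details.

```python
import hashlib
from collections import Counter


def hash_bucket(filename: str, hex_digits: int = 2) -> str:
    """Bucket by the leading hex digits of a hash of the filename.

    Assumption: BLAKE2b and a 2-digit cutoff are illustrative choices only.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


def prefix_bucket(filename: str, length: int = 1) -> str:
    """Bucket by the first characters of the filename (the prefix idea)."""
    return filename[:length].lower()


def bucket_sizes(filenames, bucket_fn) -> Counter:
    """Count how many filenames land in each bucket."""
    return Counter(bucket_fn(name) for name in filenames)


if __name__ == "__main__":
    # Hypothetical distfile names, chosen to share a common prefix.
    names = [
        "python-3.7.4.tar.xz",
        "python-3.8.0.tar.xz",
        "pycairo-1.18.1.tar.gz",
        "perl-5.30.0.tar.xz",
        "zlib-1.2.11.tar.gz",
    ]
    print("prefix:", dict(bucket_sizes(names, prefix_bucket)))
    print("hash:  ", dict(bucket_sizes(names, hash_bucket)))
```

With prefixes, every version of a package lands in the same directory, which
is the convenience mentioned above, but the bucket sizes follow whatever
letters distfile names happen to start with.  With a hash, the bucket depends
only on the digest, so it is not tied to how names are distributed.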

-- 
Best regards,
Michał Górny
