On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
> > On Oct 18, 2019, at 9:10 PM, Richard Yao <r...@gentoo.org> wrote:
> >
> > > On Oct 18, 2019, at 4:49 PM, Michał Górny <mgo...@gentoo.org> wrote:
> > > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny <mgo...@gentoo.org> wrote:
> > > > > > > > Hi, everybody.
> > > > > > > >
> > > > > > > > It is my pleasure to announce that yesterday (EU) evening
> > > > > > > > we've switched to a new distfile mirror layout. Users will
> > > > > > > > be switching to the new layout either as they upgrade
> > > > > > > > Portage to 2.3.77 or -- if they upgraded already -- as
> > > > > > > > their caches expire (24hrs).
> > > > > > > >
> > > > > > > > The new layout is mostly a bow towards mirror admins, for
> > > > > > > > some of whom having 60000+ files in a single directory has
> > > > > > > > been a problem. However, I suppose some of you also found
> > > > > > > > e.g. the directory index hardly usable due to its size.
> > > >
> > > > This sounds like a filesystem issue. Do we know which filesystems
> > > > are suffering?
> > > >
> > > > ZFS should be fine. I believe ext2/ext3 have problems with this
> > > > many files. ext4 is probably okay, but don’t quote me on that.
> > >
> > > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose
> > > this may apply only to older ntfs versions. NFS has been mentioned
> > > too.
> >
> > ext2 and vfat are not surprises to me (outside of the idea that anyone
> > would use them for a mirror). NTFS and NFS are though.
> >
> > > However, just because modern filesystems can handle them
> > > efficiently, it doesn't mean having directories that huge comes with
> > > zero cost.
> >
> > While I am okay with the change, what do you mean when you say that
> > having huge directories does not come with zero cost?
> >
> > Filesystems with O(1) directory lookups like ZFS would probably be hurt
> > by this, but the impact should be negligible. Filesystems with O(log n)
> > directory lookups would see faster directory lookups.
> >
> > Outside of directory lookups, this could speed up searches and sort
> > operations when listing everything with just about any filesystem
> > benefiting from the improvement.
> >
> > Listing directories on such filesystems should not benefit from this
> > unless you are using ls where the default behavior is to sort the
> > directory contents (which is where the improvement when sorting comes
> > into play). The need to sort the directory contents by default keeps ls
> > from displaying anything until it has scanned the entire directory. The
> > asymptotic complexity of a fast comparison-based sort improves in this
> > situation from O(n log n) to O(n log(n/b)) provided that you sort each
> > subdirectory independently. A further speed up could be obtained by
> > doing multithreading to parallelize the sort operations.
>
> I read your original email late at night and I misread the description of
> how this works.
>
> At an initial glance, I thought we were doing a prefix approach (with the
> caveat that buckets are unbalanced). In reality, we are doing a
> cryptographic hash of the filenames.
>
> That would keep all buckets balanced, which gives the best directory
> lookup times on O(log n) lookup filesystems, but I think there is
> something to be gained from using the less optimal approach of using
> filename prefixes:
>
> * some regex searches on distfiles can be accelerated
> * generating a sorted list of all distfiles becomes asymptotically faster
> * it is easy for a user to find all versions of a given distfile
> * no need to calculate a cryptographic hash
>
> I realize that I am late to propose it, but could we consider a switch to
> this alternative arrangement?

No, we can't. Please read either the original discussion on the bug, or
the linked article. It's explained in detail why this won't work.

-- 
Best regards,
Michał Górny
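
For readers who want to compare the two bucketing schemes discussed in this thread (hashing the filename versus using a filename prefix), the following is a minimal Python sketch, not the actual mirror implementation. It assumes BLAKE2b and a two-hex-digit bucket name, neither of which is stated in this message; the function names and the package list are hypothetical and exist only for illustration.

```python
import hashlib
from collections import Counter


def hash_bucket(filename: str, hex_digits: int = 2) -> str:
    """Bucket name taken from the leading hex digits of a filename hash.

    Assumption: BLAKE2b over the UTF-8 filename. The thread only says
    "a cryptographic hash of the filenames".
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


def prefix_bucket(filename: str, length: int = 2) -> str:
    """Bucket name taken from the first characters of the filename itself."""
    return filename[:length].lower()


def bucket_sizes(names, bucket_fn) -> Counter:
    """Count how many filenames land in each bucket."""
    return Counter(bucket_fn(name) for name in names)


# Hypothetical distfile names: 5 packages x 400 versions each.
names = [
    f"{pkg}-{major}.{minor}.tar.gz"
    for pkg in ("python", "perl", "php", "pcre", "zlib")
    for major in range(20)
    for minor in range(20)
]

by_hash = bucket_sizes(names, hash_bucket)
by_prefix = bucket_sizes(names, prefix_bucket)

print("hash:  ", len(by_hash), "buckets, largest holds", max(by_hash.values()))
print("prefix:", len(by_prefix), "buckets, largest holds", max(by_prefix.values()))
```

With these made-up names, the prefix scheme yields five buckets of 400 files each, while the hashed names spread over many more, much smaller buckets. Real distfile names will give different numbers.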