Re: [gentoo-dev] New distfile mirror layout

Richard Yao Sat, 19 Oct 2019 12:26:57 -0700

> On Oct 18, 2019, at 9:10 PM, Richard Yao <r...@gentoo.org> wrote:
> 
> 
>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <mgo...@gentoo.org> wrote:
>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <mgo...@gentoo.org> wrote:
>>>>>>> Hi, everybody.
>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>> already -- as their caches expire (24hrs).
>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>> having a 60000+ files in a single directory have been a problem.
>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>> hardly usable due to its size.
>>> This sounds like a filesystem issue. Do we know which filesystems are 
>>> suffering?
>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. 
>>> ext4 is probably okay, but don’t quote me on that.
>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>> may apply only to older ntfs versions.  NFS has been mentioned too.
> 
> ext2 and vfat are not surprises to me (outside of the idea that anyone would 
> use them for a mirror). NTFS and NFS are though.
>> However, just because modern filesystems can handle them efficiently, it
>> doesn't mean having directories that huge comes with zero cost.
> While I am okay with the change, what do you mean when you say that having 
> huge directories does not come with zero cost?
> 
> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
> this, but the impact should be negligible. Filesystems with O(log n) 
> directory lookups would see faster directory lookups.
> 
> Outside of directory lookups, this could speed up up searches and sort 
> operations when listing everything with just about any filesystem benefiting 
> from the improvement.
> 
> Listing directories on such filesystems should not benefit from this unless 
> you are using ls where the default behavior is to sort the directory contents 
> (which is where the improvement when sorting comes into play). The need to 
> sort the directory contents by default keeps ls from displaying anything 
> until it has scanned the entire directory. The asymptotic complexity of a 
> fast comparison based sort improves in this situation from O(nlogn) to 
> O(nlog(n/b)) provided that you sort each subdirectory independently. A 
> further speed up could be obtained by doing multithreading to parallelize the 
> sort operations.
I read your original email late at night and I misread the description of how 
this works.

At an initial glance, I thought we were doing a prefix approach (with the 
caveat that buckets are unbalanced). In reality, we are doing a cryptographic 
hash of the filenames.

That would keep all buckets balanced, which gives the best directory lookup 
times on O(log n) lookup filesystems, but I think there is something to be 
gained from using the less optimal approach of using filename prefixes:

* some regex searches on distfiles can be accelerated
* generating a sorted list of all distfiles becomes asymptotically faster
* it is easy for a user to find all versions of a given distfile
* no need to calculate a cryptographic hash

I realize that I am late to propose it, but could we consider a switch to this 
alternative arrangement?

The bulk of the performance gain should be realized with either approach.

> Since I know someone will call me out on that comment, I will explain. Each 
> bucket has roughly n/b items in it where n is the total number and b is the 
> number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each 
> of the b buckets. The buckets are pre-sorted by prefix, so the result is now 
> sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) 
> comparison sort on this very special case where you call it multiple times on 
> data that has been persorted by prefix into buckets.
> 
> Is there any other benefit to this or did I get everything?
> 
> By the way, it is offtopic for the thread, but it occurs to me that a hybrid 
> of radix sort and A comparison based sort could give us a general sorting 
> algorithm that is asymptotically faster than O(nlogn).
>> [1] https://bugs.gentoo.org/534528
>> --
>> Best regards,
>> Michał Górny
Re: [gentoo-dev] New distfile mirror layout

Reply via email to