> On Oct 18, 2019, at 9:10 PM, Richard Yao <r...@gentoo.org> wrote:
>
>
>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <mgo...@gentoo.org> wrote:
>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <mgo...@gentoo.org> wrote:
>>>>>>> Hi, everybody.
>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>>>> to a new distfile mirror layout. Users will be switching to the new
>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>> already -- as their caches expire (24hrs).
>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>> having a 60000+ files in a single directory have been a problem.
>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>> hardly usable due to its size.
>>> This sounds like a filesystem issue. Do we know which filesystems are
>>> suffering?
>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files.
>>> ext4 is probably okay, but don’t quote me on that.
>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>> may apply only to older ntfs versions. NFS has been mentioned too.
>
> ext2 and vfat are not surprises to me (outside of the idea that anyone would
> use them for a mirror). NTFS and NFS are though.
>> However, just because modern filesystems can handle them efficiently, it
>> doesn't mean having directories that huge comes with zero cost.
> While I am okay with the change, what do you mean when you say that having
> huge directories does not come with zero cost?
>
> Filesystems with O(1) directory lookups like ZFS would probably be hurt by
> this, but the impact should be negligible. Filesystems with O(log n)
> directory lookups would see faster directory lookups.
>
> Outside of directory lookups, this could speed up up searches and sort
> operations when listing everything with just about any filesystem benefiting
> from the improvement.
>
> Listing directories on such filesystems should not benefit from this unless
> you are using ls where the default behavior is to sort the directory contents
> (which is where the improvement when sorting comes into play). The need to
> sort the directory contents by default keeps ls from displaying anything
> until it has scanned the entire directory. The asymptotic complexity of a
> fast comparison based sort improves in this situation from O(nlogn) to
> O(nlog(n/b)) provided that you sort each subdirectory independently. A
> further speed up could be obtained by doing multithreading to parallelize the
> sort operations.
I read your original email late at night and I misread the description of how
this works.
At an initial glance, I thought we were doing a prefix approach (with the
caveat that buckets are unbalanced). In reality, we are doing a cryptographic
hash of the filenames.
That would keep all buckets balanced, which gives the best directory lookup
times on O(log n) lookup filesystems, but I think there is something to be
gained from using the less optimal approach of using filename prefixes:
* some regex searches on distfiles can be accelerated
* generating a sorted list of all distfiles becomes asymptotically faster
* it is easy for a user to find all versions of a given distfile
* no need to calculate a cryptographic hash
I realize that I am late to propose it, but could we consider a switch to this
alternative arrangement?
The bulk of the performance gain should be realized with either approach.
> Since I know someone will call me out on that comment, I will explain. Each
> bucket has roughly n/b items in it where n is the total number and b is the
> number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each
> of the b buckets. The buckets are pre-sorted by prefix, so the result is now
> sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn)
> comparison sort on this very special case where you call it multiple times on
> data that has been persorted by prefix into buckets.
>
> Is there any other benefit to this or did I get everything?
>
> By the way, it is offtopic for the thread, but it occurs to me that a hybrid
> of radix sort and A comparison based sort could give us a general sorting
> algorithm that is asymptotically faster than O(nlogn).
>> [1] https://bugs.gentoo.org/534528
>> --
>> Best regards,
>> Michał Górny