> On Oct 19, 2019, at 4:03 PM, Michał Górny <mgo...@gentoo.org> wrote:
> 
> On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 9:10 PM, Richard Yao <r...@gentoo.org> wrote:
>>> 
>>> 
>>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <mgo...@gentoo.org> wrote:
>>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <mgo...@gentoo.org> wrote:
>>>>>>>>> Hi, everybody.
>>>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've 
>>>>>>>>> switched
>>>>>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>>>> already -- as their caches expire (24hrs).
>>>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>>>> having a 60000+ files in a single directory have been a problem.
>>>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>>>> hardly usable due to its size.
>>>>> This sounds like a filesystem issue. Do we know which filesystems are 
>>>>> suffering?
>>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many 
>>>>> files. ext4 is probably okay, but don’t quote me on that.
>>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>>> may apply only to older ntfs versions.  NFS has been mentioned too.
>>> 
>>> ext2 and vfat are not surprises to me (outside of the idea that anyone 
>>> would use them for a mirror). NTFS and NFS are though.
>>>> However, just because modern filesystems can handle them efficiently, it
>>>> doesn't mean having directories that huge comes with zero cost.
>>> While I am okay with the change, what do you mean when you say that having 
>>> huge directories does not come with zero cost?
>>> 
>>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
>>> this, but the impact should be negligible. Filesystems with O(log n) 
>>> directory lookups would see faster directory lookups.
>>> 
>>> Outside of directory lookups, this could speed up up searches and sort 
>>> operations when listing everything with just about any filesystem 
>>> benefiting from the improvement.
>>> 
>>> Listing directories on such filesystems should not benefit from this unless 
>>> you are using ls where the default behavior is to sort the directory 
>>> contents (which is where the improvement when sorting comes into play). The 
>>> need to sort the directory contents by default keeps ls from displaying 
>>> anything until it has scanned the entire directory. The asymptotic 
>>> complexity of a fast comparison based sort improves in this situation from 
>>> O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory 
>>> independently. A further speed up could be obtained by doing multithreading 
>>> to parallelize the sort operations.
>> I read your original email late at night and I misread the description of 
>> how this works.
>> 
>> At an initial glance, I thought we were doing a prefix approach (with the 
>> caveat that buckets are unbalanced). In reality, we are doing a 
>> cryptographic hash of the filenames.
>> 
>> That would keep all buckets balanced, which gives the best directory lookup 
>> times on O(log n) lookup filesystems, but I think there is something to be 
>> gained from using the less optimal approach of using filename prefixes:
>> 
>> * some regex searches on distfiles can be accelerated
>> * generating a sorted list of all distfiles becomes asymptotically faster
>> * it is easy for a user to find all versions of a given distfile
>> * no need to calculate a cryptographic hash
>> 
>> I realize that I am late to propose it, but could we consider a switch to 
>> this alternative arrangement?
> 
> No, we can't.  Please read either the original discussion on the bug, or
> the linked article.  It's explained in detail why this won't work.
Alright. I am convinced. Thanks.
> 
> -- 
> Best regards,
> Michał Górny
> 


Reply via email to