On 2019/10/21 18:42, Richard Yao wrote:
If we consider access frequency, it might actually not be that bad.
Consider a simple example with 500 files and two directory buckets. If we have
250 in each, then the size of the directory is always 250. However, if 50 files
are accessed 90% of the time, then putting the other 450 into one directory and
those 50 into another, we end up with the expected performance of the O(n)
directory lookup being consistent with there being only 90 files in each
directory (worked out below).
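As a quick sanity check of that expectation (the 0.9/0.1 weights and the
50/450 bucket sizes are just the figures from the example above):

  $ awk 'BEGIN { print 0.9*50 + 0.1*450 }'
  90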
I am not sure we should discard all other considerations just to benefit
filesystems with O(n) directory lookups, but if we are, then the hashing
approach is not necessarily the best one. It is only best when all files are
accessed with equal frequency, which would be an incorrect assumption. A more
human-friendly approach might still be better. I doubt that we have the data
to determine that, though.
Another idea is to use a cheap hash function (e.g. fletcher) and just have the
mirrors do the hashing behind the scenes. Then we would have the best of both
worlds.
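Roughly like this, as a sketch of what a mirror could do behind the scenes -
cksum's CRC is just a stand-in for fletcher (coreutils ships no fletcher
tool), and the 256-bucket split is an arbitrary example:

  #!/bin/sh
  # Map a flat distfile name to a hashed bucket path on the mirror side.
  f="$1"
  crc=$(printf '%s' "$f" | cksum | cut -d' ' -f1)   # cheap checksum of the name
  printf '%02x/%s\n' "$((crc % 256))" "$f"          # e.g. "a3/foo-1.0.tar.gz"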
ext4 sucks at name lookups without the dir_index feature (O(n)
lookups - it scans all entries in the folder). With dir_index, readdir
performance is crap. Pick your poison, I guess. On most of our larger
filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
disabling dir_index, as the benefit is outweighed by the crappy readdir()
and glob() performance.
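For reference, dir_index can be toggled offline with tune2fs; a sketch,
assuming an unmounted ext4 filesystem on a hypothetical /dev/sdX1:

  # disable the dir_index feature, then rebuild/optimize directories
  tune2fs -O ^dir_index /dev/sdX1
  e2fsck -fD /dev/sdX1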
There doesn't seem to be a specific tip-over point, and it seems to
depend a lot on RAM availability and harddrive speed (obviously). So if
dentries get cached, disk speed becomes less of an issue. However, on
large folders (where I typically use 10k entries as my threshold for "large",
based on "gut feeling" and "unquantifiable experience" and "nothing scientific
at all") I find that even with lots of RAM, two consecutive ls commands
remain terribly slow. Switch off dir_index and that becomes an order of
magnitude faster.
I don't have a great deal of experience with XFS, but on those systems
where we do use it, it's generally in a VM, and our experience has been that
it feels slower. Again, not scientific, just perception.
I'm in support of the change. This will bucket distfiles into 256 folders and
should give a reasonably even split between them. If required, a
second layer could be introduced by using the 3rd and 4th digits of the
hash. Any hash should be fine; it really doesn't need to be
cryptographically strong, it just needs to provide a good spread and be
really fast. Generally a hash table should have a prime number of buckets
to counter hash bias, but frankly, that's overcomplicating the situation
here. A rough sketch of the layout follows.
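Something like this, as a minimal sketch - b2sum from coreutils stands in for
whichever fast hash ends up being chosen, and hashing the filename itself is
an assumption on my part:

  #!/bin/bash
  # Derive one- and two-layer bucket paths from the first hex digits of a hash.
  f="$1"
  h=$(printf '%s' "$f" | b2sum | cut -c1-4)   # first 4 hex digits of the digest
  echo "${h:0:2}/${f}"                        # layer 1: 256 buckets
  echo "${h:0:2}/${h:2:2}/${f}"               # optional layer 2 from digits 3-4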
I also agree with others that it used to be easy to get distfiles as and
when needed, so an alternative structure could mirror that of the
portage tree itself, in other words "cat/pkg/distfile". This perhaps
just shifts the issue:
jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l
jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
So that's an average of 116 subfolders under the top layer (only two over
1000), and then presumably fewer than 100 distfiles maximum per package?
Probably overkill, but it would (should) solve both the too-many-files-per-
folder issue and the easy-lookup-by-hand issue.
I don't have a strong preference for either solution, but I do agree that
"easy finding of distfiles" is handy. The INDEX mechanism is fine for me.