(Bcc bugs) Recently, we've seen that the unix CM's cache tracking figures for a ZFS cache can be very wrong. I know the tracked cache usage value (the 'fs getcacheparms' value) was never 100% accurate, but with certain ZFS configurations, it can be wrong on the order of multiples of the cacheinfo size itself. (note that this is a different problem than the one fixed in gerrit 338)
The discrepancy we're hitting can be seen plainly by doing this on a ZFS filesystem with default settings:

  dd if=/dev/urandom of=somefile bs=1024 count=1024
  sleep 10
  dd if=/dev/urandom of=somefile bs=1 count=1
  sleep 10
  stat somefile | grep Size
    Size: 1          Blocks: 261        IO Block: 131072   regular file

So, a file that is 1M and then truncated down to 1 byte still takes up 130k-ish of disk space. Now, when the CM truncates a 1M cache file to something under 130k, it records that file as taking up the file length rounded up to the next kb, which is, well, a lot smaller than what it actually takes up. In the absolute worst case, I think we could take up 5 times the cacheinfo size on disk (128k for each cache file, cachesize/32k cache files by default). While that's unlikely to hit, we have already seen it go over by a gig or two on a cache smaller than 4G.

Now, this is with a recordsize of 128k (the default, I believe). Changing the recordsize to something smaller obviously makes smaller files take up less space. With recordsize=1k, a 1-byte (or 1k) file appears to take up only 5k. But this has the downside of larger files causing more overhead (a 1M file takes up about 1122k).

I'm not sure what to do about this. Does anyone reading this know enough about ZFS internals to shed some light on it? I've got a few potential directions to go in, though:

(A) If someone can provide an equation that says "if a file is X bytes long, and we have a recordsize of Y, then the file will take up at most Z bytes on disk", we could make a special case in the cache-tracking logic for ZFS. The recordsize appears to be obtainable via the statvfs blocksize. (A rough sketch of what I mean is appended after my sig.)

(B) If someone knows of an in-kernel way to tell a file in ZFS not to behave like this, we could make a certain call on the vnode. For example, just creating a 1-byte file does not take up 130k; it's only when you make a large file and truncate it down, so there may be a way to make those two cases equivalent size-wise. The brute-force-y solution would be to unlink files instead of truncating them in certain cases, but that seems suboptimal.

(C) We simply use the blocksize instead of the fragsize to calculate afs_fsfragsize in the special ZFS case? This still seems like it would result in incorrect cache tracking, but maybe it's good enough. (Also sketched after my sig, along with (D).)

(D) Force afs_fsfragsize to 128k-1 for ZFS, but that's obviously horribly inefficient for most cases.

Or the obvious (E): tell people not to use ZFS disk caches.

-- 
Andrew Deason
[email protected]
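To make (A) a little more concrete, here is a rough, untested userspace sketch of the kind of special case I'm imagining. The rounding rule in it (charge at least one full record for any nonempty file, then round up to a record boundary) is only a guess based on the dd/stat behavior above, not anything derived from ZFS internals, and it assumes statvfs's f_bsize really does reflect the recordsize; the real CM would use its in-kernel statvfs equivalent rather than statvfs(2), and hook this into the existing cache usage accounting.

#include <sys/statvfs.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Guess at "file of 'length' bytes takes at most this much disk space
 * on a ZFS cache partition".  Assumes f_bsize == recordsize.
 */
static long
zfs_usage_estimate(const char *cachedir, long length)
{
    struct statvfs sv;
    long recsize;

    if (statvfs(cachedir, &sv) != 0 || sv.f_bsize == 0)
        return length;              /* fall back to plain file length */

    recsize = (long)sv.f_bsize;     /* assumed: this is the recordsize */

    if (length <= 0)
        return 0;
    if (length <= recsize)
        return recsize;             /* charge at least one full record */

    /* otherwise round up to the next record boundary */
    return ((length + recsize - 1) / recsize) * recsize;
}

int
main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <cachedir> <length>\n", argv[0]);
        return 1;
    }
    printf("%ld\n", zfs_usage_estimate(argv[1], atol(argv[2])));
    return 0;
}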

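And for (C)/(D), the choice would be roughly along these lines. The "size minus one" convention is inferred from (D)'s 128k-1, the assumption that the non-ZFS value comes from the statvfs fragsize is taken from (C), and how we would detect ZFS at all is hand-waved entirely:

#include <stdio.h>
#include <sys/statvfs.h>

/* Sketch of picking afs_fsfragsize for the cache partition. */
static long
pick_fsfragsize(const struct statvfs *sv, int is_zfs, int worst_case)
{
    if (!is_zfs)
        return (long)sv->f_frsize - 1;   /* assumed current behavior: fragsize */

    if (worst_case)
        return (128 * 1024) - 1;         /* (D): always assume a full 128k record */

    return (long)sv->f_bsize - 1;        /* (C): use the blocksize (recordsize) */
}

int
main(int argc, char **argv)
{
    struct statvfs sv;

    if (argc != 2 || statvfs(argv[1], &sv) != 0) {
        fprintf(stderr, "usage: %s <cachedir>\n", argv[0]);
        return 1;
    }
    printf("non-zfs: %ld  zfs (C): %ld  zfs (D): %ld\n",
           pick_fsfragsize(&sv, 0, 0),
           pick_fsfragsize(&sv, 1, 0),
           pick_fsfragsize(&sv, 1, 1));
    return 0;
}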