> BTW: on decent machines an individual 1 GiB write does not make the user wait: on write, the data is first copied into the AFS file's mapping, and later into the cache file's mapping (the former step can be avoided by writing into the chunk files directly). On reads, the reader is woken up on every RX packet, ensuring streaming to the user. Here again, the double copy can be avoided.
What happens here depends on the VM model of the machine, and on how we interact with it. But on Linux, at least, this isn't strictly true. Here's how things work in 1.4:
There are two different codepaths: one for writes from the write() syscall, and one invoked when a page that is mmap'd gets written to. With write()s, what we currently do is prepare a page for the kernel; the kernel then takes care of copying the buffer passed by the user into that page, and lets us know when it has completed. We then take the data from that page and do a write() of it against the backing store. Only then do we return control to the user, who has had to wait whilst all of this occurs. In the background, the pdflush process then takes care of outputting this data to disk.
With mmap, things are a little different. pdflush is in charge of our writing and, at intervals, will call our writepage() operation on pages that the user has dirtied. This all happens completely behind the scenes. We then write the AFS dirty page out into the backing store (by using that store's write command), and it's scheduled for another background flush.
In 1.5 this is streamlined a little by working only at the page level, which avoids some context switches and copies. As I noted in an earlier email, we also do more in the background in order to get control back to the user quicker. One further optimisation is that we shouldn't be doing the write to the backing cache from the write() syscall at all. All write() is supposed to do is copy the data from the user into the filesystem's mapping and mark the page dirty; it should then be up to the pdflush process to move this out to the backing store. I intend to revisit this at some point, but my previous attempts have resulted in a cache manager that is very prone to deadlocks.
As you note, our Linux implementation creates two copies of the data: one in AFS's mapping, the other in the backing files. However, we cannot easily get rid of this duplication - there's no simple mechanism for bypassing the VM and 'writing into the chunk files directly'. Using direct I/O would be a possibility, but we'd need to handle doing this in the background, otherwise the user would end up having to wait until the chunk files actually made it to disk, and it would limit the range of filesystems we can use as a backing cache.
S.

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
