On 10/14/2013 09:38 PM, Saso Kiselkov wrote:
> On 10/15/13 2:32 AM, Matthew Ahrens wrote:
>> On Mon, Oct 14, 2013 at 5:45 PM, Saso Kiselkov <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     In that case, however, I have to ask why the original author put the
>>     retry mechanism in with the pressure valve (halving the memory
>>     request size each iteration) in the first place...
>>
>> I think I wrote that code a decade ago... The idea was probably that
>> there might not be enough physical memory free -- nothing about
>> fragmentation. This was back when you could boot off of UFS and then
>> load the zfs kernel module later (frequently used while debugging). But
>> I'm not sure that the retry was ever exercised. I can't imagine it has
>> been used in the past 5 years, since (AFAIK) solaris & illumos distros
>> only boot from ZFS.
>
> Actually there are plenty of distros which boot from UFS (SmartOS being
> one famous example), but since ZFS is pretty much the only high-data-
> volume-capable filesystem on Illumos, I doubt anybody will avoid loading
> ZFS at some early point in boot. Perhaps if you're running compute-only
> and might occasionally load ZFS after that, but that's really stretching
> my imagination.
>
>> Linux (and FreeBSD?) may be a different matter. Since the zfs kernel
>> module can be loaded much after boot, there may be insufficient memory.
>> Don't know how likely that is, though. Nor do I know how memory
>> fragmentation (virtual or physical) might come into play on those
>> platforms.
>
> Then I guess we should reach out to these guys.
> FreeBSD & Linux folks: comments please?
>
> (Or should we post elsewhere if there are too few people subscribed to
> this list yet?)
I was drafting a response when this email arrived in my inbox. I will CC Brian to ensure this gets his attention too.

Under ideal circumstances, where you have a 64-bit kernel and 1GB worth of free pages in the normal zone (not to be confused with Solaris zones), a 1GB virtual memory allocation is possible under Linux. However, should any of those conditions be violated, such an allocation can no longer be guaranteed.

The virtual address space on Linux is split between the kernel and userland. The Linux kernel developers made the conscious decision to severely cripple the virtual memory allocator "for performance reasons": only a small portion of the kernel virtual address space is available for virtual memory allocations. On 64-bit Linux this is not a problem, because the kernel virtual address space is *huge*, but on 32-bit Linux it is a major problem. Additionally, Linux has no code to respond to memory pressure when we are out of virtual memory address space, so on 32-bit Linux things go south fast. These two things are the precise reasons why ZFSOnLinux is not reliable on 32-bit systems at this time.

A final area in which virtual memory allocations on Linux are crippled is that page directory entries are allocated using the equivalent of KM_SLEEP, which makes the use of kernel virtual memory for anything important practically impossible and has led to a number of hacks in the current ZFSOnLinux Solaris Porting Layer (SPL) that we would all like to retire once a major refactoring of the ZIO buffers is done.

To fully answer your question, I would need to tell you what happens on 64-bit Linux when we do not have enough physical memory available to satisfy a kernel virtual memory allocation. Unfortunately, I do not know the answer; I would need to set aside at least an afternoon to read the Linux source code to figure it out.
I am inclined to expect it to block indefinitely, but the following blog post, which I found via Google, suggests that the kernel will actually try to satisfy a large virtual memory allocation to the point where it is willing to kill all userland programs and deadlock:

http://kaiwantech.wordpress.com/2011/08/17/kmalloc-and-vmalloc-linux-kernel-memory-allocation-api-limits/

I hope to get ZFS into a reasonable state on 32-bit Linux soon with patches that restructure the ZIO buffers to use arrays of pages instead of slab objects. That should also eliminate the need for the hacks that currently exist in the SPL and get our allocation sizes down to no more than two contiguous 4KB pages (although I would be much happier to get everything below one 4KB page). With that in mind, I would greatly prefer to see the hash table implemented inside a b-tree to avoid gigabyte-sized memory allocations (i.e. doing virtual memory manually). Illumos can be expected to handle such large allocations, even on 32-bit systems, but other kernels cannot.
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
