Re: directory listing hangs in ufs state
On Wed, Dec 21, 2011 at 09:03:02PM +0400, Andrey Zonov wrote:
On 15.12.2011 17:01, Kostik Belousov wrote:
On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick free...@jdc.parodius.com wrote:
On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
On 14.12.2011 22:22, Jeremy Chadwick wrote:
On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:

Hi Jeremy, this is not a hardware problem; I've already checked that. I also ran fsck today and got no errors. After some more exploration of how mongodb works, I found that when the listing hangs, one of the mongodb threads is in biowr state for a long time. It periodically calls msync(MS_SYNC), according to the ktrace output. If I remove the msync() calls from mongodb, how often will the data be synced by the OS? -- Andrey Zonov

On 14.12.2011 2:15, Jeremy Chadwick wrote:
On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote: Do you have any ideas what is going on, or how to catch the problem?

Assuming this isn't a file on the root filesystem, try booting the machine in single-user mode and using fsck -f on the filesystem in question. Can you verify there are no problems with the disk this file lives on as well (smartctl -a /dev/disk)? I doubt this is the problem, but thought I'd mention it. I have no real answer, I'm sorry. msync(2) indicates it's effectively deprecated (see BUGS). It looks like this is effectively an mmap version of fsync(2).

I replaced msync(2) with fsync(2). Unfortunately, from the man pages it is not obvious that I can do this. Anyway, thanks.

Sorry, that wasn't what I was implying. Let me try to explain differently. msync(2) looks, to me, like an mmap-specific version of fsync(2). Based on the man page, it seems that with msync() you can effectively guarantee flushing of certain pages within an mmap()'d region to disk. fsync() would cause **all** buffers/internal pages to be flushed to disk.

One would need to look at the mongodb code to find out what it's actually doing with msync(). That is to say, if it's doing something like this (I probably have the semantics wrong -- I've never spent much time with mmap()):

    fd = open("/some/file", O_RDWR);
    ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    ret = msync(ptr, 65536, MS_SYNC);

then this, to me, would be mostly equivalent to:

    fd = open("/some/file", O_RDWR);
    ret = fsync(fd);

Otherwise, if it's calling msync() only on an address/location within the region ptr points to, then that may be more efficient (fewer pages to flush).

They call msync() for the whole file, so there will not be any difference.

The mmap() arguments -- specifically flags (see the man page) -- also play a role here. The one that catches my attention is MAP_NOSYNC. So you may need to look at the mongodb code to figure out what its mmap() call is. One might wonder why they don't just use open() with O_SYNC. I imagine that has to do with, again, performance; possibly they don't want all I/O synchronous, and would rather flush certain pages in the mmap'd region to disk as needed. I see the legitimacy in that approach (vs. just using O_SYNC). There's really no easy way for me to tell you which is more efficient without spending a lot of time with a benchmarking program that tests all of this, *plus* an entire system (world) built with profiling.

I ran mongodb with fsync() for two hours and got the following:

    STARTED                   INBLK OUBLK MAJFLT MINFLT
    Thu Dec 15 10:34:52 2011  3 192744314 3080182

This is the output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'. Then I ran it with the default msync():

    STARTED                   INBLK OUBLK MAJFLT MINFLT
    Thu Dec 15 12:34:53 2011  0 7241555 79 5401945

There are also two graphs of disk busyness [1] [2]. The difference is significant: 37 times! That's what I expected to get.

In the comments for vm_object_page_clean() I found this:

    * When stuffing pages asynchronously, allow clustering.  XXX we need a
    * synchronous clustering mode implementation.

To me this means that msync(MS_SYNC) flushes every page to disk in a single IO transaction. If we multiply 4K by 37 we get about 150K; that number is the size of a single transaction in my experience. +alc@, kib@. Am I right? Is there any plan to implement this?

The current buffer clustering code can only do async writes. In fact, I am not quite sure what would constitute sync clustering, because the ability to delay a write is important to be able to cluster at all. Also, I am not sure that the lack of clustering is the biggest problem. IMO, the fact that each write is sync is the first problem there. It would be quite a lot of work to add tracking of the issued writes
Re: directory listing hangs in ufs state
On 12/22/2011 03:48, Kostik Belousov wrote:

[...]

The current buffer clustering code can only do async writes. In fact, I am not quite sure what would constitute sync clustering, because the ability to delay a write is important to be able to cluster at all. Also, I am not sure that the lack of clustering is the biggest problem. IMO, the fact that each write is sync is the first problem there. It would be quite a lot of work to add tracking of the issued writes to vm_object_page_clean() and down the stack. Esp. due to custom page write
Re: directory listing hangs in ufs state
On 15.12.2011 17:01, Kostik Belousov wrote:

[...]

The current buffer clustering code can only do async writes. In fact, I am not quite sure what would constitute sync clustering, because the ability to delay a write is important to be able to cluster at all. Also, I am not sure that the lack of clustering is the biggest problem. IMO, the fact that each write is sync is the first problem there. It would be quite a lot of work to add tracking of the issued writes to vm_object_page_clean() and down the stack, esp. due to custom page write vops in several fses. The only guarantee that POSIX requires from msync(MS_SYNC) is that the writes are
Re: directory listing hangs in ufs state
On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick free...@jdc.parodius.com wrote:

[...]

+alc@, kib@. Am I right? Is there any plan to implement this?

All of this would really fall into the hands of the mongodb people to figure out, if you ask me. But I should note that mmap() on BSD behaves and performs very differently than on, say, Linux; so if the authors wrote what they did intending it for Linux systems, I wouldn't be too surprised. :-)

https://jira.mongodb.org/browse/SERVER-663

I'm extremely confused by this problem. What you're describing above is that the process is stuck in biowr state for a long time, but what you stated originally was that the process was stuck in ufs state for a few minutes: Listing of the directory
Re: directory listing hangs in ufs state
On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:

[...]

+alc@, kib@. Am I right? Is there any plan to implement this?

The current buffer clustering code can only do async writes. In fact, I am not quite sure what would constitute sync clustering, because the ability to delay a write is important to be able to cluster at all. Also, I am not sure that the lack of clustering is the biggest problem. IMO, the fact that each write is sync is the first problem there. It would be quite a lot of work to add tracking of the issued writes to the
Re: directory listing hangs in ufs state
Hi Jeremy,

This is not a hardware problem; I've already checked that. I also ran fsck today and got no errors. After some more exploration of how mongodb works, I found that when the listing hangs, one of the mongodb threads is in biowr state for a long time. It periodically calls msync(MS_SYNC), according to the ktrace output. If I remove the msync() calls from mongodb, how often will the data be synced by the OS?

-- Andrey Zonov

On 14.12.2011 2:15, Jeremy Chadwick wrote:
On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote: Do you have any ideas what is going on, or how to catch the problem?

Assuming this isn't a file on the root filesystem, try booting the machine in single-user mode and using fsck -f on the filesystem in question. Can you verify there are no problems with the disk this file lives on as well (smartctl -a /dev/disk)? I doubt this is the problem, but thought I'd mention it.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: directory listing hangs in ufs state
On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:

[...]

I'm extremely confused by this problem. What you're describing above is that the process is stuck in biowr state for a long time, but what you stated originally was that the process was stuck in ufs state for a few minutes: "I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it. A couple of days ago I observed that a listing of the mongodb directory got stuck for a few minutes in ufs state."

Can we narrow down what we're talking about here? Does the process actually deadlock? Or are you concerned about performance implications? I know nothing about this mongodb software, but the reason it's calling msync() is that it wants to try to ensure that the data it changed in an mmap()-mapped page is reflected (fully written) on the disk.

This behaviour is fairly common within database software, but how often the software chooses to do this is entirely a design/implementation choice by the authors. Meaning: if mongodb is either 1) continually calling msync(), or 2) waiting too long before calling msync(), performance within the process will suffer. #1 could result in overall bad performance, while #2 could result in a process that's spending a lot of time doing I/O (flushing to disk) and therefore appears deadlocked, when in fact the kernel/subsystems are doing exactly what they were told to do.

Removing the msync() call could result in inconsistent data (possibly non-recoverable) if the mongodb software crashes, or if some other piece (thread or child? Not sure) expects to open a new fd on that file which has mmap()'d data.

This is about all I know. I would love to be able to tell you to consider a different database, but that seems like an excuse rather than an actual solution. I guess if all you're seeing is the process stall for long periods of time, but recover normally, then I would open a support ticket with the mongodb folks to discuss performance.

-- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |
Re: directory listing hangs in ufs state
On Wed, Dec 14, 2011 at 12:22 PM, Jeremy Chadwick free...@jdc.parodius.com wrote:

[...]

msync(2) indicates it's effectively deprecated (see BUGS). It looks like this is effectively an mmap version of fsync(2).

Yikes, I just looked at this man page. I'm afraid that the text in the BUGS section is highly misleading. The MS_INVALIDATE option should be obsolete for the reason given there. Under a strict reading of the applicable standard, FreeBSD could implement this option as a NOP. However, we treat it something like madvise(MADV_DONTNEED|FREE). In contrast, MS_SYNC is definitely not obsolete.

Alan

P.S. If someone wants to take a crack at fixing this man page, contact me off list.
Re: directory listing hangs in ufs state
On 14.12.2011 22:53, Alan Cox wrote: On Wed, Dec 14, 2011 at 12:22 PM, Jeremy Chadwick free...@jdc.parodius.com mailto:free...@jdc.parodius.com wrote: On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote: Hi Jeremy, This is not hardware problem, I've already checked that. I also ran fsck today and got no errors. After some more exploration of how mongodb works, I found that then listing hangs, one of mongodb thread is in biowr state for a long time. It periodically calls msync(MS_SYNC) accordingly to ktrace out. If I'll remove msync() calls from mongodb, how often data will be sync by OS? -- Andrey Zonov On 14.12.2011 2:15, Jeremy Chadwick wrote: On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote: Have you any ideas what is going on? or how to catch the problem? Assuming this isn't a file on the root filesystem, try booting the machine in single-user mode and using fsck -f on the filesystem in question. Can you verify there's no problems with the disk this file lives on as well (smartctl -a /dev/disk)? I'm doubting this is the problem, but thought I'd mention it. I have no real answer, I'm sorry. msync(2) indicates it's effectively deprecated (see BUGS). It looks like this is effectively a mmap-version of fsync(2). Yikes, I just looked at this man page. I'm afraid that the text in the BUGS section is highly misleading. The MS_INVALIDATE option should be obsolete for the reason given there. Under a strict reading of the applicable standard, FreeBSD could implement this option as a NOP. However, we treat it something like madvise(MADV_DONTNEED|FREE). In contrast, MS_SYNC is definitely not obsolete. Alan P.S. If someone wants to take a crack at fixing this man page, contact me off list. Please don't remove support for MS_INVALIDATE, this is only one way to purge disk cache. MADV_DONTNEED does nothing here in my experience. 
-- Andrey Zonov
Re: directory listing hangs in ufs state
On 14.12.2011 22:22, Jeremy Chadwick wrote: On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote: Hi Jeremy, This is not a hardware problem; I've already checked that. I also ran fsck today and got no errors. After some more exploration of how mongodb works, I found that when the listing hangs, one of mongodb's threads is in the biowr state for a long time. It periodically calls msync(MS_SYNC), according to the ktrace output. If I remove the msync() calls from mongodb, how often will the data be synced by the OS? -- Andrey Zonov On 14.12.2011 2:15, Jeremy Chadwick wrote: On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote: Do you have any ideas about what is going on, or how to catch the problem? Assuming this isn't a file on the root filesystem, try booting the machine in single-user mode and using fsck -f on the filesystem in question. Can you also verify there are no problems with the disk this file lives on (smartctl -a /dev/disk)? I doubt this is the problem, but thought I'd mention it. I have no real answer, I'm sorry. msync(2) indicates it's effectively deprecated (see BUGS). It looks like this is effectively an mmap version of fsync(2). I replaced msync(2) with fsync(2). Unfortunately, from the man pages it is not obvious that I can do this. Anyway, thanks.

I'm extremely confused by this problem. What you're describing above is that the process is stuck in the biowr state for a long time, but what you stated originally was that the process was stuck in the ufs state for a few minutes: Listing the directory with the mongodb files via ls(1) gets stuck in the ufs state while one of mongodb's threads is in the biowr state. It looks like the system holds a global lock on the file being msync(2)-ed and can't immediately return from the lstat(2) call. I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it. A couple of days ago I observed a listing of the mongodb directory getting stuck for a few minutes in the ufs state. Can we narrow down what we're talking about here? Does the process actually deadlock?
Or are you concerned about performance implications? I know nothing about this mongodb software, but the reason it's calling msync() is that it wants to ensure the data it changed in an mmap()-mapped page is reflected (fully written) on disk. This behaviour is fairly common within database software, but how often the software chooses to do this is entirely a design choice by the authors. Meaning: if mongodb is either 1) continually calling msync(), or 2) waiting too long before calling msync(), performance within the process will suffer. #1 could result in overall bad performance, while #2 could result in a process that spends a lot of time doing I/O (flushing to disk) and therefore appears deadlocked when in fact the kernel/subsystems are doing exactly what they were told to do. Removing the msync() call could result in inconsistent data (possibly non-recoverable) if the mongodb software crashes or if some other piece (thread or child? Not sure) expects to open a new fd on that file which has mmap()'d data.

Yes, I understand this clearly. I was thinking about some system tuning instead, but nothing came to mind.

This is about all I know. I would love to be able to tell you to consider a different database, but that seems like an excuse rather than an actual solution. I guess if all you're seeing is the process stalling for long periods of time but recovering normally, then I would open a support ticket with the mongodb folks to discuss performance.

-- Andrey Zonov
Re: directory listing hangs in ufs state
On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote: On 14.12.2011 22:22, Jeremy Chadwick wrote: On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote: Hi Jeremy, This is not a hardware problem; I've already checked that. I also ran fsck today and got no errors. After some more exploration of how mongodb works, I found that when the listing hangs, one of mongodb's threads is in the biowr state for a long time. It periodically calls msync(MS_SYNC), according to the ktrace output. If I remove the msync() calls from mongodb, how often will the data be synced by the OS? -- Andrey Zonov On 14.12.2011 2:15, Jeremy Chadwick wrote: On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote: Do you have any ideas about what is going on, or how to catch the problem? Assuming this isn't a file on the root filesystem, try booting the machine in single-user mode and using fsck -f on the filesystem in question. Can you also verify there are no problems with the disk this file lives on (smartctl -a /dev/disk)? I doubt this is the problem, but thought I'd mention it. I have no real answer, I'm sorry. msync(2) indicates it's effectively deprecated (see BUGS). It looks like this is effectively an mmap version of fsync(2). I replaced msync(2) with fsync(2). Unfortunately, from the man pages it is not obvious that I can do this. Anyway, thanks.

Sorry, that wasn't what I was implying. Let me try to explain differently. msync(2) looks, to me, like an mmap-specific version of fsync(2). Based on the man page, it seems that with msync() you can effectively guarantee flushing of certain pages within an mmap()'d region to disk. fsync() would cause **all** buffers/internal pages to be flushed to disk. One would need to look at the mongodb code to find out what it's actually doing with msync().
That is to say, if it's doing something like this (I probably have the semantics wrong -- I've never spent much time with mmap()): fd = open("/some/file", O_RDWR); ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); ret = msync(ptr, 65536, MS_SYNC); /* note the second msync() argument is a length in bytes covering the whole mapping */ then this, to me, would be mostly equivalent to: fd = open("/some/file", O_RDWR); ret = fsync(fd); /* fsync(2) takes a file descriptor, hence open(2) rather than fopen(3) */ Otherwise, if it's calling msync() only on a sub-range of the region ptr points to, then that may be more efficient (fewer pages to flush). The mmap() arguments -- specifically flags (see the man page) -- also play a role here. The one that catches my attention is MAP_NOSYNC. So you may need to look at the mongodb code to figure out what its mmap() call looks like. One might wonder why they don't just use open() with O_SYNC. I imagine that has to do with, again, performance; possibly they don't want all I/O to be synchronous, and would rather flush certain pages in the mmap()'d region to disk as needed. I see the legitimacy in that approach (vs. just using O_SYNC). There's really no easy way for me to tell you which is more efficient or better without spending a lot of time with a benchmarking program that tests all of this, *plus* an entire system (world) built with profiling. All of this would really fall into the hands of the mongodb people to figure out, if you ask me. But I should note that mmap() on BSD behaves and performs very differently than on, say, Linux; so if the authors wrote what they did intending it for Linux systems, I wouldn't be too surprised. :-)

I'm extremely confused by this problem. What you're describing above is that the process is stuck in the biowr state for a long time, but what you stated originally was that the process was stuck in the ufs state for a few minutes: Listing the directory with the mongodb files via ls(1) gets stuck in the ufs state while one of mongodb's threads is in the biowr state.
It looks like the system holds a global lock on the file being msync(2)-ed and can't immediately return from the lstat(2) call.

Thanks for the clarification -- yes, this helps. To some degree it makes sense: some piece of the filesystem or VFS layer is blocking intentionally. How to figure out which layer, I do not know. Kernel folks familiar with this aspect would need to chime in here.

I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it. A couple of days ago I observed a listing of the mongodb directory getting stuck for a few minutes in the ufs state.

Can we narrow down what we're talking about here? Does the process actually deadlock? Or are you concerned about performance implications? I know nothing about this mongodb software, but the reason it's calling msync() is that it wants to ensure the data it changed in an mmap()-mapped page is reflected (fully written) on disk. This behaviour is fairly common within database software, but how often the software chooses to do this is entirely a design choice by the authors. Meaning: if mongodb is either 1) continually calling msync(), or 2) waiting for too long a
directory listing hangs in ufs state
Hi, I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it. A couple of days ago I observed a listing of the mongodb directory getting stuck for a few minutes in the ufs state. I ran it again with ktrace and got the following (kdump -R):

91324 ls 0.03 CALL lstat(0x32c199c8,0x32c19950)
91324 ls 0.03 NAMI base.1
91324 ls 21.357255 STRU struct stat {dev=116, ino=45125633, mode=-rw-------, nlink=1, uid=922, gid=922, rdev=180226648, atime=1323709877, stime=1323776461, ctime=1323776461, birthtime=1314798592, size=134217728, blksize=16384, blocks=262304, flags=0x0 }
91324 ls 0.14 RET lstat 0

A kgdb backtrace of this process looked like this:

Thread 297 (Thread 100372):
#0  sched_switch (td=0xff0095c008c0, newtd=0xff000357b8c0, flags=) at /usr/src/sys/kern/sched_ule.c:1866
#1  0x80406696 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:449
#2  0x8043c072 in sleepq_wait (wchan=0xff0103aaf7f8, pri=80) at /usr/src/sys/kern/subr_sleepqueue.c:609
#3  0x803e4a5a in __lockmgr_args (lk=0xff0103aaf7f8, flags=2097408, ilk=0xff0103aaf820, wmesg=) at /usr/src/sys/kern/kern_lock.c:220
#4  0x8061239c in ffs_lock (ap=0xff84867fc550) at lockmgr.h:94
#5  0x806d2462 in VOP_LOCK1_APV (vop=0x80921fe0, a=0xff84867fc550) at vnode_if.c:1988
#6  0x804a58b7 in _vn_lock (vp=0xff0103aaf760, flags=2097152, file=0x80736e70 /usr/src/sys/kern/vfs_subr.c, line=2137) at vnode_if.h:859
#7  0x80498bc0 in vget (vp=0xff0103aaf760, flags=2097408, td=0xff0095c008c0) at /usr/src/sys/kern/vfs_subr.c:2137
#8  0x804845f4 in cache_lookup (dvp=0xff0095675b10, vpp=0xff84867fc910, cnp=0xff84867fc938) at /usr/src/sys/kern/vfs_cache.c:587
#9  0x80484a30 in vfs_cache_lookup (ap=) at /usr/src/sys/kern/vfs_cache.c:905
#10 0x806d2e7c in VOP_LOOKUP_APV (vop=0x80922820, a=0xff84867fc790) at vnode_if.c:123
#11 0x8048bc80 in lookup (ndp=0xff84867fc8e0) at vnode_if.h:54
#12 0x8048cf0e in namei (ndp=0xff84867fc8e0) at /usr/src/sys/kern/vfs_lookup.c:269
#13 0x8049c972 in kern_statat_vnhook (td=0xff0095c008c0, flag=) at /usr/src/sys/kern/vfs_syscalls.c:2346
#14 0x8049cbb5 in kern_statat (td=) at /usr/src/sys/kern/vfs_syscalls.c:2327
#15 0x8049cc7a in lstat (td=) at /usr/src/sys/kern/vfs_syscalls.c:2390
#16 0x8043e7dd in syscallenter (td=0xff0095c008c0, sa=0xff84867fcbb0) at /usr/src/sys/kern/subr_trap.c:326
#17 0x8066a5eb in syscall (frame=0xff84867fcc50) at /usr/src/sys/amd64/amd64/trap.c:916
#18 0x806517f2 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:384
#19 0x3298f75c in ?? ()

The very first idea was to turn off name caching (setting debug.vfscache to 0), but it didn't help. The second idea was to reboot, but that didn't help either. The directory looks fine: it has 10 files and 1 empty directory. Do you have any ideas about what is going on, or how to catch the problem?

-- Andrey Zonov
Re: directory listing hangs in ufs state
On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote: Hi, I've got STABLE-8 (r221983) with mongodb-1.8.1 installed on it. A couple of days ago I observed a listing of the mongodb directory getting stuck for a few minutes in the ufs state. I ran it again with ktrace and got the following (kdump -R):

91324 ls 0.03 CALL lstat(0x32c199c8,0x32c19950)
91324 ls 0.03 NAMI base.1
91324 ls 21.357255 STRU struct stat {dev=116, ino=45125633, mode=-rw-------, nlink=1, uid=922, gid=922, rdev=180226648, atime=1323709877, stime=1323776461, ctime=1323776461, birthtime=1314798592, size=134217728, blksize=16384, blocks=262304, flags=0x0 }
91324 ls 0.14 RET lstat 0

A kgdb backtrace of this process looked like this:

Thread 297 (Thread 100372):
#0  sched_switch (td=0xff0095c008c0, newtd=0xff000357b8c0, flags=) at /usr/src/sys/kern/sched_ule.c:1866
#1  0x80406696 in mi_switch (flags=260, newtd=0x0) at /usr/src/sys/kern/kern_synch.c:449
#2  0x8043c072 in sleepq_wait (wchan=0xff0103aaf7f8, pri=80) at /usr/src/sys/kern/subr_sleepqueue.c:609
#3  0x803e4a5a in __lockmgr_args (lk=0xff0103aaf7f8, flags=2097408, ilk=0xff0103aaf820, wmesg=) at /usr/src/sys/kern/kern_lock.c:220
#4  0x8061239c in ffs_lock (ap=0xff84867fc550) at lockmgr.h:94
#5  0x806d2462 in VOP_LOCK1_APV (vop=0x80921fe0, a=0xff84867fc550) at vnode_if.c:1988
#6  0x804a58b7 in _vn_lock (vp=0xff0103aaf760, flags=2097152, file=0x80736e70 /usr/src/sys/kern/vfs_subr.c, line=2137) at vnode_if.h:859
#7  0x80498bc0 in vget (vp=0xff0103aaf760, flags=2097408, td=0xff0095c008c0) at /usr/src/sys/kern/vfs_subr.c:2137
#8  0x804845f4 in cache_lookup (dvp=0xff0095675b10, vpp=0xff84867fc910, cnp=0xff84867fc938) at /usr/src/sys/kern/vfs_cache.c:587
#9  0x80484a30 in vfs_cache_lookup (ap=) at /usr/src/sys/kern/vfs_cache.c:905
#10 0x806d2e7c in VOP_LOOKUP_APV (vop=0x80922820, a=0xff84867fc790) at vnode_if.c:123
#11 0x8048bc80 in lookup (ndp=0xff84867fc8e0) at vnode_if.h:54
#12 0x8048cf0e in namei (ndp=0xff84867fc8e0) at /usr/src/sys/kern/vfs_lookup.c:269
#13 0x8049c972 in kern_statat_vnhook (td=0xff0095c008c0, flag=) at /usr/src/sys/kern/vfs_syscalls.c:2346
#14 0x8049cbb5 in kern_statat (td=) at /usr/src/sys/kern/vfs_syscalls.c:2327
#15 0x8049cc7a in lstat (td=) at /usr/src/sys/kern/vfs_syscalls.c:2390
#16 0x8043e7dd in syscallenter (td=0xff0095c008c0, sa=0xff84867fcbb0) at /usr/src/sys/kern/subr_trap.c:326
#17 0x8066a5eb in syscall (frame=0xff84867fcc50) at /usr/src/sys/amd64/amd64/trap.c:916
#18 0x806517f2 in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:384
#19 0x3298f75c in ?? ()

The very first idea was to turn off name caching (setting debug.vfscache to 0), but it didn't help. The second idea was to reboot, but that didn't help either. The directory looks fine: it has 10 files and 1 empty directory. Do you have any ideas about what is going on, or how to catch the problem?

Assuming this isn't a file on the root filesystem, try booting the machine in single-user mode and using fsck -f on the filesystem in question. Can you also verify there are no problems with the disk this file lives on (smartctl -a /dev/disk)? I doubt this is the problem, but thought I'd mention it.

--
| Jeremy Chadwick                          jdc at parodius.com |
| Parodius Networking                 http://www.parodius.com/ |
| UNIX Systems Administrator             Mountain View, CA, US |
| Making life hard for others since 1977.        PGP 4BD6C0CB  |