Re: Does anyone try kib's Sandy Bridge PCID patch (pcid.2.patch)?
On Mon, Jan 30, 2012 at 07:08:13PM +0800, Paul Ambrose wrote: ?? 2012??1??30?? 2:36??Kostik Belousov kostik...@gmail.com ?? On Mon, Jan 30, 2012 at 10:15:51AM +0800, Paul Ambrose wrote: I have two boxes, one is AMD Athlon 610e 2.4G with FreeBSD-current patched with pcid.2.patch? It works well without other issue and it seem the pcid patch does not affect other part of the kernel. The other one is Sandy Athlons do not have PCID and probably will never implement it. They use other tricks to get similar optimizations, transparently to the OS. Just curious, is this AMD similar optimizations Address Space Number (ASN) and Global flag US Patent 6,604,187. http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html This and the same-important next item 'The TLB Flush Filter' is what I referred to. I did not found anything about ASN in the AMD manual It is a transparent optimization, which does not require any OS support. Intel PCID is completely different, it shall be explicitely handled by OS. It is some consequence of the nested pages support, AFAIU. Bridge i5-2300 with FreeBSD 9 release patched with pcid.1.patch( the pcid.2.patch seems dependent on AVX and XSAVE stuffs which is available on -current). But it hangs up just in a few minutes. I doubt the nvidia-driver which is not recompiled with patched kernel is the root, I will check this out later, but does anyone meet similar problem? There are two many variations compared to the config I did tested. I do not see anything obvious in the changes between HEAD and stable/9 which could be blamed. Nvidia driver might be bigger suspect, but again, I am not aware of anything wrong with it. I have two question about the pcid.2.patch Item 2 is clean, I fixed it. For the item 1, I was only able to decipher the proposal to optimize the global shootdown handler to restore the %cr3 with bit 64 set to not invalidate current PCID. Is there some more changes ? yes, that is what I meant. I was wondering using another way that each process has different pcid in each active processor, just as the freebsd mips and powerpc uses. But obviously this way is more friendly to non-pcid x86 processor. Each vmspace (or pmap) has unique PCID with the patch, at least until PCID space (12bit) is not exhausted. To really exhaust it, you need 4095 processes, so it is unlikely but possible event with the current settings. pgpiiEGbdTChh.pgp Description: PGP signature
Re: [ptrace] please review follow fork/exec changes
On Mon, Jan 30, 2012 at 10:26:25AM -0800, Dmitry Mikulin wrote: On 01/28/2012 11:48 PM, Kostik Belousov wrote: On Fri, Jan 27, 2012 at 10:12:13AM -0800, Dmitry Mikulin wrote: Attached are 4 separate patches for each somewhat independent portion of the kernel work related to the follow-fork implementation. Ok, as I said, I think that vfork-fork.patch is just wrong. Gdb needs to be able to read/write process memory between the time the child is forked and exec is called (in the vfork case). Without the change it causes a kernel panic when gdb tries to read/write process memory. Since my understanding of the kernel is a bit limited, it was the best I could do at the time. I will send more details about the panic once I get a working fbsd system again. Maybe there's a better way to deal with the panic. Please provide more details, I am looking forward for the panic message and backtrace. Lets postpone discussion of the orphan.patch for later. OK. The follow-fork.patch and follow-exec.patch make me wonder, and I already expressed my doubts. IMO, all features, except one bug, are already presented in the stock src. More, suggested follow-{fork,exec} patches break the SCE/SCX tracers notification of fork and exec events, since TDB_FORK and TBD_EXEC flags are consumed before syscall returns (I also said this previously). Namely, if the process is being debugged, the successfull [f]execve() causes unconditional stop even. This makes PT_FOLLOW_EXEC unneccessary. Existing PT_FOLLOW_FORK implementation indeed has a bug, which was not revealed by my testing during the development, because I only tested SCE/SCX scenario. Namely, if PT_FOLLOW_FORK is requested, but the next stop is not SCX, then follow-fork notification is not sent. After this nit is fixed, PT_FOLLOW_FORK caller gets stops for the child creation. Child is put into stop state as needed to not loose it. I think this will fix only a part of the problem, the one that relates to PT_CONTINUE. I still need the change that forces a stop in both child and parent on fork(). Without my changes the notification is generated in the child but not in the parent. I need to be able to have both processes stopped in gdb in order to clean up and detach from the parent, and initialize and attach to the child. The main reason I need both processes stopped is that gdb has to be able to read/write into both processes address space and registers. Ideally I would like to have a single event generated for fork() at a point where both child and parent are stopped and available for ptrace read/write requests. Do you think it's possible? The lack of the notification for parent is exactly what the patch I posted fixes. More exactly, it is the lack of notification for parent with PT_CONTINUE loop. I will commit this today. Regarding a single notification. Currently, parent arrives at the syscall return code only after the child is attached to the debugger. See the cv_wait() in kern_fork.c:739. In other words, if you get the PL_FLAG_FORK, the child is already attached (at last it shall be). My scescx.c code illustrates this well, IMO. You still get a separate stop from the child, but I do not see how is it harmful. pgpDwYkpcgypf.pgp Description: PGP signature
Re: Does anyone try kib's Sandy Bridge PCID patch (pcid.2.patch)?
On Mon, Jan 30, 2012 at 10:15:51AM +0800, Paul Ambrose wrote: I have two boxes, one is AMD Athlon 610e 2.4G with FreeBSD-current patched with pcid.2.patch? It works well without other issue and it seem the pcid patch does not affect other part of the kernel. The other one is Sandy Athlons do not have PCID and probably will never implement it. They use other tricks to get similar optimizations, transparently to the OS. Bridge i5-2300 with FreeBSD 9 release patched with pcid.1.patch( the pcid.2.patch seems dependent on AVX and XSAVE stuffs which is available on -current). But it hangs up just in a few minutes. I doubt the nvidia-driver which is not recompiled with patched kernel is the root, I will check this out later, but does anyone meet similar problem? There are two many variations compared to the config I did tested. I do not see anything obvious in the changes between HEAD and stable/9 which could be blamed. Nvidia driver might be bigger suspect, but again, I am not aware of anything wrong with it. I have two question about the pcid.2.patch Item 2 is clean, I fixed it. For the item 1, I was only able to decipher the proposal to optimize the global shootdown handler to restore the %cr3 with bit 64 set to not invalidate current PCID. Is there some more changes ? Thank you for looking at the patch. pgpkG36NzMMe1.pgp Description: PGP signature
Re: [ptrace] please review follow fork/exec changes
On Fri, Jan 27, 2012 at 10:12:13AM -0800, Dmitry Mikulin wrote: Attached are 4 separate patches for each somewhat independent portion of the kernel work related to the follow-fork implementation. Ok, as I said, I think that vfork-fork.patch is just wrong. Lets postpone discussion of the orphan.patch for later. The follow-fork.patch and follow-exec.patch make me wonder, and I already expressed my doubts. IMO, all features, except one bug, are already presented in the stock src. More, suggested follow-{fork,exec} patches break the SCE/SCX tracers notification of fork and exec events, since TDB_FORK and TBD_EXEC flags are consumed before syscall returns (I also said this previously). Namely, if the process is being debugged, the successfull [f]execve() causes unconditional stop even. This makes PT_FOLLOW_EXEC unneccessary. Existing PT_FOLLOW_FORK implementation indeed has a bug, which was not revealed by my testing during the development, because I only tested SCE/SCX scenario. Namely, if PT_FOLLOW_FORK is requested, but the next stop is not SCX, then follow-fork notification is not sent. After this nit is fixed, PT_FOLLOW_FORK caller gets stops for the child creation. Child is put into stop state as needed to not loose it. I updated the test program I use to test this functionality, see http://people.freebsd.org/~kib/misc/scescx.c The default or -s flag causes it to use SCE/SCX tracing, while -c flag switches it to use PT_CONTINUE tracing, which should be the kind of loop performed by normal debuggers. You can see the exec/fork events and child auto-attach illustrated with this test. Can you, please, provide the test-case which illustrates the omissions in the existing API (with the patch below applied), if any ? diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c index bba4479..75328f6 100644 --- a/sys/kern/subr_syscall.c +++ b/sys/kern/subr_syscall.c @@ -212,7 +212,8 @@ syscallret(struct thread *td, int error, struct syscall_args *sa __unused) * executes. If debugger requested tracing of syscall * returns, do it now too. */ - if (traced ((td-td_dbgflags TDB_EXEC) != 0 || + if (traced + ((td-td_dbgflags (TDB_FORK | TDB_EXEC)) != 0 || (p-p_stops S_PT_SCX) != 0)) ptracestop(td, SIGTRAP); td-td_dbgflags = ~(TDB_SCX | TDB_EXEC | TDB_FORK); pgp116s22XAVu.pgp Description: PGP signature
Re: knlist_empty locking fix
On Thu, Jan 26, 2012 at 01:03:26PM -0800, Doug Ambrisko wrote: Ran into problems with running kqueue/aio with WITNESS etc. Sometimes things are locked sometimes not. knlist_remove is called telling it whether it is locked or not ie: extern voidknlist_remove(struct knlist *knl, struct knote *kn, int islocked); so I changed: extern int knlist_empty(struct knlist *knl); to: extern int knlist_empty(struct knlist *knl, int islocked); and then updated things to reflect that following what that state of the lock for knlist_remove. If it is not locked, it gets a lock and frees it after. This now fixes a panic when a process using kqueue/aio is killed on shutdown with WITNESS. It changes an API/ABI so it probably can't merged back. If there are no objections then I'll commit it. Change to knlist_init() does not make sense at all, the knlist shall not be exposed to other consumers during initialization, so no need to exclude the parallel access. Regarding the knlist_empty(), I propose to keep it as is. Locking the knlist inside knlist_empty() does not make sense, because lock is immediately dropped afterward, and relocked for remove. This way, the entry could be removed from the list meantime (can it, really ?). I think that you should take a lock around the whole if() {} statement, and call knlist_remove with locked == 1. pgpCSJM8dw3vN.pgp Description: PGP signature
Re: knlist_empty locking fix
On Fri, Jan 27, 2012 at 09:42:58AM -0800, Doug Ambrisko wrote: Andrew Boyer writes: | On Jan 27, 2012, at 12:15 PM, Doug Ambrisko wrote: | | John Baldwin writes: | | Agreed, I think the missing locking should just be added to the aio code. | | Okay so then just: | | Index: vfs_aio.c | === | RCS file: /usr/local/cvsroot/freebsd/src/sys/kern/vfs_aio.c,v | retrieving revision 1.243.2.3.4.1 | diff -u -p -r1.243.2.3.4.1 vfs_aio.c | --- vfs_aio.c 21 Dec 2010 17:09:25 - 1.243.2.3.4.1 | +++ vfs_aio.c 27 Jan 2012 17:07:11 - | @@ -2509,9 +2509,12 @@ static void | filt_aiodetach(struct knote *kn) | { |struct aiocblist *aiocbe = kn-kn_ptr.p_aio; | + struct knlist *knl = aiocbe-klist; | | - if (!knlist_empty(aiocbe-klist)) | - knlist_remove(aiocbe-klist, kn, 0); | + knl-kl_lock(knl-kl_lockarg); | + if (!knlist_empty(knl)) | + knlist_remove(knl, kn, 1); | + knl-kl_unlock(knl-kl_lockarg); | } | | /* kqueue filter function */ | | I was trying to be consistant with knlist_remove but this is a much | smaller change that can be merge to older branches. | | Does filt_liodetach() need the same treatment? Yes, I had that in the original. I updated that and optimized the knl to just get the structure needed. Index: vfs_aio.c === RCS file: /usr/local/cvsroot/freebsd/src/sys/kern/vfs_aio.c,v retrieving revision 1.243.2.3.4.1 diff -u -p -r1.243.2.3.4.1 vfs_aio.c --- vfs_aio.c 21 Dec 2010 17:09:25 - 1.243.2.3.4.1 +++ vfs_aio.c 27 Jan 2012 17:35:47 - @@ -2508,10 +2508,12 @@ filt_aioattach(struct knote *kn) static void filt_aiodetach(struct knote *kn) { - struct aiocblist *aiocbe = kn-kn_ptr.p_aio; + struct knlist *knl = kn-kn_ptr.p_aio-klist; - if (!knlist_empty(aiocbe-klist)) - knlist_remove(aiocbe-klist, kn, 0); + knl-kl_lock(knl-kl_lockarg); + if (!knlist_empty(knl)) + knlist_remove(knl, kn, 1); + knl-kl_unlock(knl-kl_lockarg); } /* kqueue filter function */ @@ -2553,10 +2555,12 @@ filt_lioattach(struct knote *kn) static void filt_liodetach(struct knote *kn) { - struct aioliojob * lj = kn-kn_ptr.p_lio; + struct knlist *knl = kn-kn_ptr.p_lio-klist; It is easy to be style-compiant there and move initialization of knl after the blank line. Do you need two different functions now ? I think you can leave just one. Otherwise looks fine. - if (!knlist_empty(lj-klist)) - knlist_remove(lj-klist, kn, 0); + knl-kl_lock(knl-kl_lockarg); + if (!knlist_empty(knl)) + knlist_remove(knl, kn, 1); + knl-kl_unlock(knl-kl_lockarg); } /* kqueue filter function */ Thanks, Doug A. pgpoFy8bRMgwv.pgp Description: PGP signature
Re: [ptrace] please review follow fork/exec changes
On Wed, Jan 25, 2012 at 03:48:04PM -0800, Dmitry Mikulin wrote: Thanks for taking a look at it. I'll try to explain the changes the best I can: the work was done nearly 6 months ago... I would certainly appreciate some more words describing the changes. What is the goal of introducing the PT_FOLLOW_EXEC ? To not force the debugger to filter all syscall exits to see exec events ? PT_FOLLOW_EXEC was added to trace the exec() family of calls. When it's enabled, the kernel will generate a trap to give debugger a chance to clean up old process info and initialize its state with the new binary. It was more a question, why PT_FLAG_EXEC is not enough. Why did you moved the stopevent/ptracestop for exec events from syscallret() to kern_execve() ? If moving, why the handling of TDB_EXEC is not removed from syscallret() ? I do not think that TDB_EXEC can be seen there after the patch. The same question about TDB_FORK. The reason I moved the event notifications from syscallret() is because the debugger is interested successful completion of either fork or exec, not just the fact that they have been called. If, say, a call to fork() failed, as far as debugger is concerned, fork() might as well had never been called. Having a ptracestop in syscallret triggered a trap on every return from fork without telling the debugger whether a new process had been created or not. Same goes for exec(). PT_FLAG_EXEC is only set if an exec-kind of syscall succeeded. The same is true for PT_FLAG_FORKED, the flag is set only if a new child was successfully created. Looking at the code now I think I should have removed TDB_EXEC/TDB_FORK processing from syscallret(). I'll do that and verify that it works as expected. If possible, I would greatly prefer to have fork changes separated. You mean separate fork changes into a separate patch from exec changes? Yes. Even more, it seems that fork changes should be separated into bug fixes and new functionality. I doubt that disallowing RFMEM while tracing is the right change. It may be to fix some (undescribed) bug, but RFMEM is documented behaviour used not only for vfork(2), and the change just breaks rfork(2) for debugged processes. Even vfork() should not be broken, since I believe there are programs that rely on the vfork() model, in particular, C shell. It will be broken if vfork() is substituted by fork(). The fact that it breaks only under debugger will make it esp. confusing. I need to dig up the details since I can't recall the exact reason for forcing fork() in cases of user calls to vfork() under gdb. I believe it had to do with when child process memory was available for writing in case of vfork() and it wasn't early enough to complete the switch over from parent to child in gdb. After consulting with our internal kernel experts we decided that doing fork() instead of vfork() under gdb is a low impact change. Why do we need to have TDB_FORK set for td2 ? The debugger needs to intercept fork() in both parent and child so it can detach from the old process and attach to the new one. Maybe it'll make more sense in the context of gdb changes. Should I send them too? Don't think Marcel included that patch... No, I think the kernel changes are enough for now. Does the orphan list change intended to not lost the child after fork ? But the child shall be traced, so debugger would get the SIGTRAP on the attach on fork returning to usermode. I remember that I explicitely tested this when adding followfork changes. Yes, the debugger gets SIGTRAPs. The problem arises when the real parent of the forked process has the code to collect termination status. Since attaching to a process changes the parent/child relationships, we need to keep track of the children lost due to re-parenting so we can properly attribute their exit status to the real parent. I do not quite understand it. From the description it sounds as the problem that is not related to follow fork changes at all, and shall be presented regardless of the follow fork. If yes, I think this shall be separated into a standalone patch. pgphUacrPo7lj.pgp Description: PGP signature
Re: panic: No NCF_TS
On Mon, Jan 23, 2012 at 06:54:49AM -0800, John Baldwin wrote: On 1/22/12 7:05 PM, Kostik Belousov wrote: On Mon, Jan 23, 2012 at 05:36:42AM +0400, Yuri Pankov wrote: Seems to be reproducible here running r230467 as the NFS client and r230135 as NFS server. NFSv4 not enabled. # mount [...] sirius:/data/distfiles on /usr/ports/distfiles (nfs) # /usr/bin/env /usr/bin/fetch -AFpr -S 4682084 -o /usr/ports/distfiles/sqlite-src-3071000.zip http://www.sqlite.org/sqlite-src-3071000.zip /usr/ports/distfiles/sqlite-src-3071000.zip 100% of 4572 kB 379 kBps 00m00s # rm /usr/ports/distfiles/sqlite-src-3071000.zip immediately followed by: panic: No NCF_TS cpuid = 2 KDB: enter: panic [ thread pid 1603 tid 100494 ] Stopped at kdb_enter+0x3e: movq$0,kdb_why db bt Tracing pid 1603 tid 100494 td 0xfe0089585460 kdb_enter() at kdb_enter+0x3e panic() at panic+0x245 cache_lookup_times() at cache_lookup_times+0x6b5 nfs_lookup() at nfs_lookup+0x190 VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x8b lookup() at lookup+0x7e9 namei() at namei+0x74c kern_statat_vnhook() at kern_statat_vnhook+0x90 sys_lstat() at sys_lstat+0x30 amd64_syscall() at amd64_syscall+0x221 Xfast_syscall() at Xfast_syscall+0xfb --- syscall (190, FreeBSD ELF64, sys_lstat), rip = 0x80093ff3c, rsp = 0x7fffd8d8, rbp = 0x7fffd979 --- Yes, my bad. I wrote the r230441 with the assumption that filesystems are consistent in their use of cache_enter_time(). And my net-booting test box did not catched this, which is at least interesting. Please try the change below. Actually, it should be enough to only apply the changes for fs/nfsclient (or nfsclient/ if you use old nfs). If this does not help, then please try the whole patch. I think we should have the existing assertion. If cache_lookup_times() is called with a timestamp requested, then returning a name cache entry without a timestamp is just going to result in that name cache entry not being used. Panic'ing in that case is correct. With regard to the NFS changes below, all of these are bugs that didn't really work right before. Specifically, adding a positive entry without setting n_ctime previously would have just resulted in the name cache entry being purged on the next lookup anyway. Keep in mind that the timestamp for a give name cache entry in NFS needs to match an actual timestamp returned by the NFS server as post-op attributes in some RPC. Using the timestamp from vfs_timestamp() is completely bogus. Instead, the timestamp for a negative entry needs to be the mtime of the parent directory. If we don't have that timestamp handy, then we should just not add a namecache entry at all. Also, the vap-va_ctime used below for mknod() and create() is not a timestamp from the server, so it is also suspect. I can look at this in more detail on Wednesday, but my best guess is that nearly all (if not all) of these cache_enter() calls will simply need to be removed. Note that other filesystems like UFS don't bother creating name cache entries for things like create or mknod. It's debatable if the NFS client should even be creating any name cache entries outside of lookup and when handling a READDIR+ reply. Ok, next try. I did removed the cache_enter calls for old nfs client, but it seems that the calls can be kept for the new client. diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c index 2747191..709cd8d 100644 --- a/sys/fs/nfsclient/nfs_clvnops.c +++ b/sys/fs/nfsclient/nfs_clvnops.c @@ -1431,8 +1431,6 @@ nfs_mknodrpc(struct vnode *dvp, struct vnode **vpp, struct componentname *cnp, } } if (!error) { - if ((cnp-cn_flags MAKEENTRY)) - cache_enter(dvp, newvp, cnp); *vpp = newvp; } else if (NFS_ISV4(dvp)) { error = nfscl_maperr(cnp-cn_thread, error, vap-va_uid, @@ -1591,7 +1589,7 @@ again: } if (!error) { if (cnp-cn_flags MAKEENTRY) - cache_enter(dvp, newvp, cnp); + cache_enter_time(dvp, newvp, cnp, vap-va_ctime); *ap-a_vpp = newvp; } else if (NFS_ISV4(dvp)) { error = nfscl_maperr(cnp-cn_thread, error, vap-va_uid, @@ -1966,8 +1964,9 @@ nfs_link(struct vop_link_args *ap) * must care about lookup caching hit rate, so... */ if (VFSTONFS(vp-v_mount)-nm_negnametimeo != 0 - (cnp-cn_flags MAKEENTRY)) - cache_enter(tdvp, vp, cnp); + (cnp-cn_flags MAKEENTRY) dattrflag) { + cache_enter_time(tdvp, vp, cnp, dnfsva.na_mtime); + } if (error NFS_ISV4(vp)) error = nfscl_maperr(cnp-cn_thread, error, (uid_t)0, (gid_t)0); @@ -2023,15 +2022,6 @@ nfs_symlink(struct vop_symlink_args *ap) error
Re: nullfs broken on powerpc
On Tue, Jan 24, 2012 at 06:31:52PM +0100, Milan Obuch wrote: Hi, as I succesfully installed FreeBSD on Mac Mini (older powerpc model with G4 CPU) I tried to use mount_nullfs to share some files for more systems. I use this method for years on i386 and amd64 platforms for years now to share sources and other files. Basically I want to have small system specific slice/partition and large shared data area. System sources are in shared area as well as some working area (think /usr/src and /usr/obj directories). This does not work with powerpc for me. With sources csup'ped this morning, full system rebuild with GENERIC kernel, it is enough for me to issue mount_nullfs /data/src10 /usr/src csup /usr/share/examples/cvsup/standard-supfile and system panic occurs, with following on system console: panic: mtx_lock() of spin mutex (null) @ /usr/src/sys/kern/vfs_subr.c:2670 cpuid = 0 KDB: enter: panic [ thread pid 1442 tid 100095 ] Stopped at 0x40f734: addi r0, r0, 0x0 db At this point, I am able to interact with system, the question for me is what I want to get from it :) I tried bt with following result: Tracing pid 1442 tid 100095 td 0x2d6b000 0xe22c26d0: at panic+0x274 0xe22c2730: at _mtx_lock_flags+0xc4 0xe22c2760: at vgonel+0x330 0xe22c27b0: at vrecycle+0x54 0xe22c27d0: at null_inactive+0x30 0xe22c27f0: at VOP_INACTIVE_APV+0xdc 0xe22c2810: at vinactive+0x98 0xe22c2850: at vputx+0x344 0xe22c28a0: at vput+0x18 0xe22c28c0: at kern_statat_vnhook+0x108 0xe22c29d0: at kern_statat+0x18 0xe22c29f0: at kern_lstat+0x2c 0xe22c2a10: at sys_lstat+0x30 0xe22c2a90: at trap+0x388 0xe22c2b60: at powerpc_interrupt+0x108 0xe22c2b90: user SC trap by _end+0x40d88c70: srr1=0xd032 r1=0xffaf9a70 cr=0x28004044 xer=0x2000 ctr=0x41a0ac40 db Does this shed any light for someone with more knowledge here? My gut feeling is there is some endianness issue at play, the same nullfs usage works for me flawlessly on both i386 and amd64 systems, so it could not be 32 vs 64 bit issue at least. At line 2670 of /usr/src/sys/kern/vfs_subr.c I see end of function void vgonel(struct vnode *vp) VI_LOCK(vp); vp-v_vnlock = vp-v_lock; vp-v_op = dead_vnodeops; vp-v_tag = none; vp-v_type = VBAD; } so the question seems to be reduced to 'why is vp null?' or is my small attempt on analyse flawed... I do not think that the vp is null. It more look like the *vp memory was zeroed. This has very low chances of being related to endianess, and more like a kernel memory corruption. Take a dump and print the content of *vp. pgpNGYFTFCBWD.pgp Description: PGP signature
Re: nullfs broken on powerpc
On Wed, Jan 25, 2012 at 08:50:41PM +0100, Milan Obuch wrote: On Wed, 25 Jan 2012 14:21:23 +0200 Kostik Belousov kostik...@gmail.com wrote: On Tue, Jan 24, 2012 at 06:31:52PM +0100, Milan Obuch wrote: Hi, [ snip ] This does not work with powerpc for me. With sources csup'ped this morning, full system rebuild with GENERIC kernel, it is enough for me to issue mount_nullfs /data/src10 /usr/src csup /usr/share/examples/cvsup/standard-supfile and system panic occurs, with following on system console: panic: mtx_lock() of spin mutex (null) @ /usr/src/sys/kern/vfs_subr.c:2670 cpuid = 0 KDB: enter: panic [ thread pid 1442 tid 100095 ] Stopped at 0x40f734: addi r0, r0, 0x0 db At this point, I am able to interact with system, the question for me is what I want to get from it :) I tried bt with following result: Tracing pid 1442 tid 100095 td 0x2d6b000 0xe22c26d0: at panic+0x274 0xe22c2730: at _mtx_lock_flags+0xc4 0xe22c2760: at vgonel+0x330 0xe22c27b0: at vrecycle+0x54 0xe22c27d0: at null_inactive+0x30 0xe22c27f0: at VOP_INACTIVE_APV+0xdc 0xe22c2810: at vinactive+0x98 0xe22c2850: at vputx+0x344 0xe22c28a0: at vput+0x18 0xe22c28c0: at kern_statat_vnhook+0x108 0xe22c29d0: at kern_statat+0x18 0xe22c29f0: at kern_lstat+0x2c 0xe22c2a10: at sys_lstat+0x30 0xe22c2a90: at trap+0x388 0xe22c2b60: at powerpc_interrupt+0x108 0xe22c2b90: user SC trap by _end+0x40d88c70: srr1=0xd032 r1=0xffaf9a70 cr=0x28004044 xer=0x2000 ctr=0x41a0ac40 db Does this shed any light for someone with more knowledge here? My gut feeling is there is some endianness issue at play, the same nullfs usage works for me flawlessly on both i386 and amd64 systems, so it could not be 32 vs 64 bit issue at least. At line 2670 of /usr/src/sys/kern/vfs_subr.c I see end of function void vgonel(struct vnode *vp) VI_LOCK(vp); vp-v_vnlock = vp-v_lock; vp-v_op = dead_vnodeops; vp-v_tag = none; vp-v_type = VBAD; } so the question seems to be reduced to 'why is vp null?' or is my small attempt on analyse flawed... I do not think that the vp is null. It more look like the *vp memory was zeroed. This has very low chances of being related to endianess, and more like a kernel memory corruption. Take a dump and print the content of *vp. How could I look into memory? I found page http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-online-ddb.html and I can see registers (show reg), use x with absolute addresses, but something like 'x vp' tells just 'Symbol not known' - should I somehow load symbol table into memory? But backtrace shows function names... or should I somehow modify GENERIC kernel to include more debugging info? Kernel debugging is a bit new for me, even if I can write simple modification into kernel, but only in some special (and narrow) area of code... You shall/could take a dump and then use kgdb to look at *vp. If doing from ddb, you need to look at the disassembly of the function to undestand where to find the vp (probably in some register). pgp34H56yP2bJ.pgp Description: PGP signature
Re: nullfs broken on powerpc
On Wed, Jan 25, 2012 at 10:00:26PM +0100, Andreas Tobler wrote: On 25.01.12 21:29, Eitan Adler wrote: On Wed, Jan 25, 2012 at 2:50 PM, Milan Obuchfreebsd-curr...@dino.sk wrote: On Wed, 25 Jan 2012 14:21:23 +0200 Kostik Belousovkostik...@gmail.com wrote: On Tue, Jan 24, 2012 at 06:31:52PM +0100, Milan Obuch wrote: Hi, [ snip ] This does not work with powerpc for me. With sources csup'ped this morning, full system rebuild with GENERIC kernel, it is enough for me to issue mount_nullfs /data/src10 /usr/src csup /usr/share/examples/cvsup/standard-supfile and system panic occurs, with following on system console: panic: mtx_lock() of spin mutex (null) @ /usr/src/sys/kern/vfs_subr.c:2670 cpuid = 0 KDB: enter: panic [ thread pid 1442 tid 100095 ] Stopped at 0x40f734: addi r0, r0, 0x0 db At this point, I am able to interact with system, the question for me is what I want to get from it :) I tried bt with following result: Tracing pid 1442 tid 100095 td 0x2d6b000 0xe22c26d0: at panic+0x274 0xe22c2730: at _mtx_lock_flags+0xc4 0xe22c2760: at vgonel+0x330 0xe22c27b0: at vrecycle+0x54 0xe22c27d0: at null_inactive+0x30 0xe22c27f0: at VOP_INACTIVE_APV+0xdc 0xe22c2810: at vinactive+0x98 0xe22c2850: at vputx+0x344 0xe22c28a0: at vput+0x18 0xe22c28c0: at kern_statat_vnhook+0x108 0xe22c29d0: at kern_statat+0x18 0xe22c29f0: at kern_lstat+0x2c 0xe22c2a10: at sys_lstat+0x30 0xe22c2a90: at trap+0x388 0xe22c2b60: at powerpc_interrupt+0x108 0xe22c2b90: user SC trap by _end+0x40d88c70: srr1=0xd032 r1=0xffaf9a70 cr=0x28004044 xer=0x2000 ctr=0x41a0ac40 db Does this shed any light for someone with more knowledge here? My gut feeling is there is some endianness issue at play, the same nullfs usage works for me flawlessly on both i386 and amd64 systems, so it could not be 32 vs 64 bit issue at least. At line 2670 of /usr/src/sys/kern/vfs_subr.c I see end of function void vgonel(struct vnode *vp) VI_LOCK(vp); vp-v_vnlock =vp-v_lock; vp-v_op =dead_vnodeops; vp-v_tag = none; vp-v_type = VBAD; } so the question seems to be reduced to 'why is vp null?' or is my small attempt on analyse flawed... I do not think that the vp is null. It more look like the *vp memory was zeroed. This has very low chances of being related to endianess, and more like a kernel memory corruption. Take a dump and print the content of *vp. How could I look into memory? I found page http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-online-ddb.html and I can see registers (show reg), use x with absolute addresses, but something like 'x vp' tells just 'Symbol not known' - should I somehow load symbol table into memory? But backtrace shows function names... or should I somehow modify GENERIC kernel to include more debugging info? Kernel debugging is a bit new for me, even if I can write simple modification into kernel, but only in some special (and narrow) area of code... From ddb write 'call doadump'. Provided you have a proper dump device set up in rc.conf it should work. You could then use kgdb from a running computer to analyze the dump in more detail. This only works if your target is booke, AIM (Apple based machines) do not have the 'call doadump' implemented yet. It is somewhere on my long todo list. FWIW, it is not 'call doadump', it is just 'dump' for some time. I think calling doadump does not work. pgprHbTfgnpRg.pgp Description: PGP signature
Re: [ptrace] please review follow fork/exec changes
On Tue, Jan 24, 2012 at 01:36:55PM -0800, Marcel Moolenaar wrote: All, Please review the attached changes (done by Dmitry -- CC'd) to add support for PT_FOLLOW_EXEC and cleanup PT_FOLLOW_FORK. I'll commit this when there are no comments/objections. Thanks, I would certainly appreciate some more words describing the changes. What is the goal of introducing the PT_FOLLOW_EXEC ? To not force the debugger to filter all syscall exits to see exec events ? Why did you moved the stopevent/ptracestop for exec events from syscallret() to kern_execve() ? If moving, why the handling of TDB_EXEC is not removed from syscallret() ? I do not think that TDB_EXEC can be seen there after the patch. The same question about TDB_FORK. If possible, I would greatly prefer to have fork changes separated. I doubt that disallowing RFMEM while tracing is the right change. It may be to fix some (undescribed) bug, but RFMEM is documented behaviour used not only for vfork(2), and the change just breaks rfork(2) for debugged processes. Even vfork() should not be broken, since I believe there are programs that rely on the vfork() model, in particular, C shell. It will be broken if vfork() is substituted by fork(). The fact that it breaks only under debugger will make it esp. confusing. Why do we need to have TDB_FORK set for td2 ? Does the orphan list change intended to not lost the child after fork ? But the child shall be traced, so debugger would get the SIGTRAP on the attach on fork returning to usermode. I remember that I explicitely tested this when adding followfork changes. pgp3FXLhsTAQ8.pgp Description: PGP signature
Re: panic: No NCF_TS
On Mon, Jan 23, 2012 at 05:36:42AM +0400, Yuri Pankov wrote: Seems to be reproducible here running r230467 as the NFS client and r230135 as NFS server. NFSv4 not enabled. # mount [...] sirius:/data/distfiles on /usr/ports/distfiles (nfs) # /usr/bin/env /usr/bin/fetch -AFpr -S 4682084 -o /usr/ports/distfiles/sqlite-src-3071000.zip http://www.sqlite.org/sqlite-src-3071000.zip /usr/ports/distfiles/sqlite-src-3071000.zip 100% of 4572 kB 379 kBps 00m00s # rm /usr/ports/distfiles/sqlite-src-3071000.zip immediately followed by: panic: No NCF_TS cpuid = 2 KDB: enter: panic [ thread pid 1603 tid 100494 ] Stopped atkdb_enter+0x3e: movq$0,kdb_why db bt Tracing pid 1603 tid 100494 td 0xfe0089585460 kdb_enter() at kdb_enter+0x3e panic() at panic+0x245 cache_lookup_times() at cache_lookup_times+0x6b5 nfs_lookup() at nfs_lookup+0x190 VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x8b lookup() at lookup+0x7e9 namei() at namei+0x74c kern_statat_vnhook() at kern_statat_vnhook+0x90 sys_lstat() at sys_lstat+0x30 amd64_syscall() at amd64_syscall+0x221 Xfast_syscall() at Xfast_syscall+0xfb --- syscall (190, FreeBSD ELF64, sys_lstat), rip = 0x80093ff3c, rsp = 0x7fffd8d8, rbp = 0x7fffd979 --- Yes, my bad. I wrote the r230441 with the assumption that filesystems are consistent in their use of cache_enter_time(). And my net-booting test box did not catched this, which is at least interesting. Please try the change below. Actually, it should be enough to only apply the changes for fs/nfsclient (or nfsclient/ if you use old nfs). If this does not help, then please try the whole patch. Even if not needed for correctness, I think vfs_cache.c change could be made useful by adding KASSERT to it. commit 9771eb4739170d014fcebfbad07bfed4076c6d85 Author: Konstantin Belousov kos...@pooma.home Date: Mon Jan 23 04:36:08 2012 +0200 Make NCF_TS mismatch not fatal. Consistently use cache_enter_time in nfs clients. diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c index 2747191..2841647 100644 --- a/sys/fs/nfsclient/nfs_clvnops.c +++ b/sys/fs/nfsclient/nfs_clvnops.c @@ -1432,7 +1432,7 @@ nfs_mknodrpc(struct vnode *dvp, struct vnode **vpp, struct componentname *cnp, } if (!error) { if ((cnp-cn_flags MAKEENTRY)) - cache_enter(dvp, newvp, cnp); + cache_enter_time(dvp, newvp, cnp, vattr.va_ctime); *vpp = newvp; } else if (NFS_ISV4(dvp)) { error = nfscl_maperr(cnp-cn_thread, error, vap-va_uid, @@ -1591,7 +1591,7 @@ again: } if (!error) { if (cnp-cn_flags MAKEENTRY) - cache_enter(dvp, newvp, cnp); + cache_enter_time(dvp, newvp, cnp, vap-va_ctime); *ap-a_vpp = newvp; } else if (NFS_ISV4(dvp)) { error = nfscl_maperr(cnp-cn_thread, error, vap-va_uid, @@ -1923,6 +1923,7 @@ nfs_link(struct vop_link_args *ap) struct componentname *cnp = ap-a_cnp; struct nfsnode *np, *tdnp; struct nfsvattr nfsva, dnfsva; + struct timespec ts; int error = 0, attrflag, dattrflag; if (vp-v_mount != tdvp-v_mount) { @@ -1966,8 +1967,10 @@ nfs_link(struct vop_link_args *ap) * must care about lookup caching hit rate, so... */ if (VFSTONFS(vp-v_mount)-nm_negnametimeo != 0 - (cnp-cn_flags MAKEENTRY)) - cache_enter(tdvp, vp, cnp); + (cnp-cn_flags MAKEENTRY)) { + vfs_timestamp(ts); + cache_enter_time(tdvp, vp, cnp, ts); + } if (error NFS_ISV4(vp)) error = nfscl_maperr(cnp-cn_thread, error, (uid_t)0, (gid_t)0); @@ -1987,6 +1990,7 @@ nfs_symlink(struct vop_symlink_args *ap) struct nfsfh *nfhp; struct nfsnode *np = NULL, *dnp; struct vnode *newvp = NULL; + struct timespec ts; int error = 0, attrflag, dattrflag, ret; vap-va_type = VLNK; @@ -2030,8 +2034,10 @@ nfs_symlink(struct vop_symlink_args *ap) * must care about lookup caching hit rate, so... */ if (VFSTONFS(dvp-v_mount)-nm_negnametimeo != 0 - (cnp-cn_flags MAKEENTRY)) - cache_enter(dvp, newvp, cnp); + (cnp-cn_flags MAKEENTRY)) { + vfs_timestamp(ts); + cache_enter_time(dvp, newvp, cnp, ts); + } *ap-a_vpp = newvp; } @@ -2063,6 +2069,7 @@ nfs_mkdir(struct vop_mkdir_args *ap) struct vattr vattr; struct nfsfh *nfhp; struct nfsvattr nfsva, dnfsva; + struct timespec ts; int error = 0, attrflag, dattrflag, ret; if ((error = VOP_GETATTR(dvp, vattr, cnp-cn_cred)) != 0) @@ -2116,8 +2123,10 @@ nfs_mkdir(struct vop_mkdir_args *ap) *
Re: FreeBSD 9 recompile ports
On Fri, Jan 13, 2012 at 04:11:22PM +0200, Andriy Gapon wrote: on 13/01/2012 14:57 George Kontostanos said the following: Still the question remains regarding COMPAT_FREEBSD8 and how does this affects ports/misc/compat8x/ Looks like all the previous hints have not been clear enough. There is no direct relation between COMPAT_FREEBSD8 and misc/compat8x. COMPAT_FREEBSDX options are only needed when going from release X to release X+1 there was a change to an existing system call at the kernel-userland boundary. A side note: kernel options affect only what's in the kernel, quite obviously. misc/compatXx contains versions of shared libraries from release X that are no longer present in X+1. Additional twist is that not every change at the kernel/usermode boundary is covered with backward-compatibility shims. Recent example is the CAM ABI change, which makes libcam.so.5 from the compat8x useless. pgpuaqFjafo46.pgp Description: PGP signature
Re: CD9660/md(4)/UFS22 silly behaviour
On Sun, Jan 08, 2012 at 10:21:50PM +, Poul-Henning Kamp wrote: I'm doing som data-mining on a pile of ISO images right now. I stuck the ISOs on a UFS2 on a flash-disk for speed, and mdconfig(8)'d them so I could mount them. The traffic pattern his interesting: dT: 1.003s w: 1.000s L(q) ops/sr/s kBps ms/rw/s kBps ms/w %busy Name [...] 1733733 14661.3 0 00.0 98.2| md39 1733733 234491.3 0 00.0 93.2| da0 Notice the 1:16 ratio on kBps but 1:1 ratio on ops/s ? da0's UFS2 has 32k block-size: magic 19540119 (UFS2) timeWed Jan 4 16:41:47 2012 superblock location 65536 id [ 4f046cf5 c30697ee ] ncg 104 size19537685blocks 19228156 bsize 32768 shift 15 mask0x8000 fsize 4096shift 12 mask0xf000 [...] It looks like every 2k read from CD9660 turns into a 32k block read in the UFS filesystem, without any beneficial caching happening. Less than optimal I'd say... What is the access patern ? Is it random access, or sequential read (from the cd9660 POV) ? pgpj4AQGx2y89.pgp Description: PGP signature
Re: CD9660/md(4)/UFS22 silly behaviour
On Sun, Jan 08, 2012 at 10:31:06PM +, Poul-Henning Kamp wrote: In message 20120108222720.gn31...@deviant.kiev.zoral.com.ua, Kostik Belousov writes: What is the access patern ? Is it random access, or sequential read (from the cd9660 POV) ? Random access to files in the CD9660 filesystem, which stores files in sequential 2K blocks. Then it is reasonable. UFS reads full blocks. If you want/plan to use UFS volume for small reads exclusively, you can newfs it with much smaller block size, e.g. 8KB or even 4KB. pgp2u3DjXRzwg.pgp Description: PGP signature
Re: CD9660/md(4)/UFS22 silly behaviour
On Sun, Jan 08, 2012 at 04:09:00PM -0800, Don Lewis wrote: On 8 Jan, Poul-Henning Kamp wrote: I'm doing som data-mining on a pile of ISO images right now. I stuck the ISOs on a UFS2 on a flash-disk for speed, and mdconfig(8)'d them so I could mount them. The traffic pattern his interesting: dT: 1.003s w: 1.000s L(q) ops/sr/s kBps ms/rw/s kBps ms/w %busy Name [...] 1733733 14661.3 0 00.0 98.2| md39 1733733 234491.3 0 00.0 93.2| da0 Notice the 1:16 ratio on kBps but 1:1 ratio on ops/s ? da0's UFS2 has 32k block-size: magic 19540119 (UFS2) timeWed Jan 4 16:41:47 2012 superblock location 65536 id [ 4f046cf5 c30697ee ] ncg 104 size19537685blocks 19228156 bsize 32768 shift 15 mask0x8000 fsize 4096shift 12 mask0xf000 [...] It looks like every 2k read from CD9660 turns into a 32k block read in the UFS filesystem, without any beneficial caching happening. Probably some confusion about which filesystem layer owns the cached data. It would probably be inefficient to cache the data in both places. The best fix would probably be for CD9660 to think that the underlying device has 32Kb sectors. I discussed the issue with phk further. The reason for discarding the 30K of the read 32K block is that md(4) supplies IO_DIRECT flag for VOP_READ, and FFS avoids putting the read data into any cache. Most likely, we can implement a sysctl that would disable direct reads, at the cost of double-buffering the data. For obvious reasons, it is impossible to disable caching from the filesystem living on top of md(4) volume. pgpuHe11Cxjyv.pgp Description: PGP signature
Re: dogfooding over in clusteradm land
On Tue, Jan 03, 2012 at 12:02:22AM -0800, Don Lewis wrote: On 2 Jan, Don Lewis wrote: On 2 Jan, Don Lewis wrote: On 2 Jan, Florian Smeets wrote: This does not make a difference. I tried on 32K/4K with/without journal and on 16K/2K all exhibit the same problem. At some point during the cvs2svn conversion the sycer starts to use 100% CPU. The whole process hangs at that point sometimes for hours, from time to time it does continue doing some work, but really really slow. It's usually between revision 21 and 22, when the resulting svn file gets bigger than about 11-12Gb. At that point an ls in the target dir hangs in state ufs. I broke into ddb and ran all commands which i thought could be useful. The output is at http://tb.smeets.im/~flo/giant-ape_syncer.txt Tracing command syncer pid 9 tid 100183 td 0xfe00120e9000 cpustop_handler() at cpustop_handler+0x2b ipi_nmi_handler() at ipi_nmi_handler+0x50 trap() at trap+0x1a8 nmi_calltrap() at nmi_calltrap+0x8 --- trap 0x13, rip = 0x8082ba43, rsp = 0xff8000270fe0, rbp = 0xff88c97829a0 --- _mtx_assert() at _mtx_assert+0x13 pmap_remove_write() at pmap_remove_write+0x38 vm_object_page_remove_write() at vm_object_page_remove_write+0x1f vm_object_page_clean() at vm_object_page_clean+0x14d vfs_msync() at vfs_msync+0xf1 sync_fsync() at sync_fsync+0x12a sync_vnode() at sync_vnode+0x157 sched_sync() at sched_sync+0x1d1 fork_exit() at fork_exit+0x135 fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xff88c9782d00, rbp = 0 --- I thinks this explains why the r228838 patch seems to help the problem. Instead of an application call to msync(), you're getting bitten by the syncer doing the equivalent. I don't know why the syncer is CPU bound, though. From my understanding of the patch it only optimizes the I/O. Without the patch, I would expect that the syncer would just spend a lot of time waiting on I/O. My guess is that this is actually a vm problem. There are nested loops in vm_object_page_clean() and vm_object_page_remove_write(), so you could be doing something that's causing lots of looping in that code. Does the machine recover if you suspend cvs2svn? I think what is happening is that cvs2svn is continuing to dirty pages while the syncer is trying to sync the file. From my limited understanding of this code, it looks to me like every time cvs2svn dirties a page, it will trigger a call to vm_object_set_writeable_dirty(), which will increment object-generation. Whenever vm_object_page_clean() detects a change in the generation count, it restarts its scan of the pages associated with the object. This is probably not optimal ... Since the syncer is only trying to flush out pages that have been dirty for the last 30 seconds, I think that vm_object_set_writeable_dirty() should just make one pass through the object, ignoring generation, and then return when it is called from the syncer. That should keep vm_object_set_writeable_dirty() from looping over the object again and again if another process is actively dirtying the object. This sounds very plausible. I think that there is no sense in restarting the scan if it is requested in async mode at all. See below. Would be thrilled if this finally solves the svn2cvs issues. commit 41aaafe5e3be5387949f303b8766da64ee4a521f Author: Kostik Belousov kostik@sirion Date: Tue Jan 3 11:16:30 2012 +0200 Do not restart the scan in vm_object_page_clean() if requested mode is async. Proposed by:truckman diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c index 716916f..52fc08b 100644 --- a/sys/vm/vm_object.c +++ b/sys/vm/vm_object.c @@ -841,7 +841,8 @@ rescan: if (p-valid == 0) continue; if (vm_page_sleep_if_busy(p, TRUE, vpcwai)) { - if (object-generation != curgeneration) + if ((flags OBJPC_SYNC) != 0 + object-generation != curgeneration) goto rescan; np = vm_page_find_least(object, pi); continue; @@ -851,7 +852,8 @@ rescan: n = vm_object_page_collect_flush(object, p, pagerflags, flags, clearobjflags); - if (object-generation != curgeneration) + if ((flags OBJPC_SYNC) != 0 + object-generation != curgeneration) goto rescan; /* pgpCaQxKBZews.pgp Description: PGP signature
Re: dogfooding over in clusteradm land
On Tue, Jan 03, 2012 at 01:45:26AM -0800, Don Lewis wrote: On 3 Jan, Kostik Belousov wrote: This sounds very plausible. I think that there is no sense in restarting the scan if it is requested in async mode at all. See below. Would be thrilled if this finally solves the svn2cvs issues. commit 41aaafe5e3be5387949f303b8766da64ee4a521f Author: Kostik Belousov kostik@sirion Date: Tue Jan 3 11:16:30 2012 +0200 Do not restart the scan in vm_object_page_clean() if requested mode is async. Proposed by:truckman diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c index 716916f..52fc08b 100644 --- a/sys/vm/vm_object.c +++ b/sys/vm/vm_object.c @@ -841,7 +841,8 @@ rescan: if (p-valid == 0) continue; if (vm_page_sleep_if_busy(p, TRUE, vpcwai)) { - if (object-generation != curgeneration) + if ((flags OBJPC_SYNC) != 0 + object-generation != curgeneration) goto rescan; np = vm_page_find_least(object, pi); continue; I wonder if it would make more sense to just skip the busy pages in async mode instead of sleeping ... It would be too much weakening the guarantee of the vfs_msync(MNT_NOWAIT) to not write such pages, IMO. Busy state indeed means that the page most likely undergoing the i/o, but in case it is not, we would not write it at all. Lets see whether the change alone helps. Do you agree ? pgpsejHYZyDCu.pgp Description: PGP signature
Re: dogfooding over in clusteradm land
On Tue, Jan 03, 2012 at 02:57:17AM -0800, Don Lewis wrote: On 3 Jan, Kostik Belousov wrote: On Tue, Jan 03, 2012 at 01:45:26AM -0800, Don Lewis wrote: On 3 Jan, Kostik Belousov wrote: This sounds very plausible. I think that there is no sense in restarting the scan if it is requested in async mode at all. See below. Would be thrilled if this finally solves the svn2cvs issues. commit 41aaafe5e3be5387949f303b8766da64ee4a521f Author: Kostik Belousov kostik@sirion Date: Tue Jan 3 11:16:30 2012 +0200 Do not restart the scan in vm_object_page_clean() if requested mode is async. Proposed by: truckman diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c index 716916f..52fc08b 100644 --- a/sys/vm/vm_object.c +++ b/sys/vm/vm_object.c @@ -841,7 +841,8 @@ rescan: if (p-valid == 0) continue; if (vm_page_sleep_if_busy(p, TRUE, vpcwai)) { -if (object-generation != curgeneration) +if ((flags OBJPC_SYNC) != 0 +object-generation != curgeneration) goto rescan; np = vm_page_find_least(object, pi); continue; I wonder if it would make more sense to just skip the busy pages in async mode instead of sleeping ... It would be too much weakening the guarantee of the vfs_msync(MNT_NOWAIT) to not write such pages, IMO. Busy state indeed means that the page most likely undergoing the i/o, but in case it is not, we would not write it at all. If the original code detects a busy page, it sleeps and then continues with the next page if generation hasn't changed. If generation has changed, then it restarts the scan. With your change above, the code will skip the busy page after sleeping if it is running in async mode. It won't make another attempt to write this page because it no longer attempts to rescan. Why would it skip it ? Please note the call to vm_page_find_least() with the pindex of the busy page right after the check for generation. If a page with the pindex is still present in the object, vm_page_find_least() should return it, and vm_object_page_clean() should make another attempt at processing it. Am I missing something ? My suggestion just omits the sleep in this particular case. The syncer should write the page the next time it runs, unless we're particularly unlucky ... Lets see whether the change alone helps. Do you agree ? Your patch is definitely worth trying as-is. My latest suggestion is probably a minor additional optimization. pgpe8aTZBd2Ul.pgp Description: PGP signature
Re: dogfooding over in clusteradm land
On Mon, Jan 02, 2012 at 12:47:03PM -0800, Don Lewis wrote: On 2 Jan, Florian Smeets wrote: On 29.12.11 01:04, Kirk McKusick wrote: Rather than changing BKVASIZE, I would try running the cvs2svn conversion on a 16K/2K filesystem and see if that sorts out the problem. If it does, it tells us that doubling the main block size and reducing the number of buffers by half is the problem. If that is the problem, then we will have to increase the KVM allocated to the buffer cache. This does not make a difference. I tried on 32K/4K with/without journal and on 16K/2K all exhibit the same problem. At some point during the cvs2svn conversion the sycer starts to use 100% CPU. The whole process hangs at that point sometimes for hours, from time to time it does continue doing some work, but really really slow. It's usually between revision 21 and 22, when the resulting svn file gets bigger than about 11-12Gb. At that point an ls in the target dir hangs in state ufs. I broke into ddb and ran all commands which i thought could be useful. The output is at http://tb.smeets.im/~flo/giant-ape_syncer.txt Tracing command syncer pid 9 tid 100183 td 0xfe00120e9000 cpustop_handler() at cpustop_handler+0x2b ipi_nmi_handler() at ipi_nmi_handler+0x50 trap() at trap+0x1a8 nmi_calltrap() at nmi_calltrap+0x8 --- trap 0x13, rip = 0x8082ba43, rsp = 0xff8000270fe0, rbp = 0xff88c97829a0 --- _mtx_assert() at _mtx_assert+0x13 pmap_remove_write() at pmap_remove_write+0x38 vm_object_page_remove_write() at vm_object_page_remove_write+0x1f vm_object_page_clean() at vm_object_page_clean+0x14d vfs_msync() at vfs_msync+0xf1 sync_fsync() at sync_fsync+0x12a sync_vnode() at sync_vnode+0x157 sched_sync() at sched_sync+0x1d1 fork_exit() at fork_exit+0x135 fork_trampoline() at fork_trampoline+0xe --- trap 0, rip = 0, rsp = 0xff88c9782d00, rbp = 0 --- I thinks this explains why the r228838 patch seems to help the problem. Instead of an application call to msync(), you're getting bitten by the syncer doing the equivalent. I don't know why the syncer is CPU bound, though. From my understanding of the patch it only optimizes the I/O. Without the patch, I would expect that the syncer would just spend a lot of time waiting on I/O. My guess is that this is actually a vm problem. There are nested loops in vm_object_page_clean() and vm_object_page_remove_write(), so you could be doing something that's causing lots of looping in that code. r228838 allows the system to skip 50-70% of the code when initiating a write of the UFS file page, due to async clustering. The system has to maintain 75% less amount of writes in progress. I think that ls is hanging because it's stumbling across the vnode that the syncer has locked. This is the only reasonable explanation. Low-tech profile is to periodically break out into ddb and do backtrace for the syncer thread. More advanced techniques is to use dtrace or normal profiling. pgpqweDffT9HY.pgp Description: PGP signature
Re: lang/lua: /usr/bin/ld: lapi.o: relocation R_X86_64_32 against `luaO_nilobject_' can not be used when making a shared object; recompile with -fPIC
On Thu, Dec 29, 2011 at 12:19:40PM +0100, O. Hartmann wrote: Am 12/29/11 11:48, schrieb Rainer Hurling: On 28.12.2011 19:31 (UTC+1), Kostik Belousov wrote: On Wed, Dec 28, 2011 at 07:21:00PM +0100, O. Hartmann wrote: Am 12/28/11 19:10, schrieb Ed Schouten: * Rainer Hurlingrhur...@gwdg.de, 20111228 17:31: error: macro _Static_assert passed 3 arguments, but takes just 2 In file included from /usr/ports/lang/gcc46/work/gcc-4.6-20111209/libstdc++-v3/include/precompiled/stdc++.h:103:0: Hmmm... This seems to apply to my changes. I will look into this tomorrow. Thanks for the report! Be aware that the error produced by the linker I mentioned in the initial post occurs on FreeBSD 10 as well as FreeBSD 9.0. I already filed a PR about the problem of a non compiling lang/gcc46 today (ports/163672: lang/gcc46: make failed for lang/gcc46). For test puproses, I rebuild gcc46 on our FreeBSD 9.0 boxes - without any problem. I guess, the commit r228902 has been done to FreeBSD 10.0 and not 9.0. Obviously, linker error during the compilation of third-party software has nothing to do with compiler error occuring when building gcc. Do people ever read the texts of the messages ? Kostik, probably you are right. I had read the messages, but there are some strange errors with gcc46 on head for two days now, which leaded me in the wrong direction. So sorry for erroneously 'hijacking' this thread with another problem most certain only existing in head. I found another trail, which hopefully is more usefull for solving the problem Oliver described. Whe I try to build lang/lua I get this error: [..snip..] cc -o liblua.so -O2 -fno-strict-aliasing -pipe -msse3 -Wall -DLUA_USE_LINUX -shared -Wl,-soname=liblua-5.1.so.1 lapi.o lcode.o ldebug.o ldo.o ldump.o lfunc.o lgc.o llex.o lmem.o lobject.o lopcodes.o lparser.o lstate.o lstring.o ltable.o ltm.o lundump.o lvm.o lzio.o lauxlib.o lbaselib.o ldblib.o liolib.o lmathlib.o loslib.o ltablib.o lstrlib.o loadlib.o linit.o ar rcu liblua.a lapi.o lcode.o ldebug.o ldo.o ldump.o lfunc.o lgc.o llex.o lmem.o lobject.o lopcodes.o lparser.o lstate.o lstring.o ltable.o ltm.o lundump.o lvm.o lzio.o lauxlib.o lbaselib.o ldblib.o liolib.o lmathlib.o loslib.o ltablib.o lstrlib.o loadlib.o linit.o ranlib liblua.a cc -o lua lua.o liblua.a -lm -Wl,-E -lreadline cc -o luac luac.o print.o liblua.a -lm -Wl,-E -lreadline /usr/bin/ld: lapi.o: relocation R_X86_64_32 against `luaO_nilobject_' can not be used when making a shared object; recompile with -fPIC lapi.o: could not read symbols: Bad value *** Error code 1 It also gives a linker error, almost the same relocation is named. This does only happen with option '-msse3' enabled in /etc/make.conf: CFLAGS= -O2 -fno-strict-aliasing -pipe -msse3 Using CLFAGS without -msse3 (default) works well: CFLAGS= -O2 -fno-strict-aliasing -pipe The systems processor, were this happens, is a CPU: AMD Phenom(tm) II X6 1090T Processor (3214.32-MHz K8-class CPU) Origin = AuthenticAMD Id = 0x100fa0 Family = 10 Model = a Stepping = 0 Features=0x178bfbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT Features2=0x802009SSE3,MON,CX16,POPCNT AMD Features=0xee500800SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow! AMD Features2=0x37ffLAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT TSC: P-state invariant, performance statistics FreeBSD 10-CURRENT (amd64) r228920 In hope of a more belonging posting, Rainer ___ WOW! I tried lang/lua on FreeBSD 9.0-PRE and see exactly this error: clang -O2 -fno-strict-aliasing -pipe -march=core2 -Wall -DLUA_USE_LINUX -c print.c clang -o liblua.so -O2 -fno-strict-aliasing -pipe -march=core2 -Wall -DLUA_USE_LINUX -shared -Wl,-soname=liblua-5.1.so.1 lapi.o lcode.o ldebug.o ldo.o ldump.o lfunc.o lgc.o llex.o lmem.o lobject.o lopcodes.o lparser.o lstate.o lstring.o ltable.o ltm.o lundump.o lvm.o lzio.o lauxlib.o lbaselib.o ldblib.o liolib.o lmathlib.o loslib.o ltablib.o lstrlib.o loadlib.o linit.o /usr/bin/ld: lapi.o: relocation R_X86_64_32 against `luaO_nilobject_' can not be used when making a shared object; recompile with -fPIC lapi.o: could not read symbols: Bad value clang: error: linker command failed with exit code 1 (use -v to see invocation) *** Error code 1 ar rcu liblua.a lapi.o lcode.o ldebug.o ldo.o ldump.o lfunc.o lgc.o llex.o lmem.o lobject.o lopcodes.o lparser.o lstate.o lstring.o ltable.o ltm.o lundump.o lvm.o lzio.o lauxlib.o lbaselib.o ldblib.o liolib.o lmathlib.o loslib.o ltablib.o lstrlib.o loadlib.o linit.o ranlib liblua.a 1 error *** Error code 2 1 error *** Error code 2 1 error *** Error code 1 Stop in /usr/ports/lang/lua. === make failed for lang/lua
Re: /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a: could not read symbols: Bad value
On Wed, Dec 28, 2011 at 11:48:28AM +0100, O. Hartmann wrote: Hello out here. I run into a problem since one of the last portupdates and I do not know whether this has to do with binutils or gcc46 or even FreeBSD 9.0/10.0 AMD64. Background: We use a scientific graphical toolset for planetary research called ISIS3, which is provided by the USGS. We patched ISIS3 to run on FreeBSD 8/9/10 so far and it ran well with FreeBSD 8.2-STABLE and 9.0-PRE a couple of weeks ago. On all of my boxes, I do frequently a portupgrade, so I saw binutils got bumped up and gcc 4.6 is also getting really frequently changed these days. After a some portupdates within the last weeks, I just decided to compile ISIS3 again to keep it fresh and on track, but it won't compile anymore. On all FreeBSD 9.0-PRERELEASE and FreeBSD 10.0-CURRENT (all AMD64 and CLANG built) I receive in some subfolders containing sources the follwoing error: [...] Adding API object [UniqueIOCachingAlgorithm] Adding API object [UniversalGroundMap] Adding API object [UserInterface] Adding API object [VariableLineScanCameraDetectorMap] Adding API object [VecFilter] Adding API object [WorldMapper] Adding API object [iException] Adding API object [iString] Adding API object [iTime] Working on Package [mex] (11:30:15) Adding API object [HrscCamera] /usr/local/bin/ld: /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a(functexcept.o): relocation R_X86_64_32 against `std::bad_exception::~bad_exception()' can not be used when making a shared object; recompile with -fPIC /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a: could not read symbols: Bad value collect2: ld returned 1 exit status gmake[5]: *** [plugin] Error 1 cp: libHrscCamera.so: No such file or directory gmake[4]: *** [install] Error 1 The error is completely clear as it is: the build tries to link static library libstdc++.so into shared object. This is not supported. pgpqYjyBtTIwA.pgp Description: PGP signature
Re: /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a: could not read symbols: Bad value
On Wed, Dec 28, 2011 at 03:10:12PM +0100, O. Hartmann wrote: Am 12/28/11 14:58, schrieb Kostik Belousov: On Wed, Dec 28, 2011 at 11:48:28AM +0100, O. Hartmann wrote: Hello out here. I run into a problem since one of the last portupdates and I do not know whether this has to do with binutils or gcc46 or even FreeBSD 9.0/10.0 AMD64. Background: We use a scientific graphical toolset for planetary research called ISIS3, which is provided by the USGS. We patched ISIS3 to run on FreeBSD 8/9/10 so far and it ran well with FreeBSD 8.2-STABLE and 9.0-PRE a couple of weeks ago. On all of my boxes, I do frequently a portupgrade, so I saw binutils got bumped up and gcc 4.6 is also getting really frequently changed these days. After a some portupdates within the last weeks, I just decided to compile ISIS3 again to keep it fresh and on track, but it won't compile anymore. On all FreeBSD 9.0-PRERELEASE and FreeBSD 10.0-CURRENT (all AMD64 and CLANG built) I receive in some subfolders containing sources the follwoing error: [...] Adding API object [UniqueIOCachingAlgorithm] Adding API object [UniversalGroundMap] Adding API object [UserInterface] Adding API object [VariableLineScanCameraDetectorMap] Adding API object [VecFilter] Adding API object [WorldMapper] Adding API object [iException] Adding API object [iString] Adding API object [iTime] Working on Package [mex] (11:30:15) Adding API object [HrscCamera] /usr/local/bin/ld: /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a(functexcept.o): relocation R_X86_64_32 against `std::bad_exception::~bad_exception()' can not be used when making a shared object; recompile with -fPIC /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a: could not read symbols: Bad value collect2: ld returned 1 exit status gmake[5]: *** [plugin] Error 1 cp: libHrscCamera.so: No such file or directory gmake[4]: *** [install] Error 1 The error is completely clear as it is: the build tries to link static library libstdc++.so into shared object. This is not supported. Thanks, Kostik, for the fast response. The error isn't so clear to me, sorry. I thought libstdc++.a is the static library and it is taken to be referenced/compiled into a shared Linked in. object created by the application I try to compile. Right, and this is not supported. Code linked into shared object must be compiled PIC. An .a library usually does not contain objects compiled by PIC, ld just dutifully reported back. I'm much more confused now, since I thought the last time I compiled that piece of software, I never got any error like that. Well, clang fails with some obscure errors on the code itself and I'm unwilling to correct them, I'll try the legacy gcc 4.2.1 and will report what's happening. It might have worked by accident (because libstdc++.a objects referenced during the link did not carried unsupported relocations), or, much more likely, the build system has changed and started doing stupid things. It must not link static libraries into shared objects. You should examine why it does this, and fix it. Changing compilers is just wasting a time. pgpkHA6aHIGk1.pgp Description: PGP signature
Re: /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a: could not read symbols: Bad value
On Wed, Dec 28, 2011 at 07:21:00PM +0100, O. Hartmann wrote: Am 12/28/11 19:10, schrieb Ed Schouten: * Rainer Hurling rhur...@gwdg.de, 20111228 17:31: error: macro _Static_assert passed 3 arguments, but takes just 2 In file included from /usr/ports/lang/gcc46/work/gcc-4.6-20111209/libstdc++-v3/include/precompiled/stdc++.h:103:0: Hmmm... This seems to apply to my changes. I will look into this tomorrow. Thanks for the report! Be aware that the error produced by the linker I mentioned in the initial post occurs on FreeBSD 10 as well as FreeBSD 9.0. I already filed a PR about the problem of a non compiling lang/gcc46 today (ports/163672: lang/gcc46: make failed for lang/gcc46). For test puproses, I rebuild gcc46 on our FreeBSD 9.0 boxes - without any problem. I guess, the commit r228902 has been done to FreeBSD 10.0 and not 9.0. Obviously, linker error during the compilation of third-party software has nothing to do with compiler error occuring when building gcc. Do people ever read the texts of the messages ? pgpXZQ7ftuzPn.pgp Description: PGP signature
Re: SU+J systems do not fsck themselves
On Wed, Dec 28, 2011 at 11:14:19AM -0800, m...@freebsd.org wrote: SU doesn't care about write ordering, as long as everything before a BIO_FLUSH is really flushed by the time the BIO_FLUSH is acknowledged. No. SU and SU+J only require that write completed notification is issued when geom/driver layer guarantees that the block content is written to the stable storage. SU does not depend on non-reordering of writes in any way, as well as it does not issue BIO_FLUSH. pgpkZdtAOt9Lt.pgp Description: PGP signature
Re: [patch] Cleaning up amd64 kernel optimization options
On Fri, Dec 23, 2011 at 06:03:42PM +0100, Dimitry Andric wrote: On 2011-12-23 17:00, Alexander Best wrote: ... Back in the 7.x days, I ran into some code that wasn't easily to debug because the compiler optimized things out with -O2 by inlining and otherwise shifting around code, so setting breakpoints in gdb became difficult. So from that point on I've gotten into the habit of doing -O explicitly in make.conf if DEBUG_FLAGS was specified. Just a thought.. I still leave -O2 in, but what I do is this: make DEBUG_FLAGS=-g -fno-inline would making -O2 -fno-inline the default flags introduce any major slowdown? Not major, but a minor slowdown is quite possible, especially on arches like x86, where call overhead is relatively high, and code caches are relatively large. Anyway, I'd rather get the thread back to my original patch instead of messing around with the default flags for release, which have been -O2 for a long time now. If people want to override those for specific reasons, they can always set COPTFLAGS or DEBUG_FLAGS manually, like John shows. The only thing my patch makes sure of, is that amd64 does the same thing as all other arches, e.g.: compile with a low optimization settings for debug (-O, which is equivalent to -O1), compile with arch-specific high optimization settings for release (-O2 plus whatever is required for the arch, or lower if optimization breaks things). Release is built with -g for long time, this is where the symbol files in /boot/kernel comes from. pgp5KdmN3MSqu.pgp Description: PGP signature
Re: [patch] Cleaning up amd64 kernel optimization options
On Fri, Dec 23, 2011 at 08:04:32PM +0100, Dimitry Andric wrote: On 2011-12-23 18:55, Kostik Belousov wrote: On Fri, Dec 23, 2011 at 06:03:42PM +0100, Dimitry Andric wrote: ... The only thing my patch makes sure of, is that amd64 does the same thing as all other arches, e.g.: compile with a low optimization settings for debug (-O, which is equivalent to -O1), compile with arch-specific high optimization settings for release (-O2 plus whatever is required for the arch, or lower if optimization breaks things). Release is built with -g for long time, this is where the symbol files in /boot/kernel comes from. Ah, that is done via 'makeoptions DEBUG=-g' in the kernel configuration file, right? I didn't realize that was kept in for a release. But even in that case, amd64 is somehow different from the other arches, which all get compiled with -O instead. Yes. If people prefer that to stay as it is, I'll change the diff so only -frename-registers gets removed when clang is used, as clang does not support this flag. This question cannot be answered without measurement. I think that even the 'default' benchmark of buildworld over -O and -O2 kernels can be useful to continue the discussion. pgpCNEdA1st9j.pgp Description: PGP signature
Re: extattr_set_*() return type
On Wed, Dec 21, 2011 at 10:31:11AM -0500, John Baldwin wrote: On Tuesday, December 20, 2011 5:18:58 pm m...@freebsd.org wrote: On Tue, Dec 20, 2011 at 1:49 PM, John Baldwin j...@freebsd.org wrote: Hmm, if these functions are expected to operate like 'write(2)' and are supposed to return the number of bytes written, shouldn't their return value be 'ssize_t' instead of 'int'? It looks like the system calls themselves already do the right thing in setting td_retval[] (they assign a ssize_t to it and td_retval[0] can hold a ssize_t on all of our current platforms). It would seem that the only change would be to the header and probably syscalls.master. I guess this would require a symver bump to fix though. An extended attribute larger than 2GB is a programming abuse, though. Technically int may not be 32 bits but it is on all supported platforms now. Today it is an abuse. In the 90's a 64-bit off_t was considered an abuse by some. :) The type should match the documented behavior. On OS X the set operation doesn't return a size but instead returns a simple success/failure (0 or -1) for which an int is appropriate. However, the FreeBSD API documents that it operates like write and consumes the buffer. Note that the size of the buffer passed to the 'set' and 'get' operations is a size_t, not an int, and the 'get' operations already return a ssize_t, not an int. Note that read(2)/write(2) do return int. I still have WIP patch to fix this, but after some conversations with Bruce I am not sure it is worth finishing. pgpGVG18c7Kk7.pgp Description: PGP signature
Re: extattr_set_*() return type
On Wed, Dec 21, 2011 at 12:25:18PM -0500, John Baldwin wrote: On Wednesday, December 21, 2011 11:13:10 am Kostik Belousov wrote: On Wed, Dec 21, 2011 at 10:31:11AM -0500, John Baldwin wrote: On Tuesday, December 20, 2011 5:18:58 pm m...@freebsd.org wrote: On Tue, Dec 20, 2011 at 1:49 PM, John Baldwin j...@freebsd.org wrote: Hmm, if these functions are expected to operate like 'write(2)' and are supposed to return the number of bytes written, shouldn't their return value be 'ssize_t' instead of 'int'? It looks like the system calls themselves already do the right thing in setting td_retval[] (they assign a ssize_t to it and td_retval[0] can hold a ssize_t on all of our current platforms). It would seem that the only change would be to the header and probably syscalls.master. I guess this would require a symver bump to fix though. An extended attribute larger than 2GB is a programming abuse, though. Technically int may not be 32 bits but it is on all supported platforms now. Today it is an abuse. In the 90's a 64-bit off_t was considered an abuse by some. :) The type should match the documented behavior. On OS X the set operation doesn't return a size but instead returns a simple success/failure (0 or -1) for which an int is appropriate. However, the FreeBSD API documents that it operates like write and consumes the buffer. Note that the size of the buffer passed to the 'set' and 'get' operations is a size_t, not an int, and the 'get' operations already return a ssize_t, not an int. Note that read(2)/write(2) do return int. I still have WIP patch to fix this, but after some conversations with Bruce I am not sure it is worth finishing. The manpages and /usr/include/unistd.h claim they return ssize_t. Is this related to the changes to make uio_resid a size_t (I thought that went into the tree)? If the problem is that the values read/write return may fall into the range of only an int even on 64-bit platforms, that is different from the return type which is part of the ABI. Yes, it is related. The type change for uio was done in advance. Take a look at the first statement of sys_read() and sys_write(): if (uap-nbyte INT_MAX) return (EINVAL); and at the copyinio(), which is used by scatter/gather versions of i/o syscalls to copy in uiovec: if (iov-iov_len INT_MAX - uio-uio_resid) { free(uio, M_IOV); return (EINVAL); pgp0QFlBvfWRz.pgp Description: PGP signature
Re: calling all fs experts
On Sat, Dec 10, 2011 at 05:42:01PM -0800, Maksim Yevmenkin wrote: Hello, i have a question for fs wizards. suppose i can persuade modern spinning disk to do large reads (say 512K to 1M) at a time. also, suppose file system on such modern spinning drive is used to store large files (tens to hundreds of megabytes). is there any way i can tweak the file system parameters (block size, layout, etc) to help it to get as close to disk's sequential read rate as possible. I understand that i will not be able to get 100MB/sec single client sequential read rate, but, can i get it into sustained 40-50MB/sec rate? also, can i reduce performance impact caused by small reads such as directory access etc. If you wanted to get responses from experts only, sorry in advance. The fs (AKA UFS) uses clustering provided by the block cache. The clustering code, mainly located in the kern/vfs_cluster.c, coalesces sequence of reads or writes that are targeting the consequtive blocks, into single physical read or write of the maximal size of MAXPHYS. Current definition of MAXPHYS is 128KB. Clustering allows filesystem to improve the layout of the files by calling VOP_REALLOCBLKS() to redo the allocation to make the writing sequence of blocks sequential if it is not. Even if file is not layed out ideally, or the i/o pattern is random, most writes scheduled are asynchronous, and for reads, the system tries to schedule read-aheads for some limited number of blocks. This allows the lower layers, i.e. geom and disk drivers, to optimize the i/o queue to coalesce requests that are consequitive on disk, but not on the queue. BTW, some time ago I was interested in the effect on the fragmentation on UFS, due to some semi-abandoned patch, which could make the fragmentation worse. I wrote the tool that calculated the percentage of non-consequtive spots in the whole filesystem. Apparently, even under the hard load consisting of writing a lot of files under the megabytes in size, UFS managed to keep the number of spots under 2-3% on sufficiently free volume. pgpg2apEuMeNy.pgp Description: PGP signature
Re: Adding bool, true, and false to the kernel
On Thu, Dec 08, 2011 at 04:36:58PM -0800, Matthew Fleming wrote: This code should be MFC-able to stable/9 after 9.0 is released. Yes, please make sure to merge it to stable/9. pgpJdscbva0LS.pgp Description: PGP signature
Re: running old binaries on -current
On Sun, Dec 04, 2011 at 08:54:27PM +0100, Yamagi Burmeister wrote: On Fri, 2 Jul 2010 21:44:26 +0300 Kostik Belousov kostik...@gmail.com wrote: On Fri, Jul 02, 2010 at 11:12:13AM -0700, Julian Elischer wrote: every now and then, for fun I run up a chroot of freebsd 1.1. or 1.0 under a chroot. Usually hillarity ensues with teh 15 second kernel compile and the 4 minute make world. in -current I can't do that any more.. any binary just exits with 'Abort'. I think I last tried it in 7.0 or there abouts. I have options COMPAT_AOUT and COMPAT_FREEBSD4 through COMPAT_FREEBSD7 does anyone else have any ideas as to what may be needed? I vaguely remember another option but I am not seeing it at the moment. For those of you who do not remember, 1.0 had a.out static binaries only. Can you ktrace/kdump the run attempt, I suspect that abort is sent by image activator. Also, please put some static binary somewhere to download. Hi, digging around I found one of the first programs ever written. It's an static aout binary from 1994 or so, running it on FreeBSD 8.2 shows exactly this problem. Some more extensive tests narrowed this problem down to FreeBSD 1.x binaries, they're all broken. FreeBSD 2 aout binaries are still working. Since Google found this rather old thread I decided to provide the data your requested, but I do not insist on a fix. I guess that nobody will try to run FreeBSD 1.x binaries these day... So if you want to spend some work on this I'll test patches but if you decide to do nothing it's okay. This is the /bin/sh of FreeBSD 1.0 (taken from the official installation cd image f?r i386): % file sh sh: VAX demand paged pure executable % ./sh Abort The ktrace / kdump: 1065 ktrace RET ktrace 0 1065 ktrace CALL execve(0xbfbfee1b,0xbfbfecf8,0xbfbfed00) 1065 ktrace NAMI ./sh I've uploaded the binary to: http://deponie.yamagi.org/freebsd/tmp/sh_freebsd1 Try to set sysctl security.bsd.map_at_zero to 1. pgpg2AX8U5A4g.pgp Description: PGP signature
Re: r227487 breaks C++ programs that use __isthreaded
On Thu, Dec 01, 2011 at 04:23:11PM -0500, David Schultz wrote: On Thu, Dec 01, 2011, George Liaskos wrote: Hello One example is Google's tcmalloc [1], is this behaviour intended? [1] http://code.google.com/p/google-perftools/source/browse/trunk/src/maybe_threads.cc This code uses an unportable workaround for a bug that I believe was fixed in r227999. Using internal names starting with a double underscore isn't supported. Separately, I'm still hoping that the namespace polution introduced in r227487 gets fixed... I handled the pthread_once mess in libunwind without relying on __isthreaded, but I indeed only needed once working. http://git.savannah.gnu.org/cgit/libunwind.git/commit/?id=08077a4962c4e606598f9f0e54b515b3c882be10 pgpg3qjpoFMCo.pgp Description: PGP signature
Re: [patch] turning devctl into a multiple openable device
On Fri, Dec 02, 2011 at 12:17:21AM +0100, Baptiste Daroussin wrote: On Wed, Nov 30, 2011 at 05:20:17PM +0100, Olivier Houchard wrote: On Wed, Nov 30, 2011 at 06:04:50PM +0200, Kostik Belousov wrote: I wonder why the waiting_threads stuff is needed at all. The cv could be woken up unconditionally everytime. What is the reason for the cv_wait call in cdevpriv data destructor ? You cannot have a thread doing e.g. read on the file descriptor while destructor is run. What will prevent you from having a thread stuck in read(), while an another one close() the fd ? Nothing, but file reference count goes to zero only after the thread stuck in read is unstuck. Cdevpriv destructor is run only when file reference count becomes zero, i.e. there can be no any accessing threads, and new accesses are impossible since file descriptors also own references on the file. Right, I was a bit confused, this part can be removed. Regards, Olivier Here is a new version of the patch mostly reworked by Olivier, It doesn't duplicate anymore the devq, and fix all that have been spotted here previously. http://people.freebsd.org/~bapt/devctl_multi_open.diff bonus, it removes the needless giant lock I do not see a need in the use of refcount KPI, since it seems that all manipulations of n-refcount happen under global_devctl.mtx. Am I missed the place ? Releasing refcount on dev_event_info before using it in devread() allows other thread to free the memory while current thread still uses it. Stylish notes: mtx member in struct global_devctl has weird indentation; please move declaration of n2 into the declaration block of devdtor(); devread():should_free is initialized at the declaration site; devctl_queue_data_f() has stray empty line ater the while { loop line, while need blank line at the start of devctl_queue_data() is removed. pgp5tdbw3yh7H.pgp Description: PGP signature
Re: Incorrect tag= in /usr/src/share/examples/cvsup/sta(ble|ndard) in 9.0-PRERELEASE?
On Thu, Dec 01, 2011 at 12:12:18PM +0300, Sergey Kandaurov wrote: On 1 December 2011 10:20, Milan Obuch freebsd-curr...@dino.sk wrote: On Tue, 29 Nov 2011 19:22:39 +0300 Sergey Kandaurov pluk...@gmail.com wrote: On 29 November 2011 20:16, Maxim Khitrov m...@mxcrypt.com wrote: On Tue, Nov 29, 2011 at 10:30 AM, Sergey Kandaurov pluk...@gmail.com wrote: On 26 November 2011 11:44, Milan Obuch freebsd-curr...@dino.sk wrote: Hi, I am playing a bit with 9.0-PRERELEASE compiling it from source updated via csup. In both example files there is line specifying what to csup *default release=cvs tag=RELENG_8 which is incorrect, I think. It is convenient for me to issue just csup -h cvsup.freebsd.sk /usr/share/examples/cvsup/stable-supfile to update full sources without need to create any cvsup config file, however in system installed from 9.0 snapshot (maybe two weeks old) this file points to version 8 files, so I need to correct it for 9.0-PRERELEASE to not accidentally download older version sources. The same is also true after upgrade from source - make installworld install example files pointing to older version... Is it something I do not know about or is it an oversight? I think this line should already be changed to new tag... *default release=cvs tag=RELENG_9 Hi. Fixed. Thanks for your report. Now cvs tag points to RELENG_9 in 9.x sources. Should standard-supfile also be updated to point to RELENG_9_0? I'm using csup with tag=RELENG_9_0 and standard-supfile still points to HEAD. Yep, sure. I just sent a request to the Release Engineering Team. It works for me now as expected, thanks. Anyway, there is a question what the difference between stable-supfile and standard-supfile should be. I looked in my local csupped sources, they are the same in 6-STABLE (OK, some history here), 7-STABLE, 8-STABLE and 9-STABLE. Are they expected to be used differently? In STABLE branches standard-supfile and stable-supfile are used to have the same cvs tag. FYI, compare how it is done in RELEASE branches. And, second one - what about CURRENT? In stable-supfile I see tag=RELENG_9 which is not quite clear, but just for some pedantry... I use standard-supfile for CURRENT, so this is not an issue for me either. To my knowledge, in CURRENT a standard-supfile's cvs tag should be read as the latest (i.e. the most recently created) stable branch. Could the supfiles be generated from some value in newvers.sh ? pgpWdpj5GwCx0.pgp Description: PGP signature
Re: [patch] turning devctl into a multiple openable device
On Wed, Nov 30, 2011 at 10:05:11AM -0500, John Baldwin wrote: On Wednesday, November 30, 2011 7:43:20 am Baptiste Daroussin wrote: Hi all, With the help of cognet, I wrote a patch to turn devctl into a multiple openable device, that mean that it will allow to open /dev/devctl in multiple programs, for example hald and everythings that want to receive notification from the device won't need to depend on haveing devd running. here is the patch: http://people.freebsd.org/~bapt/devctl_multi_open.diff Shouldn't devctl_queue_data_f() use the requested malloc() flags instead of hardcoding M_NOWAIT? This is an obvious fallback of holding mutex around the call to per_devctl_queue_data_f(), which caused the author a trouble to use M_WAITOK. Having n readers causes the patch to queue each message n times, that looks like a waste. I wonder why the waiting_threads stuff is needed at all. The cv could be woken up unconditionally everytime. What is the reason for the cv_wait call in cdevpriv data destructor ? You cannot have a thread doing e.g. read on the file descriptor while destructor is run. Also, I know that it was an intentional design decisison by Warner to have the multiplexing of devctl data done in userland via devd rather than in the kernel. (I think he envisioned devd providing a UNIX domain socket or some such for other daemons to use to listen to events.) Have you asked him about this change? And I fully agree that doing multiplexing in user mode is the right approach. Not least because you could apply some advanced access control and provide filtering for the consumers. pgpL1GaOrgCj6.pgp Description: PGP signature
Re: [patch] turning devctl into a multiple openable device
On Wed, Nov 30, 2011 at 04:55:21PM +0100, Olivier Houchard wrote: On Wed, Nov 30, 2011 at 05:46:36PM +0200, Kostik Belousov wrote: On Wed, Nov 30, 2011 at 10:05:11AM -0500, John Baldwin wrote: On Wednesday, November 30, 2011 7:43:20 am Baptiste Daroussin wrote: Hi all, With the help of cognet, I wrote a patch to turn devctl into a multiple openable device, that mean that it will allow to open /dev/devctl in multiple programs, for example hald and everythings that want to receive notification from the device won't need to depend on haveing devd running. here is the patch: http://people.freebsd.org/~bapt/devctl_multi_open.diff Shouldn't devctl_queue_data_f() use the requested malloc() flags instead of hardcoding M_NOWAIT? This is an obvious fallback of holding mutex around the call to per_devctl_queue_data_f(), which caused the author a trouble to use M_WAITOK. Having n readers causes the patch to queue each message n times, that looks like a waste. Queuing the message only one time would require to somehow keep a state, to know which thread read which message, and figuring out when to free a message can be an headache. Given I don't think they'll be a lot of readers, I'm not sure it's worth the trouble. I wonder why the waiting_threads stuff is needed at all. The cv could be woken up unconditionally everytime. What is the reason for the cv_wait call in cdevpriv data destructor ? You cannot have a thread doing e.g. read on the file descriptor while destructor is run. What will prevent you from having a thread stuck in read(), while an another one close() the fd ? Nothing, but file reference count goes to zero only after the thread stuck in read is unstuck. Cdevpriv destructor is run only when file reference count becomes zero, i.e. there can be no any accessing threads, and new accesses are impossible since file descriptors also own references on the file. pgpc9tuCURiWe.pgp Description: PGP signature
Re: panic: handle_written_inodeblock: live inodedep
On Fri, Nov 25, 2011 at 09:08:27PM +0400, Gleb Smirnoff wrote: Hi! Running HEAD with KMS patches from kib@, a 15th November snapshot build from kms branch from ~kib/deviant2 repository. The panic message and backtrace look unrelated to KMS, so, I think my kernel can be considered a head from 15-th of November. Panic happened a couple of seconds after system finished booting up to login prompt. I didn't login yet, and X server hadn't been started. Backtrace from gdb: #9 0x8049f5d0 in panic (fmt=Variable fmt is not available. ) at /usr/home/glebius/src/deviant2/sys/kern/kern_shutdown.c:600 #10 0x806a3b19 in softdep_disk_write_complete (bp=0xff80eff76500) at /usr/home/glebius/src/deviant2/sys/ufs/ffs/ffs_softdep.c:11020 #11 0x8051d97c in bufdone_finish (bp=0xff80eff76500) at buf.h:418 ---Type return to continue, or q return to quit--- #12 0x8051dcb8 in bufdone (bp=0xff80eff76500) at /usr/home/glebius/src/deviant2/sys/kern/vfs_bio.c:3328 #13 0x8043ac96 in g_io_schedule_up (tp=Variable tp is not available. ) at /usr/home/glebius/src/deviant2/sys/geom/geom_io.c:679 #14 0x8043b21c in g_up_procbody (arg=Variable arg is not available. ) at /usr/home/glebius/src/deviant2/sys/geom/geom_kern.c:97 #15 0x80472d5f in fork_exit ( callout=0x8043b1c0 g_up_procbody, arg=0x0, frame=0xff8000264c50) at /usr/home/glebius/src/deviant2/sys/kern/kern_fork.c:995 #16 0x806f1a9e in fork_trampoline () at /usr/home/glebius/src/deviant2/sys/amd64/amd64/exception.S:602 I'm running SUJ on all partitions. On the next boot after panic fsck failed to run w/o -y option. I can provide any additional information or share core file. Did you have i/o errors before the panic ? pgpy6ZMCZQDZk.pgp Description: PGP signature
Re: RLIMIT_DATA and malloc(3) use of mmap(2)
On Thu, Nov 24, 2011 at 09:38:19PM +0400, Maxim Konovalov wrote: Hi Kostik, On Wed, 23 Nov 2011, 11:22+0400, Maxim Konovalov wrote: [...] Anyway, the patch needs testers before I will push it forward. [igor's email was corrected] We will test it in out environment and let you know. It seems I don't understand how it works. Here is a test program: http://maxim.int.ru/stuff/malloc_test.c.txt It allocates successfully 64x1MB chunks by malloc(), reallocf() and realloc() with the following command line: MALLOC_OPTIONS=JM limits -d 10m ./malloc_test When we add L flag to the MALLOC_OPTIONS it starts to work strange: It allocates 64MB via malloc() but fails to allocate more than 3MB by realloc() and reallocf(). The non-failing 64MB allocation is easily explained by a bug. I forgot to account for the case when existing map entry was expanded, instead of new entry created due to mmap(2). The other part, in particular, the failure after 3MB, is in fact the correct behaviour. Jemalloc() caches allocation of the pages, and it allocates more then asked in the request. ktrace(1) shows the whole history. Malloc() first allocated 4MB for the needs of libc etc. Then, it allocated 4MB chunk which was used for satisfying the requests of 1-3MB. When the 4MB request came in, the allocation for 8MB was attempted, and failed, since 4MB (libc etc) + 8MB = 12MB data limit. More funny, the result varies from time to time: I have no explanation for this, and cannot reproduce the issue. Look at the ktrace. Overall, the test is quite curious but absolutely unrealistic. Fixed patch is available at http://people.freebsd.org/~kib/misc/map_datalimit.2.patch pgpKGquyhMOov.pgp Description: PGP signature
Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers
On Mon, Nov 21, 2011 at 06:39:26PM +0100, Robert Millan wrote: (replying with Debian hat this time) 2011/11/21 Kostik Belousov kostik...@gmail.com: There are some implementations that use FreeBSD kernel, and which could potentially benefit from providing its own value for __FreeBSD_kernel. Actually, we wouldn't be able to provide a different value for the macro (whatever its name). Our compiler simply doesn't know which version of the kernel it is building for. Only the headers do, but if we define it in the headers we'd just use the FreeBSD definitions. Our compiler defines __FreeBSD_kernel__ as an empty macro, I don't expect this will change because unlike with FreeBSD, on Debian there are strong technical limitations to making it a number. If __FreeBSD_kernel__ is to be defined in FreeBSD land, may I suggest that it is defined as an empty macro as well? This covers the vast majority of cases (e.g. like the ones in my initial patch which started this discussion), and it doesn't preclude the possibility that this macro becomes a number without breaking backward compatibility. OTOH once you define it as a number, it becomes relevant whether you did it with #ifndef or with #undef, and so you have to commit to it. Just to make it clear: It's no problem to me if it's defined as a number, but it doesn't help much either. At least from Debian POV it's not worth making a big argument about. It could be a good idea from FreeBSD POV, but please only do this if it's useful to FreeBSD. I am fine with __FreeBSD_kernel being empty, please submit the patch. pgpwSRj8FEROM.pgp Description: PGP signature
RLIMIT_DATA and malloc(3) use of mmap(2)
I was reminded about the patch I wrote for Igor Sysoev some time ago. The issue the patch tries to handle is that jemalloc uses mmap() instead of sbrk() for pages allocation, and thus RLIMIT_DATA limit is no longer effective to put the bounds on the process heap. Since reverting to sbrk for such purpose is worse then the issue itself, I proposed a solution of 'self-restricting malloc'. The patch adds a flag MAP_DATALIMIT to the flags argument of mmap(2), which instructs the system to account the mapping in the RLIMIT_DATA resource count. The malloc(3) also gets new option 'L' to enable passing MAP_DATALIMIT to mmap() when allocating pages. By default, the 'L' option is not activated. Now, if user wants to ensure that process heap is restricted by the ulimit -d and still use mmap() for jemalloc, he supplies the option using any mechanism. The behaviour is voluntaristic, to prevent the trashing use RLIMIT_SWAP. Do people consider the facility useful ? Any comments for the patch itself ? http://people.freebsd.org/~kib/misc/map_datalimit.1.patch pgpX3zgzJhYr5.pgp Description: PGP signature
Re: RLIMIT_DATA and malloc(3) use of mmap(2)
On Tue, Nov 22, 2011 at 07:43:57PM +0400, Maxim Dounin wrote: Hello! On Tue, Nov 22, 2011 at 02:44:10PM +0200, Kostik Belousov wrote: I was reminded about the patch I wrote for Igor Sysoev some time ago. The issue the patch tries to handle is that jemalloc uses mmap() instead of sbrk() for pages allocation, and thus RLIMIT_DATA limit is no longer effective to put the bounds on the process heap. Since reverting to sbrk for such purpose is worse then the issue itself, I proposed a solution of 'self-restricting malloc'. Just a little clarification for others: currently, there is no way to *safely* limit memory usage of a process while using jemalloc with mmap(). The only thing available is RLIMIT_VMEM, but it's not safe as it may be reached on stack grow (leaving no possibility for an application to handle this). The patch adds a flag MAP_DATALIMIT to the flags argument of mmap(2), which instructs the system to account the mapping in the RLIMIT_DATA resource count. The malloc(3) also gets new option 'L' to enable passing MAP_DATALIMIT to mmap() when allocating pages. By default, the 'L' option is not activated. Now, if user wants to ensure that process heap is restricted by the ulimit -d and still use mmap() for jemalloc, he supplies the option using any mechanism. The behaviour is voluntaristic, to prevent the trashing use RLIMIT_SWAP. Do people consider the facility useful ? Yes, at least some way to safely limit process memory usage is certainly needed. It's a bit sad this isn't enabled by default, but it's probably too late for this. RLIMIT_DATA was (almost) a noop for too long and making it work again to limit all memory allocations will break POLA. Any comments for the patch itself ? http://people.freebsd.org/~kib/misc/map_datalimit.1.patch Patch looks ok for me (though I'm not a VM expert). Another possible aproach would be to introduce separate anonymous (private?) mmap limit, this will allow to do essentially the same thing in a bit more consistent (IMHO) manner. This is already done in some form as the per-user swap limit. Converting swap limit to the per-process limit raises the architectural questions, due to shadow chains of the backing vm objects. In particular, we have to account the anonymous memory used for backing the changed pages from the writeable private mapping of the files. Similar issues appear due to fork(). Anyway, the patch needs testers before I will push it forward. pgpJsHif0AqVL.pgp Description: PGP signature
Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers
On Mon, Nov 21, 2011 at 01:45:29PM +1100, Bruce Evans wrote: On Sun, 20 Nov 2011, Kostik Belousov wrote: On Sun, Nov 20, 2011 at 12:40:42PM +0100, Robert Millan wrote: On Sat, Nov 19, 2011 at 07:56:20PM +0200, Kostik Belousov wrote: I fully agree with an idea that compiler is not an authorative source of the knowledge of the FreeBSD version. Even more, I argue that we shall not rely on compiler for this at all. Ideally, we should be able to build FreeBSD using the stock compilers without local modifications. Thus relying on the symbols defined by compiler, and not the source is the thing to avoid and consistently remove. We must do this to be able to use third-party tooldchain for FreeBSD builds. That said, why not define __FreeBSD_kernel as equal to __FreeBSD_version ? And then make more strong wording about other systems that use the macro, e.g. remove 'may' from the kFreeBSD example. Also, please remove the smile from comment. Ok. New patch attached. And the last, question, why not do #ifndef __FreeBSD_kernel__ #define __FreeBSD_kernel__ __FreeBSD_version #endif ? #undef is too big tools tool apply there, IMO. #ifndef is too big to apply here, IMO :-). __FreeBSD_kernel__ is in the implementation namespace, so any previous definition of it is a bug. The #ifndef breaks the warning for this bug. FreeBSD does not need it at all. There are some implementations that use FreeBSD kernel, and which could potentially benefit from providing its own value for __FreeBSD_kernel. I was only tried to be nice for the out-of-tree implementations, it boils down to kFreeBSD project preferences. And why not use FreeBSD style? In KNF, the fields are separated by tabs, not spaces. In FreeBSD style, trailing underscores are not used for names in the implementation namespace, since they have no effect on namespaces. The name __FreeBSD_version is an example of this. Does existing practice require using the name with the trailing underscores? pgpM8lL9ZnvAT.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Sun, Nov 20, 2011 at 05:37:33PM +0100, Attilio Rao wrote: 2011/11/18 Attilio Rao atti...@freebsd.org: Please consider: http://www.freebsd.org/~attilio/mutexfileline2.patch This is now committed as r227758,227759, you can update your patch now. Here is it. diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index d592ac0..74e5126 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -2843,6 +2843,34 @@ vm_page_test_dirty(vm_page_t m) vm_page_dirty(m); } +void +vm_page_lock_KBI(vm_page_t m, const char *file, int line) +{ + + mtx_lock_flags_(vm_page_lockptr(m), 0, file, line); +} + +void +vm_page_unlock_KBI(vm_page_t m, const char *file, int line) +{ + + mtx_unlock_flags_(vm_page_lockptr(m), 0, file, line); +} + +int +vm_page_trylock_KBI(vm_page_t m, const char *file, int line) +{ + + return (mtx_trylock_flags_(vm_page_lockptr(m), 0, file, line)); +} + +void +vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line) +{ + + mtx_assert_(vm_page_lockptr(m), a, file, line); +} + int so_zerocp_fullpage = 0; /* diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 151710d..fe0295b 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -218,11 +218,23 @@ extern struct vpglocks pa_lock[]; #definePA_LOCK_ASSERT(pa, a) mtx_assert(PA_LOCKPTR(pa), (a)) +#ifdef KLD_MODULE +#definevm_page_lock(m) vm_page_lock_KBI((m), LOCK_FILE, LOCK_LINE) +#definevm_page_unlock(m) vm_page_unlock_KBI((m), LOCK_FILE, LOCK_LINE) +#definevm_page_trylock(m) vm_page_trylock_KBI((m), LOCK_FILE, LOCK_LINE) +#ifdef INVARIANTS +#definevm_page_lock_assert(m, a) \ +vm_page_lock_assert_KBI((m), (a), LOCK_FILE, LOCK_LINE) +#else +#definevm_page_lock_assert(m, a) +#endif +#else /* KLD_MODULE */ #definevm_page_lockptr(m) (PA_LOCKPTR(VM_PAGE_TO_PHYS((m #definevm_page_lock(m) mtx_lock(vm_page_lockptr((m))) #definevm_page_unlock(m) mtx_unlock(vm_page_lockptr((m))) #definevm_page_trylock(m) mtx_trylock(vm_page_lockptr((m))) #definevm_page_lock_assert(m, a) mtx_assert(vm_page_lockptr((m)), (a)) +#endif #definevm_page_queue_free_mtx vm_page_queue_free_lock.data /* @@ -405,6 +417,11 @@ void vm_page_cowfault (vm_page_t); int vm_page_cowsetup(vm_page_t); void vm_page_cowclear (vm_page_t); +void vm_page_lock_KBI(vm_page_t m, const char *file, int line); +void vm_page_unlock_KBI(vm_page_t m, const char *file, int line); +int vm_page_trylock_KBI(vm_page_t m, const char *file, int line); +void vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line); + #ifdef INVARIANTS void vm_page_object_lock_assert(vm_page_t m); #defineVM_PAGE_OBJECT_LOCK_ASSERT(m) vm_page_object_lock_assert(m) pgpyNMDth73tZ.pgp Description: PGP signature
Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers
On Sun, Nov 20, 2011 at 12:40:42PM +0100, Robert Millan wrote: On Sat, Nov 19, 2011 at 07:56:20PM +0200, Kostik Belousov wrote: I fully agree with an idea that compiler is not an authorative source of the knowledge of the FreeBSD version. Even more, I argue that we shall not rely on compiler for this at all. Ideally, we should be able to build FreeBSD using the stock compilers without local modifications. Thus relying on the symbols defined by compiler, and not the source is the thing to avoid and consistently remove. We must do this to be able to use third-party tooldchain for FreeBSD builds. That said, why not define __FreeBSD_kernel as equal to __FreeBSD_version ? And then make more strong wording about other systems that use the macro, e.g. remove 'may' from the kFreeBSD example. Also, please remove the smile from comment. Ok. New patch attached. And the last, question, why not do #ifndef __FreeBSD_kernel__ #define __FreeBSD_kernel__ __FreeBSD_version #endif ? #undef is too big tools tool apply there, IMO. pgp1sbRkFtD8N.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Sun, Nov 20, 2011 at 07:02:14PM +0100, Attilio Rao wrote: 2011/11/20 Kostik Belousov kostik...@gmail.com: +#define vm_page_lock_assert(m, a) \ + vm_page_lock_assert_KBI((m), (a), LOCK_FILE, LOCK_LINE) I think you should put the \ in the last tab and also, for consistency, you may want to use __FILE__ and __LINE__ for assert (or maybe I should also switch mutex.h to use LOCK_FILE and LOCK_LINE at some point?). I never saw the requirement for the backslash. I am consistent with PA_UNLOCK_COND() several lines above. Changed assert to use __FILE/LINE__. +#else +#define vm_page_lock_assert(m, a) +#endif +#else /* KLD_MODULE */ This should be /* !KLD_MODULE */, I guess? Changed. #define vm_page_lockptr(m) This is not defined for the KLD_MODULE case? Yes, explicitely. This was discussed. http://lists.freebsd.org/pipermail/freebsd-current/2011-November/029009.html diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index d592ac0..74e5126 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -2843,6 +2843,34 @@ vm_page_test_dirty(vm_page_t m) vm_page_dirty(m); } +void +vm_page_lock_KBI(vm_page_t m, const char *file, int line) +{ + + mtx_lock_flags_(vm_page_lockptr(m), 0, file, line); +} + +void +vm_page_unlock_KBI(vm_page_t m, const char *file, int line) +{ + + mtx_unlock_flags_(vm_page_lockptr(m), 0, file, line); +} + +int +vm_page_trylock_KBI(vm_page_t m, const char *file, int line) +{ + + return (mtx_trylock_flags_(vm_page_lockptr(m), 0, file, line)); +} + +void +vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line) +{ + + mtx_assert_(vm_page_lockptr(m), a, file, line); +} + int so_zerocp_fullpage = 0; /* diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 151710d..1fab735 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -218,11 +218,23 @@ extern struct vpglocks pa_lock[]; #definePA_LOCK_ASSERT(pa, a) mtx_assert(PA_LOCKPTR(pa), (a)) +#ifdef KLD_MODULE +#definevm_page_lock(m) vm_page_lock_KBI((m), LOCK_FILE, LOCK_LINE) +#definevm_page_unlock(m) vm_page_unlock_KBI((m), LOCK_FILE, LOCK_LINE) +#definevm_page_trylock(m) vm_page_trylock_KBI((m), LOCK_FILE, LOCK_LINE) +#ifdef INVARIANTS +#definevm_page_lock_assert(m, a) \ +vm_page_lock_assert_KBI((m), (a), __FILE__, __LINE__) +#else +#definevm_page_lock_assert(m, a) +#endif +#else /* !KLD_MODULE */ #definevm_page_lockptr(m) (PA_LOCKPTR(VM_PAGE_TO_PHYS((m #definevm_page_lock(m) mtx_lock(vm_page_lockptr((m))) #definevm_page_unlock(m) mtx_unlock(vm_page_lockptr((m))) #definevm_page_trylock(m) mtx_trylock(vm_page_lockptr((m))) #definevm_page_lock_assert(m, a) mtx_assert(vm_page_lockptr((m)), (a)) +#endif #definevm_page_queue_free_mtx vm_page_queue_free_lock.data /* @@ -405,6 +417,11 @@ void vm_page_cowfault (vm_page_t); int vm_page_cowsetup(vm_page_t); void vm_page_cowclear (vm_page_t); +void vm_page_lock_KBI(vm_page_t m, const char *file, int line); +void vm_page_unlock_KBI(vm_page_t m, const char *file, int line); +int vm_page_trylock_KBI(vm_page_t m, const char *file, int line); +void vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line); + #ifdef INVARIANTS void vm_page_object_lock_assert(vm_page_t m); #defineVM_PAGE_OBJECT_LOCK_ASSERT(m) vm_page_object_lock_assert(m) pgpRqyzyLnGBT.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Sun, Nov 20, 2011 at 08:04:21PM +0100, Attilio Rao wrote: This other patch converts sx to a similar interface which cleans up vm_map.c: http://www.freebsd.org/~attilio/sxfileline.patch What do you think about it? This one only changes the KBI ? Note that _sx suffix is not reserved. pgpCdjEU3nGge.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Sun, Nov 20, 2011 at 08:22:38PM +0100, Attilio Rao wrote: 2011/11/20 Kostik Belousov kostik...@gmail.com: On Sun, Nov 20, 2011 at 08:04:21PM +0100, Attilio Rao wrote: This other patch converts sx to a similar interface which cleans up vm_map.c: http://www.freebsd.org/~attilio/sxfileline.patch What do you think about it? This one only changes the KBI ? Note that _sx suffix is not reserved. In which sense? If you want to keep the shim support for KLD (thus the hard path) you will always need to keep an hard function and thus you still need a macro acting as a gate between the 'hard function' (or KLD version, if you prefer) and the fast case, that is where the _ suffix came from. As I see, right now kernel exports e.g. _sx_try_slock() for the hard path. After the patch, it will export sx_try_slock_() for the same purpose. The old modules, which call _sx_try_slock(), cannot be loaded into the patched kernel. Am I reading the patch wrong ? pgperrxtbJjmU.pgp Description: PGP signature
Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers
On Sat, Nov 19, 2011 at 10:32:50AM +0100, Robert Millan wrote: 2011/11/18 Robert Millan r...@freebsd.org: 2011/11/17 John Baldwin j...@freebsd.org: Hmm, I wonder if it's better to use the #ifndef approach rather than #undef so that when compilers are updated to DTRT we will honor their settings? Well, the compiler is supposed to know better than sys/param.h, I gave this a bit more thought The compiler knows about the system it was intended to build for, but sys/param.h *is* part of the system we're building for. It's impossible for sys/param.h to have the wrong information unless it's out of sync with the rest of the headers. As for the compiler, on FreeBSD it's hard for the compiler to be out of sync, because the system (in comparison with others) is tightly packaed and upgrade procedures are well defined. But if you take Debian, for example, we use the same compiler to build 8-STABLE, 9-STABLE and 10-CURRENT kernels. The compiler has no idea which version of FreeBSD the sources it is building come from. I wouldn't put compilers in general in a position where they're forced to know the FreeBSD major version beforehand because sys/param.h will give preference to the information coming from compiler. I propose this alternate patch which derives the major number from __FreeBSD_version instead. I fully agree with an idea that compiler is not an authorative source of the knowledge of the FreeBSD version. Even more, I argue that we shall not rely on compiler for this at all. Ideally, we should be able to build FreeBSD using the stock compilers without local modifications. Thus relying on the symbols defined by compiler, and not the source is the thing to avoid and consistently remove. We must do this to be able to use third-party tooldchain for FreeBSD builds. That said, why not define __FreeBSD_kernel as equal to __FreeBSD_version ? And then make more strong wording about other systems that use the macro, e.g. remove 'may' from the kFreeBSD example. Also, please remove the smile from comment. pgpkjFIJ5nrvj.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Fri, Nov 18, 2011 at 11:40:28AM +0100, Attilio Rao wrote: 2011/11/16 Kostik Belousov kostik...@gmail.com: On Tue, Nov 15, 2011 at 07:15:01PM +0100, Attilio Rao wrote: 2011/11/7 Kostik Belousov kostik...@gmail.com: On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote: Ok. I'll offer one final suggestion. Please consider an alternative suffix to func. Perhaps, kbi or KBI. In other words, something that hints at the function's reason for existing. Sure. Below is the extraction of only vm_page_lock() bits, together with the suggested rename. When Attilio provides the promised simplification of the mutex KPI, this can be reduced. My tentative patch is here: http://www.freebsd.org/~attilio/mutexfileline.patch I need to make more compile testing later, but it already compiles GENERIC + modules fine on HEAD. The patch provides a common entrypoint, option independent, for both fast case and debug/compat case. Additively, it almost entirely fixes the standard violation of the reserved namespace, as you described (the notable exception being the macro used in the fast path, that I want to fix as well, but in a separate commit). Now the file/line couplet can be passed to the _ suffix variant of the flag functions. Yes, this is exactly KPI that I would use when available for the vm_page_lock() patch. eadler@ reviewed the mutex.h comment. Please let me know what you think about it, as long as we agree on the patch I'll commit it. But I also agree with John that imposing large churn due to the elimination of the '__' prefix is too late now. At least it will make the change non-MFCable. Besides, we already lived with the names for 10+ years. I will be happy to have the part of the patch that exports the mtx_XXX_(mtx, file, line) defines which can be used without taking care of LOCK_DEBUG or MUTEX_NOINLINE in the consumer code. Ok, this patch should just add the compat stub: http://www.freebsd.org/~attilio/mutexfileline2.patch Am I right that I would use mtx_lock_(mtx, file, line) etc ? If yes, I am fine with it. I'll make more test-compiling later in the day, if you agree on it I will commit the patch tomorrow. Attilio -- Peace can only be achieved by understanding - A. Einstein pgp0D7fXhQgea.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Fri, Nov 18, 2011 at 11:56:55AM +0100, Attilio Rao wrote: 2011/11/18 Kostik Belousov kostik...@gmail.com: On Fri, Nov 18, 2011 at 11:40:28AM +0100, Attilio Rao wrote: 2011/11/16 Kostik Belousov kostik...@gmail.com: On Tue, Nov 15, 2011 at 07:15:01PM +0100, Attilio Rao wrote: 2011/11/7 Kostik Belousov kostik...@gmail.com: On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote: Ok. I'll offer one final suggestion. Please consider an alternative suffix to func. Perhaps, kbi or KBI. In other words, something that hints at the function's reason for existing. Sure. Below is the extraction of only vm_page_lock() bits, together with the suggested rename. When Attilio provides the promised simplification of the mutex KPI, this can be reduced. My tentative patch is here: http://www.freebsd.org/~attilio/mutexfileline.patch I need to make more compile testing later, but it already compiles GENERIC + modules fine on HEAD. The patch provides a common entrypoint, option independent, for both fast case and debug/compat case. Additively, it almost entirely fixes the standard violation of the reserved namespace, as you described (the notable exception being the macro used in the fast path, that I want to fix as well, but in a separate commit). Now the file/line couplet can be passed to the _ suffix variant of the flag functions. Yes, this is exactly KPI that I would use when available for the vm_page_lock() patch. eadler@ reviewed the mutex.h comment. Please let me know what you think about it, as long as we agree on the patch I'll commit it. But I also agree with John that imposing large churn due to the elimination of the '__' prefix is too late now. At least it will make the change non-MFCable. Besides, we already lived with the names for 10+ years. I will be happy to have the part of the patch that exports the mtx_XXX_(mtx, file, line) defines which can be used without taking care of LOCK_DEBUG or MUTEX_NOINLINE in the consumer code. Ok, this patch should just add the compat stub: http://www.freebsd.org/~attilio/mutexfileline2.patch Am I right that I would use mtx_lock_(mtx, file, line) etc ? If yes, I am fine with it. Yes that is correct. However, I'm a bit confused on one aspect: would you mind using _mtx_lock_flags() instead? If you don't mind the underscore namespace violation I think I can make a much smaller patch against HEAD for it. _mtx_lock_flags() is fine. The reserved names start with __ or _[A-Z]. Otherwise, the one now posted should be ok. Attilio -- Peace can only be achieved by understanding - A. Einstein pgpMbNbc9HZKI.pgp Description: PGP signature
Re: Stop scheduler on panic
On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote: On 17.11.2011 00:21, Andriy Gapon wrote: on 16/11/2011 21:27 Fabian Keil said the following: Kostik Belousovkostik...@gmail.com wrote: I was tricked into finishing the work by Andrey Gapon, who developed the patch to reliably stop other processors on panic. The patch greatly improves the chances of getting dump on panic on SMP host. I tested the patch trying to get a dump (from the debugger) for kern/162036, which currently results in the double fault reported in: http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html It didn't help, but also didn't make anything worse. Fabian The mi_switch recursion looks very familiar to me: mi_switch() at mi_switch+0x270 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 mi_switch() at mi_switch+0x275 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 [several pages of the previous three lines skipped] mi_switch() at mi_switch+0x275 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 intr_even_schedule_thread() at intr_event_schedule_thread+0xbb ahci_end_transaction() at ahci_end_transaction+0x398 ahci_ch_intr() at ahci_ch_intr+0x2b5 ahcipoll() at ahcipoll+0x15 xpt_polled_action() at xpt_polled_action+0xf7 In fact I once discussed with jhb this recursion triggered from a different place. To quote myself: avgspinlock_exit - critical_exit - mi_switch - kdb_switch - thread_unlock - spinlock_exit - critical_exit - mi_switch - ... avgin the kdb context avgthis issue seems to be triggered by td_owepreempt being true at the time kdb is entered avgand there of course has to be an initial spinlock_exit call somewhere avgin my case it's because of usb keyboard avgI wonder if it would make sense to clear td_owepreempt right before calling kdb_switch in mi_switch avginstead of in sched_switch() avgclearing td_owepreempt seems like a scheduler-independent operation to me avgor is it better to just skip locking in usb when kdb_active is set avg? The workaround described above should work in this case. Another possibility is to pessimize mtx_unlock_spin() implementations to check SCHEDULER_STOPPED() and to bypass any further actions in that case. But that would add unnecessary overhead to the sunny day code paths. Going further up the stack one can come up with the following proposals: - check SCHEDULER_STOPPED() swi_sched() and return early - do not call swi_sched() from xpt_done() if we somehow know that we are in a polling mode There is no flag in CAM now to indicate polling mode, but if needed, it should not be difficult to add one and not call swi_sched(). I have the following change for eons on my test boxes. Without it, I simply cannot get _any_ dump. diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c index 10b89c7..a38e42f 100644 --- a/sys/cam/cam_xpt.c +++ b/sys/cam/cam_xpt.c @@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb) TAILQ_INSERT_TAIL(cam_simq, sim, links); mtx_unlock(cam_simq_lock); sim-flags |= CAM_SIM_ON_DONEQ; - if (first) + if (first panicstr == NULL) swi_sched(cambio_ih, 0); } } pgp3ESGBc0NMm.pgp Description: PGP signature
Re: Stop scheduler on panic
On Thu, Nov 17, 2011 at 10:40:58AM +0200, Alexander Motin wrote: On 11/17/11 10:15, Kostik Belousov wrote: On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote: On 17.11.2011 00:21, Andriy Gapon wrote: on 16/11/2011 21:27 Fabian Keil said the following: Kostik Belousovkostik...@gmail.com wrote: I was tricked into finishing the work by Andrey Gapon, who developed the patch to reliably stop other processors on panic. The patch greatly improves the chances of getting dump on panic on SMP host. I tested the patch trying to get a dump (from the debugger) for kern/162036, which currently results in the double fault reported in: http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html It didn't help, but also didn't make anything worse. Fabian The mi_switch recursion looks very familiar to me: mi_switch() at mi_switch+0x270 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 mi_switch() at mi_switch+0x275 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 [several pages of the previous three lines skipped] mi_switch() at mi_switch+0x275 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 intr_even_schedule_thread() at intr_event_schedule_thread+0xbb ahci_end_transaction() at ahci_end_transaction+0x398 ahci_ch_intr() at ahci_ch_intr+0x2b5 ahcipoll() at ahcipoll+0x15 xpt_polled_action() at xpt_polled_action+0xf7 In fact I once discussed with jhb this recursion triggered from a different place. To quote myself: avgspinlock_exit - critical_exit - mi_switch - kdb_switch - thread_unlock - spinlock_exit - critical_exit - mi_switch - ... avgin the kdb context avgthis issue seems to be triggered by td_owepreempt being true at the time kdb is entered avgand there of course has to be an initial spinlock_exit call somewhere avgin my case it's because of usb keyboard avgI wonder if it would make sense to clear td_owepreempt right before calling kdb_switch in mi_switch avginstead of in sched_switch() avgclearing td_owepreempt seems like a scheduler-independent operation to me avgor is it better to just skip locking in usb when kdb_active is set avg? The workaround described above should work in this case. Another possibility is to pessimize mtx_unlock_spin() implementations to check SCHEDULER_STOPPED() and to bypass any further actions in that case. But that would add unnecessary overhead to the sunny day code paths. Going further up the stack one can come up with the following proposals: - check SCHEDULER_STOPPED() swi_sched() and return early - do not call swi_sched() from xpt_done() if we somehow know that we are in a polling mode There is no flag in CAM now to indicate polling mode, but if needed, it should not be difficult to add one and not call swi_sched(). I have the following change for eons on my test boxes. Without it, I simply cannot get _any_ dump. diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c index 10b89c7..a38e42f 100644 --- a/sys/cam/cam_xpt.c +++ b/sys/cam/cam_xpt.c @@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb) TAILQ_INSERT_TAIL(cam_simq, sim, links); mtx_unlock(cam_simq_lock); sim-flags |= CAM_SIM_ON_DONEQ; - if (first) + if (first panicstr == NULL) swi_sched(cambio_ih, 0); } } That should be OK for kernel dumping. I was thinking about CAM abusing polling not only for dumping. But looking on cases where it does it now, may be it is better to rewrite them instead. So, should I interpret your response as 'Reviewed by ? pgplu0xBQRD1h.pgp Description: PGP signature
Re: Stop scheduler on panic
On Thu, Nov 17, 2011 at 11:29:02AM +0200, Alexander Motin wrote: On 11/17/11 11:06, Kostik Belousov wrote: On Thu, Nov 17, 2011 at 10:40:58AM +0200, Alexander Motin wrote: On 11/17/11 10:15, Kostik Belousov wrote: On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote: On 17.11.2011 00:21, Andriy Gapon wrote: on 16/11/2011 21:27 Fabian Keil said the following: Kostik Belousovkostik...@gmail.com wrote: I was tricked into finishing the work by Andrey Gapon, who developed the patch to reliably stop other processors on panic. The patch greatly improves the chances of getting dump on panic on SMP host. I tested the patch trying to get a dump (from the debugger) for kern/162036, which currently results in the double fault reported in: http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html It didn't help, but also didn't make anything worse. Fabian The mi_switch recursion looks very familiar to me: mi_switch() at mi_switch+0x270 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 mi_switch() at mi_switch+0x275 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 [several pages of the previous three lines skipped] mi_switch() at mi_switch+0x275 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 intr_even_schedule_thread() at intr_event_schedule_thread+0xbb ahci_end_transaction() at ahci_end_transaction+0x398 ahci_ch_intr() at ahci_ch_intr+0x2b5 ahcipoll() at ahcipoll+0x15 xpt_polled_action() at xpt_polled_action+0xf7 In fact I once discussed with jhb this recursion triggered from a different place. To quote myself: avgspinlock_exit - critical_exit - mi_switch - kdb_switch - thread_unlock - spinlock_exit - critical_exit - mi_switch - ... avgin the kdb context avgthis issue seems to be triggered by td_owepreempt being true at the time kdb is entered avgand there of course has to be an initial spinlock_exit call somewhere avgin my case it's because of usb keyboard avgI wonder if it would make sense to clear td_owepreempt right before calling kdb_switch in mi_switch avginstead of in sched_switch() avgclearing td_owepreempt seems like a scheduler-independent operation to me avgor is it better to just skip locking in usb when kdb_active is set avg? The workaround described above should work in this case. Another possibility is to pessimize mtx_unlock_spin() implementations to check SCHEDULER_STOPPED() and to bypass any further actions in that case. But that would add unnecessary overhead to the sunny day code paths. Going further up the stack one can come up with the following proposals: - check SCHEDULER_STOPPED() swi_sched() and return early - do not call swi_sched() from xpt_done() if we somehow know that we are in a polling mode There is no flag in CAM now to indicate polling mode, but if needed, it should not be difficult to add one and not call swi_sched(). I have the following change for eons on my test boxes. Without it, I simply cannot get _any_ dump. diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c index 10b89c7..a38e42f 100644 --- a/sys/cam/cam_xpt.c +++ b/sys/cam/cam_xpt.c @@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb) TAILQ_INSERT_TAIL(cam_simq, sim, links); mtx_unlock(cam_simq_lock); sim-flags |= CAM_SIM_ON_DONEQ; - if (first) + if (first panicstr == NULL) swi_sched(cambio_ih, 0); } } That should be OK for kernel dumping. I was thinking about CAM abusing polling not only for dumping. But looking on cases where it does it now, may be it is better to rewrite them instead. So, should I interpret your response as 'Reviewed by ? It feels somehow dirty to me. I don't like these global variables. If you consider it is fine, proceed, I see no much harm. But if not, I can add polling flag to the CAM. Flip a coin for me. :) You promised to add the polling at summer' meeting in Kiev. Will you do it now ? pgpEdTAw2tcfa.pgp Description: PGP signature
Re: Stop scheduler on panic
On Thu, Nov 17, 2011 at 02:40:45PM +0200, Alexander Motin wrote: On 11/17/11 13:14, Kostik Belousov wrote: On Thu, Nov 17, 2011 at 11:29:02AM +0200, Alexander Motin wrote: On 11/17/11 11:06, Kostik Belousov wrote: On Thu, Nov 17, 2011 at 10:40:58AM +0200, Alexander Motin wrote: On 11/17/11 10:15, Kostik Belousov wrote: On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote: On 17.11.2011 00:21, Andriy Gapon wrote: on 16/11/2011 21:27 Fabian Keil said the following: Kostik Belousovkostik...@gmail.com wrote: I was tricked into finishing the work by Andrey Gapon, who developed the patch to reliably stop other processors on panic. The patch greatly improves the chances of getting dump on panic on SMP host. I tested the patch trying to get a dump (from the debugger) for kern/162036, which currently results in the double fault reported in: http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html It didn't help, but also didn't make anything worse. Fabian The mi_switch recursion looks very familiar to me: mi_switch() at mi_switch+0x270 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 mi_switch() at mi_switch+0x275 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 [several pages of the previous three lines skipped] mi_switch() at mi_switch+0x275 critical_exit() at critical_exit+0x9b spinlock_exit() at spinlock_exit+0x17 intr_even_schedule_thread() at intr_event_schedule_thread+0xbb ahci_end_transaction() at ahci_end_transaction+0x398 ahci_ch_intr() at ahci_ch_intr+0x2b5 ahcipoll() at ahcipoll+0x15 xpt_polled_action() at xpt_polled_action+0xf7 In fact I once discussed with jhb this recursion triggered from a different place. To quote myself: avgspinlock_exit - critical_exit - mi_switch - kdb_switch - thread_unlock - spinlock_exit - critical_exit - mi_switch - ... avgin the kdb context avgthis issue seems to be triggered by td_owepreempt being true at the time kdb is entered avgand there of course has to be an initial spinlock_exit call somewhere avgin my case it's because of usb keyboard avgI wonder if it would make sense to clear td_owepreempt right before calling kdb_switch in mi_switch avginstead of in sched_switch() avgclearing td_owepreempt seems like a scheduler-independent operation to me avgor is it better to just skip locking in usb when kdb_active is set avg? The workaround described above should work in this case. Another possibility is to pessimize mtx_unlock_spin() implementations to check SCHEDULER_STOPPED() and to bypass any further actions in that case. But that would add unnecessary overhead to the sunny day code paths. Going further up the stack one can come up with the following proposals: - check SCHEDULER_STOPPED() swi_sched() and return early - do not call swi_sched() from xpt_done() if we somehow know that we are in a polling mode There is no flag in CAM now to indicate polling mode, but if needed, it should not be difficult to add one and not call swi_sched(). I have the following change for eons on my test boxes. Without it, I simply cannot get _any_ dump. diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c index 10b89c7..a38e42f 100644 --- a/sys/cam/cam_xpt.c +++ b/sys/cam/cam_xpt.c @@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb) TAILQ_INSERT_TAIL(cam_simq, sim, links); mtx_unlock(cam_simq_lock); sim-flags |= CAM_SIM_ON_DONEQ; - if (first) + if (first panicstr == NULL) swi_sched(cambio_ih, 0); } } That should be OK for kernel dumping. I was thinking about CAM abusing polling not only for dumping. But looking on cases where it does it now, may be it is better to rewrite them instead. So, should I interpret your response as 'Reviewed by ? It feels somehow dirty to me. I don't like these global variables. If you consider it is fine, proceed, I see no much harm. But if not, I can add polling flag to the CAM. Flip a coin for me. :) You promised to add the polling at summer' meeting in Kiev. Will you do it now ? Sorry, I've probably forgot. The patch is attached. -- Alexander Motin Index: cam_sim.h === --- cam_sim.h (revision 227580) +++ cam_sim.h (working copy) @@ -104,7 +104,8 @@ u_int32_t flags; #define CAM_SIM_REL_TIMEOUT_PENDING 0x01 #define CAM_SIM_MPSAFE 0x02 -#define CAM_SIM_ON_DONEQ 0x04 +#define CAM_SIM_ON_DONEQ0x04 +#define CAM_SIM_POLLED 0x08
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Tue, Nov 15, 2011 at 07:15:01PM +0100, Attilio Rao wrote: 2011/11/7 Kostik Belousov kostik...@gmail.com: On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote: Ok. I'll offer one final suggestion. Please consider an alternative suffix to func. Perhaps, kbi or KBI. In other words, something that hints at the function's reason for existing. Sure. Below is the extraction of only vm_page_lock() bits, together with the suggested rename. When Attilio provides the promised simplification of the mutex KPI, this can be reduced. My tentative patch is here: http://www.freebsd.org/~attilio/mutexfileline.patch I need to make more compile testing later, but it already compiles GENERIC + modules fine on HEAD. The patch provides a common entrypoint, option independent, for both fast case and debug/compat case. Additively, it almost entirely fixes the standard violation of the reserved namespace, as you described (the notable exception being the macro used in the fast path, that I want to fix as well, but in a separate commit). Now the file/line couplet can be passed to the _ suffix variant of the flag functions. Yes, this is exactly KPI that I would use when available for the vm_page_lock() patch. eadler@ reviewed the mutex.h comment. Please let me know what you think about it, as long as we agree on the patch I'll commit it. But I also agree with John that imposing large churn due to the elimination of the '__' prefix is too late now. At least it will make the change non-MFCable. Besides, we already lived with the names for 10+ years. I will be happy to have the part of the patch that exports the mtx_XXX_(mtx, file, line) defines which can be used without taking care of LOCK_DEBUG or MUTEX_NOINLINE in the consumer code. pgpOwIEH4H9gF.pgp Description: PGP signature
Re: [RFC] Enable nxstack by default
On Wed, Nov 16, 2011 at 01:09:18AM +0100, Oliver Pinter wrote: On 11/15/11, Jeremie Le Hen jere...@le-hen.org wrote: Hi, On Wed, Oct 19, 2011 at 12:37:44AM +0200, Oliver Pinter wrote: In NetBSD has been some PaX feature [0] implemented. (ASLR, W^X (~nxstack), mprotect restriction, veriexec, mmap randomization[2]...) [0] http://pax.grsecurity.net/docs/index.html [1] http://www.netbsd.org/~elad/recent/man/security.8.html [2] http://people.freebsd.org/~ssouhlal/testing/stackgap-20050527.diff Suleiman actually wrought two patches, one randomizing the stack (the one you pointed out) and another one randomizing non-fixed mmap(2) calls: http://people.freebsd.org/~ssouhlal/testing/mmap_random-20050528.diff FYI, they do not apply cleanly on recent source trees (the patches were made in 2005), but they can be applied with little fiddling. I'm running multiple 8.x production machines with them without any problem. Yeah, I use thins patch in 7-STABLE and 9-STABLE too. Patch for 9-STABLE has attached. One immediate issue, which is definitely not critical, is that the size of the stack of main thread becomes chopped by the random amount of bytes. This is not an issue for single-threaded process, because typical default stack size is around 64M. For the threaded process, libthr cuts the stack, see thr_init.c:init_main_thread(). There, the size of the stack is 2 or 4MB, and 64KB might be more significant part of it. Missed bit from the patch is some randomization at the load address for the PIE (which is the main feature of ASLR, I suspect). See imgact_elf.c:exec(), et_dyn_addr calculation. Another missed bit is the similar modification for freebsd32_copyout_strings(). The upper limit for the random offset for mmap() should be configurable in the same way as stack gap, instead of the dump enable/disable knob. There are numerous style violations in the patch, or rather, the patch fully violates the style. I've always wanted them to be committed as opt-in knobs, but I can't remember why they hadn't at the time. Cheers, -- Jeremie Le Hen Men are born free and equal. Later on, they're on their own. Jean Yanne pgp0wWrKbOxRT.pgp Description: PGP signature
Re: [head tinderbox] failure on amd64/amd64
On Wed, Nov 16, 2011 at 06:18:59PM +, FreeBSD Tinderbox wrote: TB --- 2011-11-16 15:30:00 - tinderbox 2.8 running on freebsd-current.sentex.ca TB --- 2011-11-16 15:30:00 - starting HEAD tinderbox run for amd64/amd64 TB --- 2011-11-16 15:30:00 - cleaning the object tree TB --- 2011-11-16 15:31:04 - cvsupping the source tree TB --- 2011-11-16 15:31:04 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca /tinderbox/HEAD/amd64/amd64/supfile TB --- 2011-11-16 15:31:16 - building world TB --- 2011-11-16 15:31:16 - CROSS_BUILD_TESTING=YES TB --- 2011-11-16 15:31:16 - MAKEOBJDIRPREFIX=/obj TB --- 2011-11-16 15:31:16 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2011-11-16 15:31:16 - SRCCONF=/dev/null TB --- 2011-11-16 15:31:16 - TARGET=amd64 TB --- 2011-11-16 15:31:16 - TARGET_ARCH=amd64 TB --- 2011-11-16 15:31:16 - TZ=UTC TB --- 2011-11-16 15:31:16 - __MAKE_CONF=/dev/null TB --- 2011-11-16 15:31:16 - cd /src TB --- 2011-11-16 15:31:16 - /usr/bin/make -B buildworld World build started on Wed Nov 16 15:31:17 UTC 2011 Rebuilding the temporary build tree stage 1.1: legacy release compatibility shims stage 1.2: bootstrap tools stage 2.1: cleaning up the object tree stage 2.2: rebuilding the object tree stage 2.3: build tools stage 3: cross tools stage 4.1: building includes stage 4.2: building libraries stage 4.3: make dependencies stage 4.4: building everything stage 5.1: building 32 bit shim libraries World build completed on Wed Nov 16 18:08:28 UTC 2011 TB --- 2011-11-16 18:08:28 - generating LINT kernel config TB --- 2011-11-16 18:08:28 - cd /src/sys/amd64/conf TB --- 2011-11-16 18:08:28 - /usr/bin/make -B LINT TB --- 2011-11-16 18:08:28 - cd /src/sys/amd64/conf TB --- 2011-11-16 18:08:28 - /usr/sbin/config -m LINT-NOINET TB --- 2011-11-16 18:08:28 - building LINT-NOINET kernel TB --- 2011-11-16 18:08:28 - CROSS_BUILD_TESTING=YES TB --- 2011-11-16 18:08:28 - MAKEOBJDIRPREFIX=/obj TB --- 2011-11-16 18:08:28 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2011-11-16 18:08:28 - SRCCONF=/dev/null TB --- 2011-11-16 18:08:28 - TARGET=amd64 TB --- 2011-11-16 18:08:28 - TARGET_ARCH=amd64 TB --- 2011-11-16 18:08:28 - TZ=UTC TB --- 2011-11-16 18:08:28 - __MAKE_CONF=/dev/null TB --- 2011-11-16 18:08:28 - cd /src TB --- 2011-11-16 18:08:28 - /usr/bin/make -B buildkernel KERNCONF=LINT-NOINET Kernel build for LINT-NOINET started on Wed Nov 16 18:08:28 UTC 2011 stage 1: configuring the kernel stage 2.1: cleaning up the object tree stage 2.2: rebuilding the object tree stage 2.3: build tools stage 3.1: making dependencies stage 3.2: building everything [...] cc -c -O2 -frename-registers -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -Wmissing-include-dirs -fdiagnostics-show-option -nostdinc -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -finline-limit=8000 --param inline-unit-growth=100 --param large-function-growth=1000 -DGPROF -falign-functions=16 -DGPROF4 -DGUPROF -fno-builtin -fno-omit-frame-pointer -mno-sse -mcmodel=kernel -mno-red-zone -mno-mmx -msoft-float -fno-asynchronous-unwind-tables -ffreestanding -fstack-protector -Werror -pg -mprofiler-epilogue /src/sys/fs/procfs/procfs_type.c cc -c -O2 -frename-registers -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -Wmissing-include-dirs -fdiagnostics-show-option -nostdinc -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -finline-limit=8000 --param inline-unit-growth=100 --param large-function-growth=1000 -DGPROF -falign-functions=16 -DGPROF4 -DGUPROF -fno-builtin -fno-omit-frame-pointer -mno-sse -mcmodel=kernel -mno-red-zone -mno-mmx -msoft-float -fno-asynchronous-unwind-tables -ffreestanding -fstack-protector -Werror -pg -mprofiler-epilogue /src/sys/fs/pseudofs/pseudofs.c cc -c -O2 -frename-registers -pipe -fno-strict-aliasing -std=c99 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -Wmissing-include-dirs -fdiagnostics-show-option -nostdinc -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -finline-limit=8000 --param inline-unit-growth=100 --param large-function-growth=1000 -DGPROF -falign-functions=16 -DGPROF4 -DGUPROF -fno-builtin -fno-omit-frame-pointer -mno-sse -mcmodel=kernel -mno-red-zone -mno-mmx -msoft-float -fno-asynchronous-unwind-tables -ffreestanding -fstack-protector -Werror -pg -mprofiler-epilogue
Re: Stop scheduler on panic
On Mon, Nov 14, 2011 at 12:06:48PM +0200, Andriy Gapon wrote: on 13/11/2011 10:32 Kostik Belousov said the following: I was tricked into finishing the work by Andrey Gapon, who developed the patch to reliably stop other processors on panic. The patch greatly improves the chances of getting dump on panic on SMP host. Several people already saw the patchset, and I remember that Andrey posted it to some lists. The change stops other (*) processors early upon the panic. This way, no parallel manipulation of the kernel memory is performed by CPUs. In particular, the kernel memory map is static. Patch prevents the panic thread from blocking and switching out. * - in the context of the description, other means not current. Since other threads are not run anymore, lock owner cannot release a lock which is required by panic thread. Due to this, we need to fake a lock acquisition after the panic, which adds minimal overhead to the locking cost. The patch tries to not add any overhead on the fast path of the lock acquire. The check for the after-panic condition was reduced to single memory access, done only when the quick cas lock attempt failed, and braced with __unlikely compiler hint. For now, the new mode of operation is disabled by default, since some further USB changes are needed to make USB keyboard usable in that environment. With the patch, getting a dump from the machine without debugger compiled in is much more realistic. Please comment, I will commit the change in 2 weeks unless strong reasons not to are given. http://people.freebsd.org/~kib/misc/stop_cpus_on_panic.1.patch On a more serious note: - some code in my latest version of the patch was contributed by or was based on the code or ideas contributed by jhb and mdf (so that attributions are not lost) Please provide me with proper attribution for contributors and testers. - there was a concern about how sync-on-panic would work About the latter, I have never really tested it. mdf has suggested to move the sync-on-panic code to a place after we ensure that there is only one CPU in panic(), but before we stop other CPUs. sync_on_panic is incompatible with the patch. I argue that it provides non-zero chance of damaging good filesystems even if panic was unrelated to the fs/bio/device layer. As an example, consider the case when other CPU was modifying in-memory representation of the metadata, and panic happen on this CPU. If you write half-changed block back, you make more damage to the filesystem then if you do not. The half-backed sync spoils any journaling or SU consistency guarantees. The issues of multithreading nature of our storage subsystem are secondary. The user who sets either tunable shall know what he does. I think that I've already sent you a list of the early testers for various WIP versions of the patch. I do not have the list. BTW, if you want, feel free to handle the commit youself. You definitely spent much more efforts on the stuff and deserve the credit. I was promised in private that a review will be provided during this weekend. pgpyIteKouhHY.pgp Description: PGP signature
Stop scheduler on panic
I was tricked into finishing the work by Andrey Gapon, who developed the patch to reliably stop other processors on panic. The patch greatly improves the chances of getting dump on panic on SMP host. Several people already saw the patchset, and I remember that Andrey posted it to some lists. The change stops other (*) processors early upon the panic. This way, no parallel manipulation of the kernel memory is performed by CPUs. In particular, the kernel memory map is static. Patch prevents the panic thread from blocking and switching out. * - in the context of the description, other means not current. Since other threads are not run anymore, lock owner cannot release a lock which is required by panic thread. Due to this, we need to fake a lock acquisition after the panic, which adds minimal overhead to the locking cost. The patch tries to not add any overhead on the fast path of the lock acquire. The check for the after-panic condition was reduced to single memory access, done only when the quick cas lock attempt failed, and braced with __unlikely compiler hint. For now, the new mode of operation is disabled by default, since some further USB changes are needed to make USB keyboard usable in that environment. With the patch, getting a dump from the machine without debugger compiled in is much more realistic. Please comment, I will commit the change in 2 weeks unless strong reasons not to are given. http://people.freebsd.org/~kib/misc/stop_cpus_on_panic.1.patch pgpCmKhLFDhgT.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote: Ok. I'll offer one final suggestion. Please consider an alternative suffix to func. Perhaps, kbi or KBI. In other words, something that hints at the function's reason for existing. Sure. Below is the extraction of only vm_page_lock() bits, together with the suggested rename. When Attilio provides the promised simplification of the mutex KPI, this can be reduced. diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index 389aea5..ea4ea34 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -2677,6 +2677,44 @@ vm_page_test_dirty(vm_page_t m) vm_page_dirty(m); } +void +vm_page_lock_KBI(vm_page_t m, const char *file, int line) +{ + +#if LOCK_DEBUG 0 || defined(MUTEX_NOINLINE) + _mtx_lock_flags(vm_page_lockptr(m), 0, file, line); +#else + __mtx_lock(vm_page_lockptr(m), 0, file, line); +#endif +} + +void +vm_page_unlock_KBI(vm_page_t m, const char *file, int line) +{ + +#if LOCK_DEBUG 0 || defined(MUTEX_NOINLINE) + _mtx_unlock_flags(vm_page_lockptr(m), 0, file, line); +#else + __mtx_unlock(vm_page_lockptr(m), curthread, 0, file, line); +#endif +} + +int +vm_page_trylock_KBI(vm_page_t m, const char *file, int line) +{ + + return (_mtx_trylock(vm_page_lockptr(m), 0, file, line)); +} + +void +vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line) +{ + +#ifdef INVARIANTS + _mtx_assert(vm_page_lockptr(m), a, file, line); +#endif +} + int so_zerocp_fullpage = 0; /* diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 7099b70..a5604b7 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -218,11 +218,23 @@ extern struct vpglocks pa_lock[]; #definePA_LOCK_ASSERT(pa, a) mtx_assert(PA_LOCKPTR(pa), (a)) +#ifdef KLD_MODULE +#definevm_page_lock(m) vm_page_lock_KBI((m), LOCK_FILE, LOCK_LINE) +#definevm_page_unlock(m) vm_page_unlock_KBI((m), LOCK_FILE, LOCK_LINE) +#definevm_page_trylock(m) vm_page_trylock_KBI((m), LOCK_FILE, LOCK_LINE) +#ifdef INVARIANTS +#definevm_page_lock_assert(m, a) \ +vm_page_lock_assert_KBI((m), (a), LOCK_FILE, LOCK_LINE) +#else +#definevm_page_lock_assert(m, a) +#endif +#else /* KLD_MODULE */ #definevm_page_lockptr(m) (PA_LOCKPTR(VM_PAGE_TO_PHYS((m #definevm_page_lock(m) mtx_lock(vm_page_lockptr((m))) #definevm_page_unlock(m) mtx_unlock(vm_page_lockptr((m))) #definevm_page_trylock(m) mtx_trylock(vm_page_lockptr((m))) #definevm_page_lock_assert(m, a) mtx_assert(vm_page_lockptr((m)), (a)) +#endif #definevm_page_queue_free_mtx vm_page_queue_free_lock.data /* @@ -403,6 +415,11 @@ void vm_page_cowfault (vm_page_t); int vm_page_cowsetup(vm_page_t); void vm_page_cowclear (vm_page_t); +void vm_page_lock_KBI(vm_page_t m, const char *file, int line); +void vm_page_unlock_KBI(vm_page_t m, const char *file, int line); +int vm_page_trylock_KBI(vm_page_t m, const char *file, int line); +void vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line); + #ifdef INVARIANTS void vm_page_object_lock_assert(vm_page_t m); #defineVM_PAGE_OBJECT_LOCK_ASSERT(m) vm_page_object_lock_assert(m) pgpxFZVFLBvYF.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Mon, Nov 07, 2011 at 11:47:59AM -0800, m...@freebsd.org wrote: On Mon, Nov 7, 2011 at 11:35 AM, Kostik Belousov kostik...@gmail.com wrote: On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote: Ok. I'll offer one final suggestion. Please consider an alternative suffix to func. Perhaps, kbi or KBI. In other words, something that hints at the function's reason for existing. Sure. Below is the extraction of only vm_page_lock() bits, together with the suggested rename. When Attilio provides the promised simplification of the mutex KPI, this can be reduced. diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index 389aea5..ea4ea34 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -2677,6 +2677,44 @@ vm_page_test_dirty(vm_page_t m) vm_page_dirty(m); } +void +vm_page_lock_KBI(vm_page_t m, const char *file, int line) +{ + +#if LOCK_DEBUG 0 || defined(MUTEX_NOINLINE) + _mtx_lock_flags(vm_page_lockptr(m), 0, file, line); +#else + __mtx_lock(vm_page_lockptr(m), 0, file, line); +#endif +} + +void +vm_page_unlock_KBI(vm_page_t m, const char *file, int line) +{ + +#if LOCK_DEBUG 0 || defined(MUTEX_NOINLINE) + _mtx_unlock_flags(vm_page_lockptr(m), 0, file, line); +#else + __mtx_unlock(vm_page_lockptr(m), curthread, 0, file, line); +#endif +} + +int +vm_page_trylock_KBI(vm_page_t m, const char *file, int line) +{ + + return (_mtx_trylock(vm_page_lockptr(m), 0, file, line)); +} + +void +vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line) +{ + +#ifdef INVARIANTS + _mtx_assert(vm_page_lockptr(m), a, file, line); +#endif +} + int so_zerocp_fullpage = 0; /* diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 7099b70..a5604b7 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -218,11 +218,23 @@ extern struct vpglocks pa_lock[]; #define PA_LOCK_ASSERT(pa, a) mtx_assert(PA_LOCKPTR(pa), (a)) +#ifdef KLD_MODULE +#define vm_page_lock(m) vm_page_lock_KBI((m), LOCK_FILE, LOCK_LINE) +#define vm_page_unlock(m) vm_page_unlock_KBI((m), LOCK_FILE, LOCK_LINE) +#define vm_page_trylock(m) vm_page_trylock_KBI((m), LOCK_FILE, LOCK_LINE) +#ifdef INVARIANTS +#define vm_page_lock_assert(m, a) \ + vm_page_lock_assert_KBI((m), (a), LOCK_FILE, LOCK_LINE) +#else +#define vm_page_lock_assert(m, a) +#endif +#else /* KLD_MODULE */ #define vm_page_lockptr(m) (PA_LOCKPTR(VM_PAGE_TO_PHYS((m Is it not possible to have vm_page_lockptr() be a function for the KLD_MODULE case? Because then the vm_page_lock() functions and friends would all just use mtx_lock, etc., in both the KLD and non-KLD case. Or am I missing something? It is possible, but I tried to avoid exposing lockptr. IMHO vm_page_lockptr() is an implementation detail. Please also see my other response to Alan regarding the relocking macros. pgpYgFHehhwCS.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Sat, Nov 05, 2011 at 03:00:58PM -0500, Alan Cox wrote: On 11/05/2011 10:15, Kostik Belousov wrote: On Sat, Nov 05, 2011 at 07:37:48AM -0700, m...@freebsd.org wrote: On Sat, Nov 5, 2011 at 7:13 AM, Kostik Belousovkostik...@gmail.com wrote: On Fri, Nov 04, 2011 at 06:03:39PM +0200, Kostik Belousov wrote: Below is the KBI patch after vm_page_bits_t merge is done. Again, I did not spent time converting all in-tree consumers from the (potentially) loadable modules to the new KPI until it is agreed upon. I like my bikeshed orange... I would think a more canonical name would be get/set rather than read/write, especially since these operations aren't reading and writing the page (neither are they getting the page, but at least set is pretty unambiguous). Look at the vm_page.h:385. vm_page_set_valid() is already taken. I don't feel good about creating an interface under which we have functions named vm_page_set_valid() and vm_page_write_valid(). I think that we should take a step back and look at the whole of set of functions that exist for manipulating the page's valid and dirty field and see if we can come up with a logical schema for naming them. I wouldn't then be surprised if this results in renaming some of the existing functions. However, this should not delay the changes to address the vm_page_lock problem. I had two questions about that part of your patch. First, is there any reason that you didn't include vm_page_lockptr()? If we created the page locking macros that you, jhb@, and I were talking about last week, then vm_page_lockptr() would be required. Second, there seems to be precedent for naming the locking functions _vm_page_lock() instead of vm_page_lock_func(), for example, the mutex code, i.e., sys/mutex.h, and the vm map locking functions. I think vm_page_lockptr() should be included when some kind of reloc-iterating macros are actually introduced into the tree. And, I have a hope that they can be wrapped around a function with the signature like void vm_page_relock(vm_page_t locked_page, vm_page_t unlocked_page); which moves the lock from locked_page to unlocked_page. Regarding the _vm_page_lock() vs. vm_page_lock_func(), the mutex.h has a lot of violations in regard of the namespaces, IMO. The __* namespace is reserved for the language implementation, so our freestanding program (kernel) ignores the requirements of the C standard with the names like __mtx_lock_spin(). Using the name _vm_page_lock() is valid, but makes it not unreasonable for other developers to introduce reserved names. So I decided to use the suffixes. vm_map.h locking is free of these violations. pgpr5pwKbvDyo.pgp Description: PGP signature
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Sun, Nov 06, 2011 at 07:22:51AM -0800, m...@freebsd.org wrote: On Sun, Nov 6, 2011 at 4:43 AM, Kostik Belousov kostik...@gmail.com wrote: Regarding the _vm_page_lock() vs. vm_page_lock_func(), the mutex.h has a lot of violations in regard of the namespaces, IMO. The __* namespace is reserved for the language implementation, so our freestanding program (kernel) ignores the requirements of the C standard with the names like __mtx_lock_spin(). Using the name _vm_page_lock() is valid, but makes it not unreasonable for other developers to introduce reserved names. So I decided to use the suffixes. vm_map.h locking is free of these violations. I'm pretty sure that when the C standard says, the implementation, they're referring to the compiler and OS it runs on. Which makes the FreeBSD kernel part of the implementation, which is precisely why so many headers have defines that start with __ and then, if certain posix defines are set, also uses non-__ versions of the name. For libc providing parts, required by standard, you are right. But our kernel is a freestanding program using a compiler, so in-kernel uses of the reserved namespace is a violation. pgp4CGtNzn8lW.pgp Description: PGP signature
Re: 9.0/i386 build failure
On Sat, Nov 05, 2011 at 02:20:03PM +0100, Dimitry Andric wrote: On 2011-11-04 20:09, Michael W. Lucas wrote: I suspect I'm building on a system that's too old, but it's worth asking. FreeBSD eyeball.lodden.com 9.0-CURRENT FreeBSD 9.0-CURRENT #0: Sat Aug 29 00:31:14 EDT 2009 mwlu...@stretchlimo.blackhelicopters.org:/usr/obj/usr/src/sys/GENERIC i386 csup today. no /etc/src.conf, /etc/make.conf only contains a perl definition. ... c++ -O2 -pipe -I/usr/src/usr.bin/clang/tblgen/../../../contrib/llvm/include -I/usr/src/usr.bin/clang/tblgen/../../../contrib/llvm/tools/clang/include -I/usr/src/usr.bin/clang/tblgen/../../../contrib/llvm/utils/TableGen -I. -I/usr/src/usr.bin/clang/tblgen/../../../contrib/llvm/../../lib/clang/include -DLLVM_ON_UNIX -DLLVM_ON_FREEBSD -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -DLLVM_HOSTTRIPLE=\i386-unknown-freebsd9.0\ -I/usr/obj/usr/src/tmp/legacy/usr/include -static -L/usr/obj/usr/src/tmp/legacy/usr/lib -o tblgen ARMDecoderEmitter.o AsmMatcherEmitter.o AsmWriterEmitter.o AsmWriterInst.o CallingConvEmitter.o CodeEmitterGen.o CodeGenDAGPatterns.o CodeGenInstruction.o CodeGenRegisters.o CodeGenTarget.o DAGISelEmitter.o DAGISelMatcher.o DAGISelMatcherEmitter.o DAGISelMatcherGen.o DAGISelMatcherOpt.o DisassemblerEmitter.o EDEmitter.o FastISelEmitter.o FixedLenDecoderEmitter.o InstrEnumEmitter.o InstrInfoEmitter.o IntrinsicEmitter.o PseudoLoweringEmitter.o RegisterInfoEmi tter.o SetTheory.o StringMatcher.o SubtargetEmitter.o TGValueTypes.o TableGen.o X86DisassemblerTables.o X86RecognizableInstr.o /usr/obj/usr/src/tmp/usr/src/usr.bin/clang/tblgen/../../../lib/clang/libllvmtablegen/libllvmtablegen.a /usr/obj/usr/src/tmp/usr/src/usr.bin/clang/tblgen/../../../lib/clang/libllvmsupport/libllvmsupport.a -legacy /usr/obj/usr/src/tmp/usr/src/usr.bin/clang/tblgen/../../../lib/clang/libllvmsupport/libllvmsupport.a(Atomic.o)(.text+0x25): In function `llvm::sys::AtomicIncrement(unsigned int volatile*)': : undefined reference to `__sync_add_and_fetch_4' Hm, the only way I can imagine this happening, is then you compile your system with -march=i386. In that case the atomic primitives, such as __sync_add_and_fetch, are called as external functions, which would lead to a link error like the above. I tried building 9-STABLE from a 9-CURRENT box that was not updated since April 2011, and from a fairly recent 8-STABLE, but both did not exhibit this issue. Are you using any special other settings, e.g. in your environment, or on the make command line? In case, it would be handy if you post the full buildlog somewhere, so we can see which flags have been used to compile the llvmsupport library. The system gcc was changed to assume march=486 some time ago. I suppose that the current-9 system is before the change, see r198344. pgpZInM9LLhcY.pgp Description: PGP signature
vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Fri, Nov 04, 2011 at 06:03:39PM +0200, Kostik Belousov wrote: Below is the KBI patch after vm_page_bits_t merge is done. Again, I did not spent time converting all in-tree consumers from the (potentially) loadable modules to the new KPI until it is agreed upon. diff --git a/sys/nfsclient/nfs_bio.c b/sys/nfsclient/nfs_bio.c index 305c189..7264cd1 100644 --- a/sys/nfsclient/nfs_bio.c +++ b/sys/nfsclient/nfs_bio.c @@ -128,7 +128,7 @@ nfs_getpages(struct vop_getpages_args *ap) * can only occur at the file EOF. */ VM_OBJECT_LOCK(object); - if (pages[ap-a_reqpage]-valid != 0) { + if (vm_page_read_valid(pages[ap-a_reqpage]) != 0) { for (i = 0; i npages; ++i) { if (i != ap-a_reqpage) { vm_page_lock(pages[i]); @@ -198,16 +198,16 @@ nfs_getpages(struct vop_getpages_args *ap) /* * Read operation filled an entire page */ - m-valid = VM_PAGE_BITS_ALL; - KASSERT(m-dirty == 0, + vm_page_write_valid(m, VM_PAGE_BITS_ALL); + KASSERT(vm_page_read_dirty(m) == 0, (nfs_getpages: page %p is dirty, m)); } else if (size toff) { /* * Read operation filled a partial page. */ - m-valid = 0; + vm_page_write_valid(m, 0); vm_page_set_valid(m, 0, size - toff); - KASSERT(m-dirty == 0, + KASSERT(vm_page_read_dirty(m) == 0, (nfs_getpages: page %p is dirty, m)); } else { /* diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index 389aea5..2f41e70 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -2677,6 +2677,66 @@ vm_page_test_dirty(vm_page_t m) vm_page_dirty(m); } +void +vm_page_lock_func(vm_page_t m, const char *file, int line) +{ + +#if LOCK_DEBUG 0 || defined(MUTEX_NOINLINE) + _mtx_lock_flags(vm_page_lockptr(m), 0, file, line); +#else + __mtx_lock(vm_page_lockptr(m), 0, file, line); +#endif +} + +void +vm_page_unlock_func(vm_page_t m, const char *file, int line) +{ + +#if LOCK_DEBUG 0 || defined(MUTEX_NOINLINE) + _mtx_unlock_flags(vm_page_lockptr(m), 0, file, line); +#else + __mtx_unlock(vm_page_lockptr(m), curthread, 0, file, line); +#endif +} + +int +vm_page_trylock_func(vm_page_t m, const char *file, int line) +{ + + return (_mtx_trylock(vm_page_lockptr(m), 0, file, line)); +} + +void +vm_page_lock_assert_func(vm_page_t m, int a, const char *file, int line) +{ + +#ifdef INVARIANTS + _mtx_assert(vm_page_lockptr(m), a, file, line); +#endif +} + +vm_page_bits_t +vm_page_read_dirty_func(vm_page_t m) +{ + + return (m-dirty); +} + +vm_page_bits_t +vm_page_read_valid_func(vm_page_t m) +{ + + return (m-valid); +} + +void +vm_page_write_valid_func(vm_page_t m, vm_page_bits_t v) +{ + + m-valid = v; +} + + int so_zerocp_fullpage = 0; /* diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 7099b70..4f8f71e 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -218,12 +218,50 @@ extern struct vpglocks pa_lock[]; #definePA_LOCK_ASSERT(pa, a) mtx_assert(PA_LOCKPTR(pa), (a)) +#ifdef KLD_MODULE +#definevm_page_lock(m) vm_page_lock_func((m), LOCK_FILE, LOCK_LINE) +#definevm_page_unlock(m) vm_page_unlock_func((m), LOCK_FILE, LOCK_LINE) +#definevm_page_trylock(m) vm_page_trylock_func((m), LOCK_FILE, LOCK_LINE) +#ifdef INVARIANTS +#definevm_page_lock_assert(m, a) \ +vm_page_lock_assert_func((m), (a), LOCK_FILE, LOCK_LINE) +#else +#definevm_page_lock_assert(m, a) +#endif + +#definevm_page_read_dirty(m) vm_page_read_dirty_func((m)) +#definevm_page_read_valid(m) vm_page_read_valid_func((m)) +#definevm_page_write_valid(m, v) vm_page_write_valid_func((m), (v)) + +#else /* KLD_MODULE */ #definevm_page_lockptr(m) (PA_LOCKPTR(VM_PAGE_TO_PHYS((m #definevm_page_lock(m) mtx_lock(vm_page_lockptr((m))) #definevm_page_unlock(m) mtx_unlock(vm_page_lockptr((m))) #definevm_page_trylock(m) mtx_trylock(vm_page_lockptr((m))) #definevm_page_lock_assert(m, a) mtx_assert(vm_page_lockptr((m)), (a)) +static inline vm_page_bits_t +vm_page_read_dirty(vm_page_t m) +{ + + return (m-dirty); +} + +static inline vm_page_bits_t +vm_page_read_valid(vm_page_t m) +{ + + return (m-valid); +} + +static inline void +vm_page_write_valid(vm_page_t m, vm_page_bits_t v) +{ + + m-valid = v; +} +#endif + #definevm_page_queue_free_mtx vm_page_queue_free_lock.data /* * These are the flags defined for vm_page
Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]
On Sat, Nov 05, 2011 at 07:37:48AM -0700, m...@freebsd.org wrote: On Sat, Nov 5, 2011 at 7:13 AM, Kostik Belousov kostik...@gmail.com wrote: On Fri, Nov 04, 2011 at 06:03:39PM +0200, Kostik Belousov wrote: Below is the KBI patch after vm_page_bits_t merge is done. Again, I did not spent time converting all in-tree consumers from the (potentially) loadable modules to the new KPI until it is agreed upon. I like my bikeshed orange... I would think a more canonical name would be get/set rather than read/write, especially since these operations aren't reading and writing the page (neither are they getting the page, but at least set is pretty unambiguous). Look at the vm_page.h:385. vm_page_set_valid() is already taken. pgpw68GTgJGiJ.pgp Description: PGP signature
Re: 9.0/i386 build failure
On Sat, Nov 05, 2011 at 08:06:26PM +0100, Dimitry Andric wrote: On 2011-11-05 14:28, Kostik Belousov wrote: On Sat, Nov 05, 2011 at 02:20:03PM +0100, Dimitry Andric wrote: On 2011-11-04 20:09, Michael W. Lucas wrote: ... : undefined reference to `__sync_add_and_fetch_4' ... The system gcc was changed to assume march=486 some time ago. I suppose that the current-9 system is before the change, see r198344. Yes, that is most likely the cause of this problem. It is reproducible if you set CC to 'gcc -march=i386' and CXX to 'g++ -march=i386'. At first, I thought it would be easily fixable, since when you use -march=i386, the macro __tune_i386__ is defined, so you can disable LLVM's use of atomic builtins. However, since r212286, libstdc++ is also configured to use atomic builtins, and any non-trivial C++ program will encounter this linking issue when compiling with -march=i386. A workaround could be this: Index: gnu/lib/libstdc++/config.h === --- gnu/lib/libstdc++/config.h (revision 227112) +++ gnu/lib/libstdc++/config.h (working copy) @@ -671,7 +671,7 @@ /* #undef VERSION */ /* Define if builtin atomic operations are supported on this host. */ -#if defined(__amd64__) || defined(__i386__) +#if defined(__amd64__) || (defined(__i386__) !defined(__tune_i386__)) #define _GLIBCXX_ATOMIC_BUILTINS 1 #endif but unfortunately during the bootstrap stage, the system includes are used, not those in the source tree... At the moment I don't know a clean way out of this, except setting CC to 'gcc -march=i486' and CXX to 'g++ -march=i486', and then building the bootstrap-tools stage should at least complete successfully. I believe that sources were bootstrapable after the commit, with the old world. The solution could be to checkout exactly r198344, build and install new world, and then checkout latest HEAD. Personally, I would install from any snapshot, 9.0RC1 is good enough for later HEAD rebuild. pgpjNP4OIDsRB.pgp Description: PGP signature
Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3
On Thu, Nov 03, 2011 at 12:51:10PM -0500, Alan Cox wrote: On 11/03/2011 08:24, Kostik Belousov wrote: On Thu, Nov 03, 2011 at 12:40:08AM -0500, Alan Cox wrote: On 11/02/2011 05:32, Andriy Gapon wrote: [restored cc: to the original poster] As Bruce Evans has pointed to me privately [I am not sure why privately], there is already an example in i386 and amd64 atomic.h, where operations are inlined for a kernel build, but presented as real (external) functions for a module build. You can search e.g. sys/amd64/include/atomic.h for KLD_MODULE. I think that the same treatment could/should be applied to vm_page_*lock* operations defined in sys/vm/vm_page.h. *snip* Yes, it should be. There are without question legitimate reasons for a module to acquire a page lock. I agree. Also, I think that we should use the opportunity to also isolate the modules from the struct vm_page layout changes. As example, I converted nfsclient.ko. I would suggest introducing the vm_page_bits_t change first. If, at the same time, you change the return type from the function vm_page_bits() to use vm_page_bits_t, then I believe it is straightforward to fix all of the places in vm_page.c that don't properly handle a 32 KB page size. Ok, I think this is orhtohonal to the ABI issue. The vm_page_bits_t applied. diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index f14da4a..f22c34a 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -137,7 +137,7 @@ SYSCTL_INT(_vm, OID_AUTO, tryrelock_restart, CTLFLAG_RD, static uma_zone_t fakepg_zone; -static void vm_page_clear_dirty_mask(vm_page_t m, int pagebits); +static void vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits); static void vm_page_queue_remove(int queue, vm_page_t m); static void vm_page_enqueue(int queue, vm_page_t m); static void vm_page_init_fakepg(void *dummy); @@ -2350,7 +2350,7 @@ retrylookup: * * Inputs are required to range within a page. */ -int +vm_page_bits_t vm_page_bits(int base, int size) { int first_bit; @@ -2367,7 +2367,8 @@ vm_page_bits(int base, int size) first_bit = base DEV_BSHIFT; last_bit = (base + size - 1) DEV_BSHIFT; - return ((2 last_bit) - (1 first_bit)); + return (((vm_page_bits_t)2 last_bit) - + ((vm_page_bits_t)1 first_bit)); } /* @@ -2426,7 +2427,7 @@ vm_page_set_valid(vm_page_t m, int base, int size) * Clear the given bits from the specified page's dirty field. */ static __inline void -vm_page_clear_dirty_mask(vm_page_t m, int pagebits) +vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits) { uintptr_t addr; #if PAGE_SIZE 16384 @@ -2455,7 +2456,6 @@ vm_page_clear_dirty_mask(vm_page_t m, int pagebits) */ addr = (uintptr_t)m-dirty; #if PAGE_SIZE == 32768 -#error pagebits too short atomic_clear_64((uint64_t *)addr, pagebits); #elif PAGE_SIZE == 16384 atomic_clear_32((uint32_t *)addr, pagebits); @@ -2492,8 +2492,8 @@ vm_page_clear_dirty_mask(vm_page_t m, int pagebits) void vm_page_set_validclean(vm_page_t m, int base, int size) { - u_long oldvalid; - int endoff, frag, pagebits; + vm_page_bits_t oldvalid, pagebits; + int endoff, frag; VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED); if (size == 0) /* handle degenerate case */ @@ -2505,7 +2505,7 @@ vm_page_set_validclean(vm_page_t m, int base, int size) * first block. */ if ((frag = base ~(DEV_BSIZE - 1)) != base - (m-valid (1 (base DEV_BSHIFT))) == 0) + (m-valid ((vm_page_bits_t)1 (base DEV_BSHIFT))) == 0) pmap_zero_page_area(m, frag, base - frag); /* @@ -2515,7 +2515,7 @@ vm_page_set_validclean(vm_page_t m, int base, int size) */ endoff = base + size; if ((frag = endoff ~(DEV_BSIZE - 1)) != endoff - (m-valid (1 (endoff DEV_BSHIFT))) == 0) + (m-valid ((vm_page_bits_t)1 (endoff DEV_BSHIFT))) == 0) pmap_zero_page_area(m, endoff, DEV_BSIZE - (endoff (DEV_BSIZE - 1))); @@ -2585,7 +2585,7 @@ vm_page_clear_dirty(vm_page_t m, int base, int size) void vm_page_set_invalid(vm_page_t m, int base, int size) { - int bits; + vm_page_bits_t bits; VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED); KASSERT((m-oflags VPO_BUSY) == 0, @@ -2656,9 +2656,10 @@ vm_page_zero_invalid(vm_page_t m, boolean_t setvalid) int vm_page_is_valid(vm_page_t m, int base, int size) { - int bits = vm_page_bits(base, size); + vm_page_bits_t bits; VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED); + bits = vm_page_bits(base, size); if (m-valid ((m-valid bits) == bits)) return 1; else diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 23637bb..7d1a529 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -113,6 +113,21 @@ TAILQ_HEAD(pglist
Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3
On Fri, Nov 04, 2011 at 10:09:09AM -0500, Alan Cox wrote: On 11/04/2011 05:08, Kostik Belousov wrote: On Thu, Nov 03, 2011 at 12:51:10PM -0500, Alan Cox wrote: I would suggest introducing the vm_page_bits_t change first. If, at the same time, you change the return type from the function vm_page_bits() to use vm_page_bits_t, then I believe it is straightforward to fix all of the places in vm_page.c that don't properly handle a 32 KB page size. Ok, I think this is orhtohonal to the ABI issue. The vm_page_bits_t applied. Agreed, which is why I wanted to separate the two things. I've made a few comments below. ... Looks good. I will make universe the patch below. Any further notes ? diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index f14da4a..f398453 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -137,7 +137,7 @@ SYSCTL_INT(_vm, OID_AUTO, tryrelock_restart, CTLFLAG_RD, static uma_zone_t fakepg_zone; -static void vm_page_clear_dirty_mask(vm_page_t m, int pagebits); +static void vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits); static void vm_page_queue_remove(int queue, vm_page_t m); static void vm_page_enqueue(int queue, vm_page_t m); static void vm_page_init_fakepg(void *dummy); @@ -2350,7 +2350,7 @@ retrylookup: * * Inputs are required to range within a page. */ -int +vm_page_bits_t vm_page_bits(int base, int size) { int first_bit; @@ -2367,7 +2367,8 @@ vm_page_bits(int base, int size) first_bit = base DEV_BSHIFT; last_bit = (base + size - 1) DEV_BSHIFT; - return ((2 last_bit) - (1 first_bit)); + return (((vm_page_bits_t)2 last_bit) - + ((vm_page_bits_t)1 first_bit)); } /* @@ -2426,7 +2427,7 @@ vm_page_set_valid(vm_page_t m, int base, int size) * Clear the given bits from the specified page's dirty field. */ static __inline void -vm_page_clear_dirty_mask(vm_page_t m, int pagebits) +vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits) { uintptr_t addr; #if PAGE_SIZE 16384 @@ -2455,7 +2456,6 @@ vm_page_clear_dirty_mask(vm_page_t m, int pagebits) */ addr = (uintptr_t)m-dirty; #if PAGE_SIZE == 32768 -#error pagebits too short atomic_clear_64((uint64_t *)addr, pagebits); #elif PAGE_SIZE == 16384 atomic_clear_32((uint32_t *)addr, pagebits); @@ -2492,8 +2492,8 @@ vm_page_clear_dirty_mask(vm_page_t m, int pagebits) void vm_page_set_validclean(vm_page_t m, int base, int size) { - u_long oldvalid; - int endoff, frag, pagebits; + vm_page_bits_t oldvalid, pagebits; + int endoff, frag; VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED); if (size == 0) /* handle degenerate case */ @@ -2505,7 +2505,7 @@ vm_page_set_validclean(vm_page_t m, int base, int size) * first block. */ if ((frag = base ~(DEV_BSIZE - 1)) != base - (m-valid (1 (base DEV_BSHIFT))) == 0) + (m-valid ((vm_page_bits_t)1 (base DEV_BSHIFT))) == 0) pmap_zero_page_area(m, frag, base - frag); /* @@ -2515,7 +2515,7 @@ vm_page_set_validclean(vm_page_t m, int base, int size) */ endoff = base + size; if ((frag = endoff ~(DEV_BSIZE - 1)) != endoff - (m-valid (1 (endoff DEV_BSHIFT))) == 0) + (m-valid ((vm_page_bits_t)1 (endoff DEV_BSHIFT))) == 0) pmap_zero_page_area(m, endoff, DEV_BSIZE - (endoff (DEV_BSIZE - 1))); @@ -2585,7 +2585,7 @@ vm_page_clear_dirty(vm_page_t m, int base, int size) void vm_page_set_invalid(vm_page_t m, int base, int size) { - int bits; + vm_page_bits_t bits; VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED); KASSERT((m-oflags VPO_BUSY) == 0, @@ -2625,7 +2625,7 @@ vm_page_zero_invalid(vm_page_t m, boolean_t setvalid) */ for (b = i = 0; i = PAGE_SIZE / DEV_BSIZE; ++i) { if (i == (PAGE_SIZE / DEV_BSIZE) || - (m-valid (1 i)) + (m-valid ((vm_page_bits_t)1 i)) ) { if (i b) { pmap_zero_page_area(m, @@ -2656,9 +2656,10 @@ vm_page_zero_invalid(vm_page_t m, boolean_t setvalid) int vm_page_is_valid(vm_page_t m, int base, int size) { - int bits = vm_page_bits(base, size); + vm_page_bits_t bits; VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED); + bits = vm_page_bits(base, size); if (m-valid ((m-valid bits) == bits)) return 1; else diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 23637bb..e3eb08c 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -113,6 +113,20 @@ TAILQ_HEAD(pglist, vm_page); +#if PAGE_SIZE == 4096 +#define VM_PAGE_BITS_ALL 0xffu +typedef uint8_t vm_page_bits_t; +#elif PAGE_SIZE == 8192 +#define VM_PAGE_BITS_ALL 0xu +typedef uint16_t vm_page_bits_t; +#elif PAGE_SIZE == 16384
Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3
On Fri, Nov 04, 2011 at 10:48:45AM -0500, Alan Cox wrote: On 11/04/2011 10:30, Kostik Belousov wrote: for (b = i = 0; i= PAGE_SIZE / DEV_BSIZE; ++i) { if (i == (PAGE_SIZE / DEV_BSIZE) || -(m-valid (1 i)) +(m-valid ((vm_page_bits_t)1 i)) ) { if (i b) { pmap_zero_page_area(m, While we're here, we might as well fix the old style bug above by moving the ) { to the proper place. Fixed, this does not warrant the universe restart. pgpxPx0PIpYaC.pgp Description: PGP signature
Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3
On Thu, Nov 03, 2011 at 12:40:08AM -0500, Alan Cox wrote: On 11/02/2011 05:32, Andriy Gapon wrote: [restored cc: to the original poster] As Bruce Evans has pointed to me privately [I am not sure why privately], there is already an example in i386 and amd64 atomic.h, where operations are inlined for a kernel build, but presented as real (external) functions for a module build. You can search e.g. sys/amd64/include/atomic.h for KLD_MODULE. I think that the same treatment could/should be applied to vm_page_*lock* operations defined in sys/vm/vm_page.h. *snip* Yes, it should be. There are without question legitimate reasons for a module to acquire a page lock. I agree. Also, I think that we should use the opportunity to also isolate the modules from the struct vm_page layout changes. As example, I converted nfsclient.ko. Patch is not tested. diff --git a/sys/nfsclient/nfs_bio.c b/sys/nfsclient/nfs_bio.c index 305c189..7264cd1 100644 --- a/sys/nfsclient/nfs_bio.c +++ b/sys/nfsclient/nfs_bio.c @@ -128,7 +128,7 @@ nfs_getpages(struct vop_getpages_args *ap) * can only occur at the file EOF. */ VM_OBJECT_LOCK(object); - if (pages[ap-a_reqpage]-valid != 0) { + if (vm_page_read_valid(pages[ap-a_reqpage]) != 0) { for (i = 0; i npages; ++i) { if (i != ap-a_reqpage) { vm_page_lock(pages[i]); @@ -198,16 +198,16 @@ nfs_getpages(struct vop_getpages_args *ap) /* * Read operation filled an entire page */ - m-valid = VM_PAGE_BITS_ALL; - KASSERT(m-dirty == 0, + vm_page_write_valid(m, VM_PAGE_BITS_ALL); + KASSERT(vm_page_read_dirty(m) == 0, (nfs_getpages: page %p is dirty, m)); } else if (size toff) { /* * Read operation filled a partial page. */ - m-valid = 0; + vm_page_write_valid(m, 0); vm_page_set_valid(m, 0, size - toff); - KASSERT(m-dirty == 0, + KASSERT(vm_page_read_dirty(m) == 0, (nfs_getpages: page %p is dirty, m)); } else { /* diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c index f14da4a..5b8b4e3 100644 --- a/sys/vm/vm_page.c +++ b/sys/vm/vm_page.c @@ -2677,6 +2677,66 @@ vm_page_test_dirty(vm_page_t m) vm_page_dirty(m); } +void +vm_page_lock_func(vm_page_t m, const char *file, int line) +{ + +#if LOCK_DEBUG 0 || defined(MUTEX_NOINLINE) + _mtx_lock_flags(vm_page_lockptr(m), 0, file, line); +#else + __mtx_lock(vm_page_lockptr(m), 0, file, line); +#endif +} + +void +vm_page_unlock_func(vm_page_t m, const char *file, int line) +{ + +#if LOCK_DEBUG 0 || defined(MUTEX_NOINLINE) + _mtx_unlock_flags(vm_page_lockptr(m), 0, file, line); +#else + __mtx_unlock(vm_page_lockptr(m), curthread, 0, file, line); +#endif +} + +int +vm_page_trylock_func(vm_page_t m, const char *file, int line) +{ + + return (_mtx_trylock(vm_page_lockptr(m), 0, file, line)); +} + +void +vm_page_lock_assert_func(vm_page_t m, int a, const char *file, int line) +{ + +#ifdef INVARIANTS + _mtx_assert(vm_page_lockptr(m), a, file, line); +#endif +} + +vm_page_bits_t +vm_page_read_dirty_func(vm_page_t m) +{ + + return (m-dirty); +} + +vm_page_bits_t +vm_page_read_valid_func(vm_page_t m) +{ + + return (m-valid); +} + +void +vm_page_write_valid_func(vm_page_t m, vm_page_bits_t v) +{ + + m-valid = v; +} + + int so_zerocp_fullpage = 0; /* diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index 23637bb..618ba2b 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -113,6 +113,21 @@ TAILQ_HEAD(pglist, vm_page); +#if PAGE_SIZE == 4096 +#define VM_PAGE_BITS_ALL 0xffu +typedef uint8_t vm_page_bits_t; +#elif PAGE_SIZE == 8192 +#define VM_PAGE_BITS_ALL 0xu +typedef uint16_t vm_page_bits_t; +#elif PAGE_SIZE == 16384 +#define VM_PAGE_BITS_ALL 0xu +typedef uint32_t vm_page_bits_t; +#elif PAGE_SIZE == 32768 +#define VM_PAGE_BITS_ALL 0xlu +typedef uint64_t vm_page_bits_t; +#endif + + struct vm_page { TAILQ_ENTRY(vm_page) pageq; /* queue info for FIFO queue or free list (Q) */ TAILQ_ENTRY(vm_page) listq; /* pages in same object (O) */ @@ -138,19 +153,8 @@ struct vm_page { /* NOTE that these must support one bit per DEV_BSIZE in a page!!! */ /* so, on normal X86 kernels, they must be at least 8 bits wide */ /* In reality, support for 32KB pages is not fully implemented. */ -#if PAGE_SIZE == 4096 - uint8_t valid; /* map of valid DEV_BSIZE chunks (O) */ - uint8_t dirty; /* map of dirty
Re: In-kernel API for tasks, which could wait?
On Sun, Oct 30, 2011 at 06:54:51PM +0400, Lev Serebryakov wrote: So, I have question: what should I do if I need to perofrm ONE action, which could block for some time (for example, open file or create ALQ)? I could create thread for this. But it looks strange and too heavy: create thread to call one function once. Maybe, kernel has some API to postpone such task to one, always-running idle thread? See taskqueue(9). On the other hand, waiting for the enqueued task to finish is itself the sleepable action. pgpei4uyxuJ1X.pgp Description: PGP signature
x86: how to get maximum possible CPU frequency ?
Assume we are running on the single-package X86 machine, how to answer the question: what is the possible maximum tsc frequency ? I read tsc_levels_changed(), is it the right way to query the max frequency for the general purpose driver ? If yes, could the code be made into the utility function ? BTW, there is some usage of the barriers in sysctl_machdep_tsc_freq() that I do not understand. When the read from tsc_freq is performed with the acquire semantic ? Why there are two store barriers when tsc_freq and tsc_timecounter.tc_frequency are written ? I would think that a single barrier on the last write is enough, but is it needed at all ? pgpG9zwXR7qbc.pgp Description: PGP signature
Re: getting the cpuid for a userspace process ?
On Tue, Oct 25, 2011 at 01:42:45PM -0400, John Baldwin wrote: On Tuesday, October 25, 2011 11:06:22 am Luigi Rizzo wrote: as the subject says... is there any way to get the current CPU id for a userspace process (of course, valid only at the time the function is called as the process might be arbitrarily moved while it runs) Not from userland, no. On x86 you can use cpuid to fetch the APIC ID, but that does not map 1:1 to FreeBSD cpu IDs. Not quite so. The kern.proc sysctls do provide oncpu and lastcpu information, which, I believe, is used by top. But this is very slow way to get cpu id. pgpjMU1cp088f.pgp Description: PGP signature
Re: 9.0 RC1/Clang / illegal instruction (Signal 4) in gengtype while building cc_tools on i586.
On Sun, Oct 23, 2011 at 10:24:12AM +0200, Roman Divacky wrote: Program received signal SIGILL, Illegal instruction. 0x08048b24 in do_typedef (s=0x80532bf CUMULATIVE_ARGS, pos=0x805e1a4) at /usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcc/gengtype.c:103 103 { (gdb) disas 0x08048b24 Dump of assembler code for function do_typedef: 0x08048b10 do_typedef+0: push %ebp 0x08048b11 do_typedef+1: mov%esp,%ebp 0x08048b13 do_typedef+3: push %ebx 0x08048b14 do_typedef+4: push %edi 0x08048b15 do_typedef+5: push %esi 0x08048b16 do_typedef+6: sub$0xc,%esp 0x08048b19 do_typedef+9: mov$0x805e1d4,%edi 0x08048b1e do_typedef+14: mov0x10(%ebp),%esi 0x08048b21 do_typedef+17: mov0x8(%ebp),%ebx 0x08048b24 do_typedef+20: nopw %cs:0x0(%eax,%eax,1) LLVM attempts to use an optimal nop sequence when writing N-byte nop, by using these nop instructions static const uint8_t Nops[10][10] = { // nop {0x90}, // xchg %ax,%ax {0x66, 0x90}, // nopl (%[re]ax) {0x0f, 0x1f, 0x00}, // nopl 0(%[re]ax) {0x0f, 0x1f, 0x40, 0x00}, // nopl 0(%[re]ax,%[re]ax,1) {0x0f, 0x1f, 0x44, 0x00, 0x00}, // nopw 0(%[re]ax,%[re]ax,1) {0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00}, // nopl 0L(%[re]ax) {0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00}, // nopl 0L(%[re]ax,%[re]ax,1) {0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}, // nopw 0L(%[re]ax,%[re]ax,1) {0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}, // nopw %cs:0L(%[re]ax,%[re]ax,1) {0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}, }; There's no checking for a supported CPU, is it so that AMD geode doesnt support any of these? Any other cpu that doesnt support these? If this is CPU dependant, I suggest to open a bug report upstream as it's a bug. Long nops are supported only on specific CPUs. Unconditional use of them is a plain bug, like unconditional use of cmovXX. pgpXKuJPtriZD.pgp Description: PGP signature
Re: 9.0-RC1 panic in tcp_input: negative winow.
On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote: On Sun, Oct 23, 2011 at 12:35:15PM +1100, Lawrence Stewart wrote: On 10/22/11 19:49, Pawel Jakub Dawidek wrote: The panic message says: panic: tcp_input negative window: tp 0xfe007763e000 rcv_nxt 3718269252 rcv_adv 3718268291 I only have picture of the backtrace: http://people.freebsd.org/~pjd/misc/panic_negative_window.jpg ewww that is not good. Can you give us any more information about the machine and what it's doing? Is it terminating TCP connections from the internet at large or only local LAN (i.e. is there likely to be packet loss happening)? Are you doing TSO or LRO? Do you have any non-default tuning in place? It is my local file server. It is doing NFS and AFP over LAN and also downloads files from the internet. It is triggered after few hours. I changed the KASSERT() into printf() and added printing 'win' variable and this is what got logged during the night: 05:16:24 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107827269 rcv_adv 1107826256 win=242 05:16:29 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107833451 rcv_adv 1107832977 win=880 05:16:41 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107849563 rcv_adv 1107848860 win=639 05:20:02 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108108230 rcv_adv 1108107331 win=567 05:24:30 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108433302 rcv_adv 1108432272 win=974 05:24:46 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108450385 rcv_adv 1108450060 win=751 05:26:44 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108574818 rcv_adv 1108573851 win=71 05:28:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108654103 rcv_adv 1108653166 win=0 05:28:43 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108692396 rcv_adv 1108691451 win=0 05:30:06 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108781258 rcv_adv 1108780372 win=235 05:35:05 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109067578 rcv_adv 1109067335 win=663 05:37:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109180403 rcv_adv 1109179411 win=0 05:41:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109428265 rcv_adv 1109427375 win=170 And the systems seems to be fine. I'm happy to test patches, but one round would take 24h. My suggestion would be that if we won't be able to fix it before 9.0, we should turn this assertion off, as the system seems to be able to recover. Shipped kernels have all assertions turned off. pgpKtmVl4jzPD.pgp Description: PGP signature
Re: [RFC] Enable nxstack by default
On Mon, Oct 17, 2011 at 09:30:56PM +0200, Oliver Pinter wrote: Hi all! I think, it's the time to enable the nxstack feature. Any comments, pros, cons? I dragged the change long enough for it to miss the 9.0. After the 9.0 is released, I will flip the switch with the following change. diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c index 8455f48..926fe64 100644 --- a/sys/kern/imgact_elf.c +++ b/sys/kern/imgact_elf.c @@ -118,7 +118,12 @@ static int elf_legacy_coredump = 0; SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW, elf_legacy_coredump, 0, ); -static int __elfN(nxstack) = 0; +int __elfN(nxstack) = +#if defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit */ + 1; +#else + 0; +#endif SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO, nxstack, CTLFLAG_RW, __elfN(nxstack), 0, __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) : enable non-executable stack); diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c index 7500462..0e27351 100644 --- a/sys/powerpc/aim/mmu_oea64.c +++ b/sys/powerpc/aim/mmu_oea64.c @@ -1445,6 +1445,8 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait) return (void *)va; } +extern int elf32_nxstack; + void moea64_init(mmu_t mmu) { @@ -1464,6 +1466,8 @@ moea64_init(mmu_t mmu) uma_zone_set_allocf(moea64_mpvo_zone,moea64_uma_page_alloc); } + elf32_nxstack = 1; + moea64_initialized = TRUE; } diff --git a/sys/powerpc/booke/machdep.c b/sys/powerpc/booke/machdep.c index c2b5e6f..82a37e1 100644 --- a/sys/powerpc/booke/machdep.c +++ b/sys/powerpc/booke/machdep.c @@ -192,6 +192,8 @@ void print_kernel_section_addr(void); void print_kenv(void); u_int booke_init(uint32_t, uint32_t); +extern int elf32_nxstack; + static void cpu_e500_startup(void *dummy) { @@ -227,6 +229,9 @@ cpu_e500_startup(void *dummy) /* Set up buffers, so they can be used to read disk labels. */ bufinit(); vm_pager_bufferinit(); + + /* Cpu supports execution permissions on the pages. */ + elf32_nxstack = 1; } static char * pgpYBjVPeBl7n.pgp Description: PGP signature
Re: freebsd-9.0 smartmontools and ada devices
On Tue, Oct 18, 2011 at 11:02:42AM +0200, John Hay wrote: On Tue, Oct 18, 2011 at 09:39:24AM +0200, John Hay wrote: Hi Guys, I have upgraded my desktop from 8.2-stable to 9.0-RC1 (from source), using a GENERIC kernel. I have installed the smartmontools-5.41_3 package from a mirror and found that smartmontools does not like the ada devices anymore. Previously (8.2) I had a GENERIC kernel, with ahci loaded in loader.conf. There an older smartmontools (5.40) worked without a problem on the ada devices. The output of smartctl looks like this: # dolphin# smartctl -a /dev/ada0 smartctl 5.41 2011-06-09 r3365 [FreeBSD 9.0-RC1 amd64] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net error sending CAMIOCOMMAND ioctl: Inappropriate ioctl for device Unable to get CAM device list /dev/ada0: Unable to detect device type Smartctl: please specify device type with the -d option. Use smartctl -h to get a usage summary # Just to follow up on myself. :-( I have build smartmontools from ports and even though it is the same version, it works. So for me the package amd64/packages-9-current/All/smartmontools-5.41_3.tbz did not work, but the port does. CAM ABI was changed right before RC1. The issue was mentioned in the Ken' announcement. The packages were obviously built with the old headers. pgpCP3eHnNxYY.pgp Description: PGP signature
Re: [RFC] Enable nxstack by default
On Tue, Oct 18, 2011 at 01:06:27PM -0400, Arnaud Lacombe wrote: Hi, On Tue, Oct 18, 2011 at 12:53 PM, Oliver Pinter oliver.p...@gmail.com wrote: On 10/18/11, Arnaud Lacombe lacom...@gmail.com wrote: Hi, On Tue, Oct 18, 2011 at 11:44 AM, Garrett Cooper yaneg...@gmail.com wrote: On Tue, 18 Oct 2011, Arnaud Lacombe wrote: Hi, On Tue, Oct 18, 2011 at 5:07 AM, Kostik Belousov kostik...@gmail.com wrote: On Mon, Oct 17, 2011 at 09:30:56PM +0200, Oliver Pinter wrote: Hi all! I think, it's the time to enable the nxstack feature. Any comments, pros, cons? I dragged the change long enough for it to miss the 9.0. After the 9.0 is released, I will flip the switch with the following change. diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c index 8455f48..926fe64 100644 --- a/sys/kern/imgact_elf.c +++ b/sys/kern/imgact_elf.c @@ -118,7 +118,12 @@ static int elf_legacy_coredump = 0; SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW, elf_legacy_coredump, 0, ); -static int __elfN(nxstack) = 0; +int __elfN(nxstack) = +#if defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit */ Why leaving 32bits x86 CPU supporting the NX feature behind ? Most likely because it was assumed that i386 doesn't fully support it. According to ye great Wikipedia, NX support didn't roll into i386 until Prescott, which was pretty late in the non-64-bit capable family of CPUs, as its successor -- Conroe -- was 64-bit. Intel detuned some of the early Dual Core Pentiums, e.g. the Yonahs to not talk 64-bit. Not sure about AMD. There are probably more details in binutils, gcc, etc, that I'm missing and Kostik can expound on. NX support is advertised in the cpuid flags, just add the logic to handle this interface. Kostik's patch is just incomplete, but he's got a commit bit so he can commit it as-is, as he will. If nonexec_stack becomes the default, it should be on every CPU supporting the feature, not just the low-hanging one. - Arnaud the NX detection code already implemented in i386, but this feature required PAE: yes, this is the conclusion I reached too. But this does not change the fact that the VM should know about that, and this is missing from Kostik's patch. I guess the first hunk should read: @@ -118,7 +118,12 @@ static int elf_legacy_coredump = 0; SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW, elf_legacy_coredump, 0, ); -static int __elfN(nxstack) = 0; +int __elfN(nxstack) = +#if defined(PAE) || defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit */ + 1; +#else + 0; +#endif SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO, nxstack, CTLFLAG_RW, __elfN(nxstack), 0, __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) : enable non-executable stack); Your patch is not right, it will cause even more user confusion. The presence of the PAE kernel does not imply that CPU supports nx. Below is the updated patch that turns on nxstack by default for the PAE kernels on NX-capable CPUs. Note that i386 usermode fully supports the PT_GNU_STACK annotations and cares about not enabling nx on stack pages unneccessary, but my main target was compat32 on amd64. The fact that nxstack is not enabled by default does not prevent administrator from manually enabling the feature. Since you worried so much about PAE case, I sincerely expect that you will test the change. Thank you in advance. diff --git a/sys/i386/i386/initcpu.c b/sys/i386/i386/initcpu.c index c2daf54..ec77adb 100644 --- a/sys/i386/i386/initcpu.c +++ b/sys/i386/i386/initcpu.c @@ -650,6 +650,8 @@ enable_sse(void) #endif } +extern int elf32_nxstack; + void initializecpu(void) { @@ -739,6 +741,7 @@ initializecpu(void) msr = rdmsr(MSR_EFER) | EFER_NXE; wrmsr(MSR_EFER, msr); pg_nx = PG_NX; + elf32_nxstack = 1; } #endif break; diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c index 8455f48..926fe64 100644 --- a/sys/kern/imgact_elf.c +++ b/sys/kern/imgact_elf.c @@ -118,7 +118,12 @@ static int elf_legacy_coredump = 0; SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW, elf_legacy_coredump, 0, ); -static int __elfN(nxstack) = 0; +int __elfN(nxstack) = +#if defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit */ + 1; +#else + 0; +#endif SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO, nxstack, CTLFLAG_RW, __elfN(nxstack), 0, __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) : enable non-executable stack); diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c index 7500462..0e27351 100644 --- a/sys/powerpc/aim/mmu_oea64.c +++ b/sys/powerpc/aim/mmu_oea64.c @@ -1445,6 +1445,8 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t
Re: pmap_qenter() - the page *must* be wired - is violated
On Thu, Oct 06, 2011 at 01:45:16PM +0200, Svatopluk Kraus wrote: On Wed, Oct 5, 2011 at 4:53 PM, Kostik Belousov kostik...@gmail.com wrote: On Wed, Oct 05, 2011 at 02:28:01PM +0200, Svatopluk Kraus wrote: Hi, I found out that on a few places pmap_qenter() is called on pages which are not wired. For example, in the following functions, when vm_pager_get_pages() is called, the pages are not wired: exec_map_first_page() in sys/kern/kern_exec.c vm_fault_hold() in sys/vm/vm_fault.c vm_imgact_hold_page() in sys/vm/vm_glue.c vm_object_populate() in sys/vm/vm_object.c mdstart_swap() in sys/dev/md/md.c Is the rule violated or the rule should be changed? Lets first discuss where did you found the calls to pmap_qenter(). Can you point out exact line numbers of the calls to pmap_qenter() that you consider problematic ? In fact, the requirement probably shall be 'no swapout allowed'. E.g., the busy page is fully qualified to be used together with pmap_qenter(). Well, I just follow description above pmap_qenter() blindly and test for page wire_count inside the function. All function calls, I mentioned before, are OK as pages are VPO_BUSY'd. Thanks for your explanation. Still, I am not aware of any (direct) calls to pmap_qenter in the mentioned functions. pgpY4rX1RvV03.pgp Description: PGP signature
Re: cvsup broken on amd64?
On Wed, Oct 05, 2011 at 03:21:45PM -0700, David O'Brien wrote: On Fri, Sep 09, 2011 at 06:00:02PM +0300, Kostik Belousov wrote: --- libs/m3core/src/thread/POSIX/ThreadPosix.m3.orig2011-09-09 17:58:12.867431639 +0300 +++ libs/m3core/src/thread/POSIX/ThreadPosix.m3 2011-09-09 17:58:30.380428486 +0300 @@ -180,7 +180,7 @@ pausedThreads : T; selected_interval:= UTime{0, 100 * 1000}; - defaultStackSize := 3000; + defaultStackSize := 1; This might not be a large enough value (depending on the unit of measure). I synced tzdata+tzcode at $WORK and we found the amount of stack used by tzload() alone is now quite large -- 41k on ARM. Please see r225677. pgpjpUkg0NdSp.pgp Description: PGP signature
Re: pmap_qenter() - the page *must* be wired - is violated
On Wed, Oct 05, 2011 at 02:28:01PM +0200, Svatopluk Kraus wrote: Hi, I found out that on a few places pmap_qenter() is called on pages which are not wired. For example, in the following functions, when vm_pager_get_pages() is called, the pages are not wired: exec_map_first_page() in sys/kern/kern_exec.c vm_fault_hold() in sys/vm/vm_fault.c vm_imgact_hold_page() in sys/vm/vm_glue.c vm_object_populate() in sys/vm/vm_object.c mdstart_swap() in sys/dev/md/md.c Is the rule violated or the rule should be changed? Lets first discuss where did you found the calls to pmap_qenter(). Can you point out exact line numbers of the calls to pmap_qenter() that you consider problematic ? In fact, the requirement probably shall be 'no swapout allowed'. E.g., the busy page is fully qualified to be used together with pmap_qenter(). pgpbn7etyrLfS.pgp Description: PGP signature
Re: st_dev and st_ino for pipes
On Tue, Oct 04, 2011 at 05:37:27PM +1100, Peter Jeremy wrote: On 2011-Oct-03 01:04:05 +0300, Kostik Belousov kostik...@gmail.com wrote: Our implementation of pipes does not provide useful values for st_dev and st_ino when stat(2) is done on an anonymous pipe. It was noted by the ... Patch below implements the requirement, by the cost of the small overhead at the pipe creation time, and slightly bigger cost at the destruction. Does it need to be so complex? This information isn't needed by the kernel and, to be meaningful, all that is required is that the (st_dev,st_ino) pair is unique within the system. Given this, wouldn't it be sufficient to fake up a st_dev and then just make st_ino be a counter that starts from 0 and increments (atomically?) on every new pipe? No need to retain state or free anything when the pipe is destroyed. (If necessary, pick a new fake st_dev when st_ino wraps). Yes, you described the problem. The (st_dev, st_ino) pair must be globally unique in system, not only for the pipes, but for whole domain of file descriptors. This is the reason that I allocate a value for pipe st_dev using the same devfs mechanism as it is done for device numbers. Using simple incrementing counter for pipe inodes, together with allocating yet another st_dev gives essentially the same complications wrt KPI, and feels unclean. --- a/sys/kern/sys_pipe.c +++ b/sys/kern/sys_pipe.c ... +static ino_t pipedev_ino; .. +ub-st_dev = pipedev_ino; st_dev is a dev_t and hence pipedev_ino (which seems misnamed to me) should probably be dev_t rather than ino_t Fixed. Thank you. pgpvSZLW4YilD.pgp Description: PGP signature
Re: x220 notes
On Tue, Oct 04, 2011 at 12:26:27PM -0700, matt wrote: On Tue, 4 Oct 2011 02:21:59 -0400 (EDT) Benjamin Kaduk ka...@mit.edu wrote: On Mon, 3 Oct 2011, matt wrote: Ultimately, I think if we can set backlight, we can fix the screen after resume...I think it's just the backlight is low/off on resume... Can you use a very strong fronglight to see if this is actually the case? (Illumination angle may be important as well.) The state of the liquid crystal polarizing switches should be visible under sufficient illumination, and whether they change with input would be pretty definitive. I actually used to operate my iBook G4 in frontlight mode in bright sunlight for a while (until the wifi card flaked out, tying it to my desk). -Ben Kaduk It's the IPS display, so there's a bit of glare...not much visible. I can't tell for sure... Friend had idea of trying to resume under X with external VGA display...both remained off, however I am using the Intel GPU experimental drivers (To make a note, resume is broken without them as well). Suspend/resume is not implemented in the KMS driver (yet). I can't get vbetool from ports to do anything with the display other than cause it to turn off. I can type commands just fine under X or console with the display off...so I have tried xset, vbetool. hw.acpi.reset_video causes a majorly bad reboot loop on resume with a flashing thinklight. Matt ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org pgpFzmFrenynw.pgp Description: PGP signature
Re: x220 notes
On Tue, Oct 04, 2011 at 02:38:06PM -0700, matt wrote: Friend had idea of trying to resume under X with external VGA display...both remained off, however I am using the Intel GPU experimental drivers (To make a note, resume is broken without them as well). Suspend/resume is not implemented in the KMS driver (yet). On x220 problem is deeper than KMS, since it exists without KMS patches applied. Thank you for graphics support! Even compiz works, although second output stuff is overlapped when extending on this No idea what do you mean. one...also lots of update_fbc, but this is probably known. Write 0 into sysctl hw.dri.debug to get rid of KMS debugging output. There are so far about 4 patches required agains HEAD for this machine to be usable...KMS, acpi-ibm patches, iwn patches, and currently newvers.sh. I have done testing with all these rolled back, problem is still there...I think it could be UEFI/ACPI related? vbetool, dpms.ko, acpi_video, none can talk to the monitor. Does anyone know if system/6636 was fixed for OpenBSD? Matt pgpmwf2l1fljv.pgp Description: PGP signature
st_dev and st_ino for pipes
Our implementation of pipes does not provide useful values for st_dev and st_ino when stat(2) is done on an anonymous pipe. It was noted by the people outside the project, e.g. Perl contains a workaround in one of its modules, submitted by Debian/kFreeBSD developers, see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=537555 and the commit 16f708c9bc0dc48713b200 in the Perl git. I think this is a non-conformance, since SUSv4 explicitely states in the description of stat(2) For all other file types defined in this volume of POSIX.1-2008, the structure members st_mode, st_ino, st_dev, st_uid, st_gid, st_atim, st_ctim, and st_mtim shall have meaningful values Patch below implements the requirement, by the cost of the small overhead at the pipe creation time, and slightly bigger cost at the destruction. Any comments ? diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c index a2d3222..a111527 100644 --- a/sys/fs/devfs/devfs_devs.c +++ b/sys/fs/devfs/devfs_devs.c @@ -171,8 +171,7 @@ devfs_free(struct cdev *cdev) cdp = cdev2priv(cdev); if (cdev-si_cred != NULL) crfree(cdev-si_cred); - if (cdp-cdp_inode 0) - free_unr(devfs_inos, cdp-cdp_inode); + devfs_free_cdp_inode(cdp-cdp_inode); if (cdp-cdp_maxdirent 0) free(cdp-cdp_dirents, M_DEVFS2); free(cdp, M_CDEVP); @@ -394,7 +393,7 @@ devfs_delete(struct devfs_mount *dm, struct devfs_dirent *de, int flags) mac_devfs_destroy(de); #endif if (de-de_inode DEVFS_ROOTINO) { - free_unr(devfs_inos, de-de_inode); + devfs_free_cdp_inode(de-de_inode); de-de_inode = 0; } if (DEVFS_DE_DROP(de)) @@ -685,6 +684,21 @@ devfs_destroy(struct cdev *dev) devfs_generation++; } +ino_t +devfs_alloc_cdp_inode(void) +{ + + return (alloc_unr(devfs_inos)); +} + +void +devfs_free_cdp_inode(ino_t ino) +{ + + if (ino 0) + free_unr(devfs_inos, ino); +} + static void devfs_devs_init(void *junk __unused) { diff --git a/sys/kern/sys_pipe.c b/sys/kern/sys_pipe.c index b7ae521..2edecf2 100644 --- a/sys/kern/sys_pipe.c +++ b/sys/kern/sys_pipe.c @@ -93,6 +93,7 @@ __FBSDID($FreeBSD$); #include sys/param.h #include sys/systm.h +#include sys/conf.h #include sys/fcntl.h #include sys/file.h #include sys/filedesc.h @@ -224,6 +225,8 @@ static int pipe_zone_init(void *mem, int size, int flags); static voidpipe_zone_fini(void *mem, int size); static uma_zone_t pipe_zone; +static struct unrhdr *pipeino_unr; +static ino_t pipedev_ino; SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_ANY, pipeinit, NULL); @@ -235,6 +238,10 @@ pipeinit(void *dummy __unused) pipe_zone_ctor, NULL, pipe_zone_init, pipe_zone_fini, UMA_ALIGN_PTR, 0); KASSERT(pipe_zone != NULL, (pipe_zone not initialized)); + pipeino_unr = new_unrhdr(1, INT32_MAX, NULL); + KASSERT(pipeino_unr != NULL, (pipe fake inodes not initialized)); + pipedev_ino = devfs_alloc_cdp_inode(); + KASSERT(pipedev_ino 0, (pipe dev inode not initialized)); } static int @@ -562,6 +569,12 @@ pipe_create(pipe, backing) /* If we're not backing this pipe, no need to do anything. */ error = 0; } + if (error == 0) { + pipe-pipe_ino = alloc_unr(pipeino_unr); + if (pipe-pipe_ino == -1) + /* pipeclose will clear allocated kva */ + error = ENOMEM; + } return (error); } @@ -1408,9 +1421,10 @@ pipe_stat(fp, ub, active_cred, td) ub-st_ctim = pipe-pipe_ctime; ub-st_uid = fp-f_cred-cr_uid; ub-st_gid = fp-f_cred-cr_gid; + ub-st_dev = pipedev_ino; + ub-st_ino = pipe-pipe_ino; /* -* Left as 0: st_dev, st_ino, st_nlink, st_rdev, st_flags, st_gen. -* XXX (st_dev, st_ino) should be unique. +* Left as 0: st_nlink, st_rdev, st_flags, st_gen. */ return (0); } @@ -1463,6 +1477,7 @@ pipeclose(cpipe) { struct pipepair *pp; struct pipe *ppipe; + ino_t ino; KASSERT(cpipe != NULL, (pipeclose: cpipe == NULL)); @@ -1521,6 +1536,12 @@ pipeclose(cpipe) knlist_destroy(cpipe-pipe_sel.si_note); /* +* Postpone the destroy of the fake inode number allocated for +* our end, until pipe mtx is unlocked. +*/ + ino = cpipe-pipe_ino; + + /* * If both endpoints are now closed, release the memory for the * pipe pair. If not, unlock. */ @@ -1532,6 +1553,9 @@ pipeclose(cpipe) uma_zfree(pipe_zone, cpipe-pipe_pair); } else PIPE_UNLOCK(cpipe); + + if (ino 0) + free_unr(pipeino_unr, cpipe-pipe_ino); } /*ARGSUSED*/ diff --git a/sys/sys/conf.h b/sys/sys/conf.h index 08e1582..7acd2e0 100644 --- a/sys/sys/conf.h +++ b/sys/sys/conf.h @@
Re: stable/9 r225827 i386 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero
On Thu, Sep 29, 2011 at 02:52:31PM +0300, Alexandr Kovalenko wrote: Hello! I'm running 9.0-BETA3 (r225827) and now rebuilding all my 1215 ports (I've upgraded from 8.2). I'm getting panic. Is it known problem/already fixed somewhere? FreeBSD mile.xxx.ua 9.0-BETA3 FreeBSD 9.0-BETA3 #0 r225827: Wed Sep 28 17:11:17 EEST 2011 r...@mile.xxx.ua:/usr/obj/usr/src/sys/mile-9 i386 Unread portion of the kernel message buffer: panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero cpuid = 1 Uptime: 16h6m53s Physical memory: 1904 MB Dumping 367 MB: 352 336 320 304 288 272 256 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16 #0 doadump (textdump=1) at pcpu.h:244 #1 0xc071e5cb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:442 #2 0xc071e82b in panic (fmt=Variable fmt is not available. ) at /usr/src/sys/kern/kern_shutdown.c:607 #3 0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0) at /usr/src/sys/vm/vm_page.c:1905 Please do frame 2, then p/x *m and show the result. #4 0xc0796b80 in vfs_vmio_release (bp=0xde8bcbf4) at /usr/src/sys/kern/vfs_bio.c:1638 #5 0xc0798813 in getnewbuf (vp=0xc6ea3550, slpflag=0, slptimeo=0, size=16384, maxsize=16384, gbflags=0) at /usr/src/sys/kern/vfs_bio.c:1949 #6 0xc0799f2a in getblk (vp=0xc6ea3550, blkno=2520, size=16384, slpflag=0, slptimeo=0, flags=Variable flags is not available. ) at /usr/src/sys/kern/vfs_bio.c:2788 #7 0xc079d49c in cluster_rbuild (vp=0xc6ea3550, filesize=44505088, lbn=2520, blkno=1209440, size=16384, run=Variable run is not available. ) at /usr/src/sys/kern/vfs_cluster.c:332 #8 0xc079e145 in cluster_read (vp=0xc6ea3550, filesize=44505088, lblkno=2520, size=16384, cred=0x0, totread=1024, seqcount=7, bpp=0xf5824b60) at /usr/src/sys/kern/vfs_cluster.c:254 #9 0xc0934cf5 in ffs_read (ap=0xf5824bac) at /usr/src/sys/ufs/ffs/ffs_vnops.c:514 #10 0xc09ccb92 in VOP_READ_APV (vop=0xc0aa6a80, a=0xf5824bac) at vnode_if.c:887 #11 0xc07c1120 in vn_read (fp=0xc5474508, uio=0xf5824c48, active_cred=0xc56a4d80, flags=1, td=0xc5b76b80) at vnode_if.h:384 #12 0xc076380e in dofileread (td=0xc5b76b80, fd=3, fp=0xc5474508, auio=0xf5824c48, offset=41189376, flags=1) at file.h:254 #13 0xc07639f5 in kern_preadv (td=0xc5b76b80, fd=3, auio=0xf5824c48, offset=41189376) at /usr/src/sys/kern/sys_generic.c:288 #14 0xc0763b0d in sys_pread (td=0xc5b76b80, uap=0xf5824cec) at /usr/src/sys/kern/sys_generic.c:189 #15 0xc09accf5 in syscall (frame=0xf5824d28) at subr_syscall.c:131 #16 0xc0996db1 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:266 #17 0x0033 in ?? () Previous frame inner to this frame (corrupt stack?) -- Alexandr Kovalenko http://uafug.org.ua/ ___ freebsd-sta...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org pgpcZFnkVkBAg.pgp Description: PGP signature
Re: stable/9 r225827 i386 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero
On Thu, Sep 29, 2011 at 03:47:19PM +0300, Alexandr Kovalenko wrote: On Thu, Sep 29, 2011 at 3:30 PM, Kostik Belousov kostik...@gmail.com wrote: On Thu, Sep 29, 2011 at 02:52:31PM +0300, Alexandr Kovalenko wrote: Hello! I'm running 9.0-BETA3 (r225827) and now rebuilding all my 1215 ports (I've upgraded from 8.2). I'm getting panic. Is it known problem/already fixed somewhere? FreeBSD mile.xxx.ua 9.0-BETA3 FreeBSD 9.0-BETA3 #0 r225827: Wed Sep 28 17:11:17 EEST 2011 r...@mile.xxx.ua:/usr/obj/usr/src/sys/mile-9 i386 Unread portion of the kernel message buffer: panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero cpuid = 1 Uptime: 16h6m53s Physical memory: 1904 MB Dumping 367 MB: 352 336 320 304 288 272 256 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16 #0 doadump (textdump=1) at pcpu.h:244 #1 0xc071e5cb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:442 #2 0xc071e82b in panic (fmt=Variable fmt is not available. ) at /usr/src/sys/kern/kern_shutdown.c:607 #3 0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0) at /usr/src/sys/vm/vm_page.c:1905 Please do frame 2, then p/x *m and show the result. (kgdb) frame 2 frame 3, sorry. p/x *(struct vm_page *)0xc2a38dc8 will do it as well. #2 0xc071e82b in panic (fmt=Variable fmt is not available.) at /usr/src/sys/kern/kern_shutdown.c:607 607 kern_reboot(bootopt); (kgdb) p/x *m No symbol m in current context. #4 0xc0796b80 in vfs_vmio_release (bp=0xde8bcbf4) at /usr/src/sys/kern/vfs_bio.c:1638 #5 0xc0798813 in getnewbuf (vp=0xc6ea3550, slpflag=0, slptimeo=0, size=16384, maxsize=16384, gbflags=0) at /usr/src/sys/kern/vfs_bio.c:1949 #6 0xc0799f2a in getblk (vp=0xc6ea3550, blkno=2520, size=16384, slpflag=0, slptimeo=0, flags=Variable flags is not available. ) at /usr/src/sys/kern/vfs_bio.c:2788 #7 0xc079d49c in cluster_rbuild (vp=0xc6ea3550, filesize=44505088, lbn=2520, blkno=1209440, size=16384, run=Variable run is not available. ) at /usr/src/sys/kern/vfs_cluster.c:332 #8 0xc079e145 in cluster_read (vp=0xc6ea3550, filesize=44505088, lblkno=2520, size=16384, cred=0x0, totread=1024, seqcount=7, bpp=0xf5824b60) at /usr/src/sys/kern/vfs_cluster.c:254 #9 0xc0934cf5 in ffs_read (ap=0xf5824bac) at /usr/src/sys/ufs/ffs/ffs_vnops.c:514 #10 0xc09ccb92 in VOP_READ_APV (vop=0xc0aa6a80, a=0xf5824bac) at vnode_if.c:887 #11 0xc07c1120 in vn_read (fp=0xc5474508, uio=0xf5824c48, active_cred=0xc56a4d80, flags=1, td=0xc5b76b80) at vnode_if.h:384 #12 0xc076380e in dofileread (td=0xc5b76b80, fd=3, fp=0xc5474508, auio=0xf5824c48, offset=41189376, flags=1) at file.h:254 #13 0xc07639f5 in kern_preadv (td=0xc5b76b80, fd=3, auio=0xf5824c48, offset=41189376) at /usr/src/sys/kern/sys_generic.c:288 #14 0xc0763b0d in sys_pread (td=0xc5b76b80, uap=0xf5824cec) at /usr/src/sys/kern/sys_generic.c:189 #15 0xc09accf5 in syscall (frame=0xf5824d28) at subr_syscall.c:131 #16 0xc0996db1 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:266 #17 0x0033 in ?? () Previous frame inner to this frame (corrupt stack?) -- Alexandr Kovalenko http://uafug.org.ua/ ___ freebsd-sta...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org -- Alexandr Kovalenko http://uafug.org.ua/ pgplpD93nm6tL.pgp Description: PGP signature
Re: stable/9 r225827 i386 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero
On Thu, Sep 29, 2011 at 03:51:53PM +0300, Alexandr Kovalenko wrote: 2011/9/29 Kostik Belousov kostik...@gmail.com: On Thu, Sep 29, 2011 at 03:47:19PM +0300, Alexandr Kovalenko wrote: On Thu, Sep 29, 2011 at 3:30 PM, Kostik Belousov kostik...@gmail.com wrote: On Thu, Sep 29, 2011 at 02:52:31PM +0300, Alexandr Kovalenko wrote: Hello! I'm running 9.0-BETA3 (r225827) and now rebuilding all my 1215 ports (I've upgraded from 8.2). I'm getting panic. Is it known problem/already fixed somewhere? Do you use custom kernel config ? Is there a chance you have ZERO_COPY_SOCKETS option enabled ? FreeBSD mile.xxx.ua 9.0-BETA3 FreeBSD 9.0-BETA3 #0 r225827: Wed Sep 28 17:11:17 EEST 2011 r...@mile.xxx.ua:/usr/obj/usr/src/sys/mile-9 i386 Unread portion of the kernel message buffer: panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero cpuid = 1 Uptime: 16h6m53s Physical memory: 1904 MB Dumping 367 MB: 352 336 320 304 288 272 256 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16 #0 doadump (textdump=1) at pcpu.h:244 #1 0xc071e5cb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:442 #2 0xc071e82b in panic (fmt=Variable fmt is not available. ) at /usr/src/sys/kern/kern_shutdown.c:607 #3 0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0) at /usr/src/sys/vm/vm_page.c:1905 Please do frame 2, then p/x *m and show the result. (kgdb) frame 2 frame 3, sorry. p/x *(struct vm_page *)0xc2a38dc8 will do it as well. (kgdb) frame 3 #3 0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0) at /usr/src/sys/vm/vm_page.c:1905 1905panic(vm_page_unwire: page %p's wire count is zero, m); (kgdb) p/x *(struct vm_page *)0xc2a38dc8 $1 = {pageq = {tqe_next = 0xc2a38e10, tqe_prev = 0xc282a2b0}, listq = {tqe_next = 0xc2a38e10, tqe_prev = 0xc282a2b8}, left = 0x0, right = 0x0, object = 0xc5725770, pindex = 0xbd3, phys_addr = 0x56a32000, md = {pv_list = {tqh_first = 0xc3cc6418, tqh_last = 0xc3cc641c}, pat_mode = 0x6}, queue = 0x1, segind = 0x2, hold_count = 0x0, order = 0xb, pool = 0x0, cow = 0x0, wire_count = 0x0, aflags = 0x3, flags = 0x0, oflags = 0x0, act_count = 0x5, busy = 0x0, valid = 0xff, dirty = 0xff} Please show the output of p *(struct vm_object *)0xc5725770 from kgdb. #2 0xc071e82b in panic (fmt=Variable fmt is not available.) at /usr/src/sys/kern/kern_shutdown.c:607 607 kern_reboot(bootopt); (kgdb) p/x *m No symbol m in current context. #4 0xc0796b80 in vfs_vmio_release (bp=0xde8bcbf4) at /usr/src/sys/kern/vfs_bio.c:1638 #5 0xc0798813 in getnewbuf (vp=0xc6ea3550, slpflag=0, slptimeo=0, size=16384, maxsize=16384, gbflags=0) at /usr/src/sys/kern/vfs_bio.c:1949 #6 0xc0799f2a in getblk (vp=0xc6ea3550, blkno=2520, size=16384, slpflag=0, slptimeo=0, flags=Variable flags is not available. ) at /usr/src/sys/kern/vfs_bio.c:2788 #7 0xc079d49c in cluster_rbuild (vp=0xc6ea3550, filesize=44505088, lbn=2520, blkno=1209440, size=16384, run=Variable run is not available. ) at /usr/src/sys/kern/vfs_cluster.c:332 #8 0xc079e145 in cluster_read (vp=0xc6ea3550, filesize=44505088, lblkno=2520, size=16384, cred=0x0, totread=1024, seqcount=7, bpp=0xf5824b60) at /usr/src/sys/kern/vfs_cluster.c:254 #9 0xc0934cf5 in ffs_read (ap=0xf5824bac) at /usr/src/sys/ufs/ffs/ffs_vnops.c:514 #10 0xc09ccb92 in VOP_READ_APV (vop=0xc0aa6a80, a=0xf5824bac) at vnode_if.c:887 #11 0xc07c1120 in vn_read (fp=0xc5474508, uio=0xf5824c48, active_cred=0xc56a4d80, flags=1, td=0xc5b76b80) at vnode_if.h:384 #12 0xc076380e in dofileread (td=0xc5b76b80, fd=3, fp=0xc5474508, auio=0xf5824c48, offset=41189376, flags=1) at file.h:254 #13 0xc07639f5 in kern_preadv (td=0xc5b76b80, fd=3, auio=0xf5824c48, offset=41189376) at /usr/src/sys/kern/sys_generic.c:288 #14 0xc0763b0d in sys_pread (td=0xc5b76b80, uap=0xf5824cec) at /usr/src/sys/kern/sys_generic.c:189 #15 0xc09accf5 in syscall (frame=0xf5824d28) at subr_syscall.c:131 #16 0xc0996db1 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:266 #17 0x0033 in ?? () Previous frame inner to this frame (corrupt stack?) -- Alexandr Kovalenko http://uafug.org.ua/ ___ freebsd-sta...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org -- Alexandr Kovalenko http://uafug.org.ua/ -- Alexandr Kovalenko http://uafug.org.ua/ pgpPkSqzkhr3F.pgp Description: PGP signature
Re: stable/9 r225827 i386 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero
On Thu, Sep 29, 2011 at 04:12:19PM +0300, Alexandr Kovalenko wrote: 2011/9/29 Kostik Belousov kostik...@gmail.com: On Thu, Sep 29, 2011 at 03:51:53PM +0300, Alexandr Kovalenko wrote: 2011/9/29 Kostik Belousov kostik...@gmail.com: On Thu, Sep 29, 2011 at 03:47:19PM +0300, Alexandr Kovalenko wrote: On Thu, Sep 29, 2011 at 3:30 PM, Kostik Belousov kostik...@gmail.com wrote: On Thu, Sep 29, 2011 at 02:52:31PM +0300, Alexandr Kovalenko wrote: Hello! I'm running 9.0-BETA3 (r225827) and now rebuilding all my 1215 ports (I've upgraded from 8.2). I'm getting panic. Is it known problem/already fixed somewhere? Do you use custom kernel config ? Is there a chance you have ZERO_COPY_SOCKETS option enabled ? Yes, ZERO_COPY_SOCKETS is there. Ok, this is the cause. Remove it. I asked for some additional data below, which you ignored, but I believe that I will not see anything new there, after we found the ZERO_COPY_SOCKETS in kernel config. FreeBSD mile.xxx.ua 9.0-BETA3 FreeBSD 9.0-BETA3 #0 r225827: Wed Sep 28 17:11:17 EEST 2011 r...@mile.xxx.ua:/usr/obj/usr/src/sys/mile-9 i386 Unread portion of the kernel message buffer: panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero cpuid = 1 Uptime: 16h6m53s Physical memory: 1904 MB Dumping 367 MB: 352 336 320 304 288 272 256 240 224 208 192 176 160 144 128 112 96 80 64 48 32 16 #0 doadump (textdump=1) at pcpu.h:244 #1 0xc071e5cb in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:442 #2 0xc071e82b in panic (fmt=Variable fmt is not available. ) at /usr/src/sys/kern/kern_shutdown.c:607 #3 0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0) at /usr/src/sys/vm/vm_page.c:1905 Please do frame 2, then p/x *m and show the result. (kgdb) frame 2 frame 3, sorry. p/x *(struct vm_page *)0xc2a38dc8 will do it as well. (kgdb) frame 3 #3 0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0) at /usr/src/sys/vm/vm_page.c:1905 1905 panic(vm_page_unwire: page %p's wire count is zero, m); (kgdb) p/x *(struct vm_page *)0xc2a38dc8 $1 = {pageq = {tqe_next = 0xc2a38e10, tqe_prev = 0xc282a2b0}, listq = {tqe_next = 0xc2a38e10, tqe_prev = 0xc282a2b8}, left = 0x0, right = 0x0, object = 0xc5725770, pindex = 0xbd3, phys_addr = 0x56a32000, md = {pv_list = {tqh_first = 0xc3cc6418, tqh_last = 0xc3cc641c}, pat_mode = 0x6}, queue = 0x1, segind = 0x2, hold_count = 0x0, order = 0xb, pool = 0x0, cow = 0x0, wire_count = 0x0, aflags = 0x3, flags = 0x0, oflags = 0x0, act_count = 0x5, busy = 0x0, valid = 0xff, dirty = 0xff} Please show the output of p *(struct vm_object *)0xc5725770 from kgdb. #2 0xc071e82b in panic (fmt=Variable fmt is not available.) at /usr/src/sys/kern/kern_shutdown.c:607 607 kern_reboot(bootopt); (kgdb) p/x *m No symbol m in current context. #4 0xc0796b80 in vfs_vmio_release (bp=0xde8bcbf4) at /usr/src/sys/kern/vfs_bio.c:1638 #5 0xc0798813 in getnewbuf (vp=0xc6ea3550, slpflag=0, slptimeo=0, size=16384, maxsize=16384, gbflags=0) at /usr/src/sys/kern/vfs_bio.c:1949 #6 0xc0799f2a in getblk (vp=0xc6ea3550, blkno=2520, size=16384, slpflag=0, slptimeo=0, flags=Variable flags is not available. ) at /usr/src/sys/kern/vfs_bio.c:2788 #7 0xc079d49c in cluster_rbuild (vp=0xc6ea3550, filesize=44505088, lbn=2520, blkno=1209440, size=16384, run=Variable run is not available. ) at /usr/src/sys/kern/vfs_cluster.c:332 #8 0xc079e145 in cluster_read (vp=0xc6ea3550, filesize=44505088, lblkno=2520, size=16384, cred=0x0, totread=1024, seqcount=7, bpp=0xf5824b60) at /usr/src/sys/kern/vfs_cluster.c:254 #9 0xc0934cf5 in ffs_read (ap=0xf5824bac) at /usr/src/sys/ufs/ffs/ffs_vnops.c:514 #10 0xc09ccb92 in VOP_READ_APV (vop=0xc0aa6a80, a=0xf5824bac) at vnode_if.c:887 #11 0xc07c1120 in vn_read (fp=0xc5474508, uio=0xf5824c48, active_cred=0xc56a4d80, flags=1, td=0xc5b76b80) at vnode_if.h:384 #12 0xc076380e in dofileread (td=0xc5b76b80, fd=3, fp=0xc5474508, auio=0xf5824c48, offset=41189376, flags=1) at file.h:254 #13 0xc07639f5 in kern_preadv (td=0xc5b76b80, fd=3, auio=0xf5824c48, offset=41189376) at /usr/src/sys/kern/sys_generic.c:288 #14 0xc0763b0d in sys_pread (td=0xc5b76b80, uap=0xf5824cec) at /usr/src/sys/kern/sys_generic.c:189 #15 0xc09accf5 in syscall (frame=0xf5824d28) at subr_syscall.c:131 #16 0xc0996db1 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:266 #17 0x0033 in ?? () Previous frame inner to this frame (corrupt stack?) -- Alexandr Kovalenko http://uafug.org.ua/ ___ freebsd-sta...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo
Re: ia64 r225789 panic during make installworld: Bad buffer logic, remain = 0
On Wed, Sep 28, 2011 at 04:27:39PM +0300, Jaakko Heinonen wrote: On 2011-09-28, Anton Shterenlikht wrote: KDB: stack backtrace: getenv with the following non-sleepable locks held: exclusive sleep mutex vnode interlock (vnode interlock) r = 0 (0xe00011950488) locked @ /usr/src/sys/fs/devfs/devfs_vnops.c:406 etc. until a hang, requiring cold reset via MP. Someone is calling getenv with a vnode interlock held. You need to figure out the caller. Unfortunately the backtrace is missing above. As a temporary workaround you could comment the WITNESS_WARN() line in getenv() (sys/kern/kern_environment.c) but it is not a real fix. I do not think that this is the real cause of the panic. Line 406 in devfs_vnops.c belongs to devfs_allocv(), and vnode interlock taken there must be consumed by LK_INTERLOCK call to vget(). The getenv() cannot be called from the vget() or two unlock calls between lines 406 and 409. It seems there is something broken elsewere. pgpP9hKdJwqiy.pgp Description: PGP signature
Re: Choosing between DELAY(useconds) and pause()
On Tue, Sep 27, 2011 at 10:39:44AM -0600, Julian Elischer wrote: On 9/27/11 4:12 AM, Gavin Atkinson wrote: On Mon, 2011-09-26 at 09:30 -0400, John Baldwin wrote: On Friday, September 23, 2011 11:21:06 am Gavin Atkinson wrote: On Thu, 2011-09-22 at 20:07 +0200, Hans Petter Selasky wrote: On Thursday 22 September 2011 19:55:23 David Somayajulu wrote: It appears that the pause() function cannot be used in driver functions which are invoked early in the boot process. Is there is a kernel api which a device driver can use to determine whether to use pause() or DELAY(), for delays which are say greater than 10hz - may be even 1 hz ? Maybe you want to use something like this: if (cold) DELAY() else pause() In your code. Note that this still shouldn't be done in your suspend/resume paths, as cold isn't set there, however there also appears to be no guarantee that pause() will ever return (as you could be running after the timer has been suspended, or before it resumes). I'm not sure what the correct answer is for suspend/resume code. Hmmm, on x86 the timers are explicitly shutdown after the DEVICE_SUSPEND() pass over the tree and re-enabled before DEVICE_RESUME(). Perhaps this has changed in HEAD though with the eventtimers stuff. I do think it is best however, to use DELAY() in the suspend/resume path always regardless. I don't think head is any different from stable/8 in this respect - the same hack patch that fixes suspend/resume for me on head also fixes it on stable/8 (the patch basically fakes cold during USB suspend/resume). See my email to -usb a few months ago: http://docs.FreeBSD.org/cgi/mid.cgi?alpine.LNX.2.00.1106041548370.26975 I'd really like some guidance as to the correct solution to this, I have four separate laptops which fail out of the box on 8 and 9, but suspend/resume perfectly with this hack. code for timers should have a generally readable state that says if they are useable or not, and we should test that instead of 'cold' I think that this should be encapsulated into the usable function, instead of being directly exposed to the driver authors. With the my lack of imagination, I propose driver_delay() that would do if (cold) ... inside. pgp30errILlUx.pgp Description: PGP signature
Re: truss
On Mon, Sep 19, 2011 at 04:03:42PM +, Anton Yuzhaninov wrote: On Mon, 19 Sep 2011 15:00:31 + (UTC), Anton Yuzhaninov wrote: AY On Mon, 19 Sep 2011 15:58:02 +0300, Mikolaj Golub wrote: AY ktrace -i for truss sleep 5 AY http://dl.dropbox.com/u/8798217/tmp/truss_ktrace2.txt MG MG Although ptrace(PT_TRACE_ME,0,0,0) returned 0 the process did not stop after MG execve() and wait4() in parent (which was actually waiting for this stop) MG returned only after the child exit. No I idea why so far :-). MG AY AY As I understand SIGTRAP used to stop child process after execve(), but AY this signal ignored: AY AY citrin:~ sleep 300 AY citrin:~ procstat -i 1991 | fgrep TRAP AY 1991 sleepTRAP -I- AY AY Under FreeBSD 8, where ptrace works for me, this signal is not ignored: AY x:~ sleep 300 AY x:~ procstat -i 78716 | fgrep TRAP AY 78716 sleepTRAP --- SIGTRAP is ignored by X window manager used by me, and this is inherited across forks/execs up to the truss. IMHO truss should restore default signal handler for SIGTRAP. With this patch truss works for me: --- usr.bin/truss/main.c(revision 225504) +++ usr.bin/truss/main.c(working copy) @@ -255,6 +255,11 @@ main(int ac, char **av) if (trussinfo-pid == 0) { /* Start a command ourselves */ command = av; + /* +* SIGTRUP used to stop traced process after execve +* un-ignore this signal (it can be ignored by parents) +*/ + signal(SIGTRAP, SIG_DFL); trussinfo-pid = setup_and_wait(command); signal(SIGINT, SIG_IGN); signal(SIGTERM, SIG_IGN); This is quite a hack. The proper fix should go in kernel, otherwise we cannot debug programs that decided to ignore SIGTRAP. The reason there is that tdsendsignal() does nothing for attempt to deliver ignored signal, and kern_execve() simply tries to send a signal to self. BTW, it seems we might also have similar issues for threads that masks SIGTRAP, now because issignal() does nothing in that case too. But lets handle it by small steps. Could you, please, test the change below ? For me, I still can truss(1) or debug with gdb after the change applied. Does truss work for you with only this change, without resetting SIGTRAP handler in truss process ? commit 2ae586c039a55399edc3b34cd40410e0d690a16c Author: Konstantin Belousov kos...@pooma.home Date: Tue Sep 20 00:25:07 2011 +0300 Do not deliver SIGTRAP on exec as the normal signal, use ptracestop() on syscall exit path. Otherwise, if SIGTRAP is ignored, that tdsendsignal() do not want to deliver, and debugger never get a notification of exec. Found by: Anton Yuzhaninov cit...@citrin.ru diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c index fe01142..4545848 100644 --- a/sys/kern/kern_exec.c +++ b/sys/kern/kern_exec.c @@ -777,16 +777,6 @@ interpret: KNOTE_LOCKED(p-p_klist, NOTE_EXEC); p-p_flag = ~P_INEXEC; - /* -* If tracing the process, trap to the debugger so that -* breakpoints can be set before the program executes. We -* have to use tdsignal() to deliver the signal to the current -* thread since any other threads in this process will exit if -* execve() succeeds. -*/ - if (p-p_flag P_TRACED) - tdsignal(td, SIGTRAP); - /* clear fork but no exec flag, as we _are_ execing */ p-p_acflag = ~AFORK; diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c index cb0d929..bba4479 100644 --- a/sys/kern/subr_syscall.c +++ b/sys/kern/subr_syscall.c @@ -204,9 +204,17 @@ syscallret(struct thread *td, int error, struct syscall_args *sa __unused) * is not the case, this code will need to be revisited. */ STOPEVENT(p, S_SCX, sa-code); - PTRACESTOP_SC(p, td, S_PT_SCX); if (traced || (td-td_dbgflags (TDB_EXEC | TDB_FORK)) != 0) { PROC_LOCK(p); + /* +* If tracing the execed process, trap to the debugger +* so that breakpoints can be set before the program +* executes. If debugger requested tracing of syscall +* returns, do it now too. +*/ + if (traced ((td-td_dbgflags TDB_EXEC) != 0 || + (p-p_stops S_PT_SCX) != 0)) + ptracestop(td, SIGTRAP); td-td_dbgflags = ~(TDB_SCX | TDB_EXEC | TDB_FORK); PROC_UNLOCK(p); } pgpF2yup3qKFJ.pgp Description: PGP signature
Re: cvsup broken on amd64?
On Sun, Sep 18, 2011 at 12:22:53PM +0200, Oliver Lehmann wrote: Adrian Chadd adr...@freebsd.org wrote: So I've taken a look at the csup source. [...] What about this patch: [...] Oliver, would you please try that? I have a problem with cvsup, not csup - Alexander mentioned a csup problem. Did you saw the message with the patch for tzcode I mailed to you ? pgpxG7jeO8J6E.pgp Description: PGP signature
Re: Segfault in libthr.so on 9.0-BETA2 (with stunnel FWIW)
On Sun, Sep 18, 2011 at 01:56:50PM +0200, Jilles Tjoelker wrote: On Wed, Sep 14, 2011 at 11:04:56PM +0300, Kostik Belousov wrote: tzload() allocates ~80KB for the local variables. The backtrace you provided shows the nested call to tzload(), so there is total 160KB of the stack space consumed. By default, stack for the amd64 thread is 4MB, that should be plenty. This is not the case for ezm3. Possibly, stunnel also reduces the size of the thread stack. Please, try the patch below. I did not tested it, only compiled. I see that now tzload allocates only ~300 bytes on the stack. 80KB seems quite a lot indeed, good to bring it down. diff --git a/contrib/tzcode/stdtime/localtime.c b/contrib/tzcode/stdtime/localtime.c index 80b70ac..55d55e0 100644 --- a/contrib/tzcode/stdtime/localtime.c +++ b/contrib/tzcode/stdtime/localtime.c [snip] @@ -406,16 +409,24 @@ register const intdoextend; ** to hold the longest file name string that the implementation ** guarantees can be opened. */ - charfullname[FILENAME_MAX + 1]; + char*fullname; + + fullname = malloc(FILENAME_MAX + 1); + if (fullname == NULL) + goto out; if (name[0] == ':') ++name; doaccess = name[0] == '/'; if (!doaccess) { - if ((p = TZDIR) == NULL) + if ((p = TZDIR) == NULL) { + free(fullname); return -1; - if ((strlen(p) + 1 + strlen(name) + 1) = sizeof fullname) + } + if ((strlen(p) + 1 + strlen(name) + 1) = sizeof fullname) { This sizeof is now the sizeof of a pointer. The comparison should be against FILENAME_MAX + 1 instead. Alternatively, the name could be created using asprintf(). You are right. I fixed the defect. Updated patch below. diff --git a/contrib/tzcode/stdtime/localtime.c b/contrib/tzcode/stdtime/localtime.c index 80b70ac..b1981b6 100644 --- a/contrib/tzcode/stdtime/localtime.c +++ b/contrib/tzcode/stdtime/localtime.c @@ -380,13 +380,16 @@ register const intdoextend; int fid; int stored; int nread; + int res; union { struct tzhead tzhead; charbuf[2 * sizeof(struct tzhead) + 2 * sizeof *sp + 4 * TZ_MAX_TIMES]; - } u; + } *u; + u = NULL; + res = -1; sp-goback = sp-goahead = FALSE; /* XXX The following is from OpenBSD, and I'm not sure it is correct */ @@ -406,16 +409,24 @@ register const intdoextend; ** to hold the longest file name string that the implementation ** guarantees can be opened. */ - charfullname[FILENAME_MAX + 1]; + char*fullname; + + fullname = malloc(FILENAME_MAX + 1); + if (fullname == NULL) + goto out; if (name[0] == ':') ++name; doaccess = name[0] == '/'; if (!doaccess) { - if ((p = TZDIR) == NULL) + if ((p = TZDIR) == NULL) { + free(fullname); return -1; - if ((strlen(p) + 1 + strlen(name) + 1) = sizeof fullname) + } + if (strlen(p) + 1 + strlen(name) = FILENAME_MAX) { + free(fullname); return -1; + } (void) strcpy(fullname, p); (void) strcat(fullname, /); (void) strcat(fullname, name); @@ -426,37 +437,45 @@ register const intdoextend; doaccess = TRUE; name = fullname; } - if (doaccess access(name, R_OK) != 0) + if (doaccess access(name, R_OK) != 0) { + free(fullname); return -1; - if ((fid = _open(name, OPEN_MODE)) == -1) + } + if ((fid = _open(name, OPEN_MODE)) == -1) { + free(fullname); return -1; + } if ((_fstat(fid, stab) 0) || !S_ISREG(stab.st_mode)) { + free(fullname); _close(fid); return -1; } } - nread = _read(fid, u.buf, sizeof u.buf); + u = malloc(sizeof(*u)); + if (u == NULL) + goto out
Re: cvsup broken on amd64?
On Sun, Sep 18, 2011 at 02:46:24PM +0200, Oliver Lehmann wrote: Kostik Belousov kostik...@gmail.com wrote: Did you saw the message with the patch for tzcode I mailed to you ? Mmmh... no didn't reached my mailbox - can you resend it please? See the Segfault in libthr.so on 9.0-BETA2 (with stunnel FWIW) thread on the current@, where you are explicitely Cc:ed. I posted updated patch a minute ago. pgpbmqafpbDxl.pgp Description: PGP signature
Re: Crashes in world built w/ clang: FP registers?
On Fri, Sep 16, 2011 at 10:34:40PM -0500, Jason Harmening wrote: Hi everyone, Using clang as the default compiler, the kernel and drivers will work fine, but a lot of programs in the base system and ports will crash w/ SIGBUS. In fact, so much of the stuff in the chroot'ed world will crash (everything from csh to gcc) that it's basically unusable. I finally got around to building w/ debug symbols, and ran gdb on a coredump generated while I was trying to use tab completion in csh: (gdb) bt #0 tw_collect (command=dwarf2_read_address: Corrupted DWARF expression.) at /usr/src/bin/csh/../../contrib/tcsh/tw.parse.c:1308 #1 0x0042777b in t_search (word=Unhandled dwarf expression opcode 0x0) at /usr/src/bin/csh/../../contrib/tcsh/tw.parse.c:1725 #2 0x00426829 in tenematch (inputline=Variable inputline is not avail able.) at /usr/src/bin/csh/../../contrib/tcsh/tw.parse.c:301 #3 0x0043545d in Inputl () at /usr/src/bin/csh/../../contrib/tcsh/ed.inputl.c:415 #4 0x00417a90 in readc (wanteof=Variable wanteof is not available.) at /usr/src/bin/csh/../../contrib/tcsh/sh.lex.c:1653 #5 0x00416f37 in lex (hp=Variable hp is not available.) at /usr/src/bin/csh/../../contrib/tcsh/sh.lex.c:162 #6 0x00405afb in process (catch=Unhandled dwarf expression opcode 0x0) at /usr/src/bin/csh/../../contrib/tcsh/sh.c:1922 #7 0x00404b51 in main (argc=Variable argc is not available.) at /usr/src/bin/csh/../../contrib/tcsh/sh.c:1289 gdb) disas Dump of assembler code for function tw_collect: 0x004288b0 tw_collect+0: push %rbp 0x004288b1 tw_collect+1: mov%rsp,%rbp 0x004288b4 tw_collect+4: push %r15 0x004288b6 tw_collect+6: push %r14 0x004288b8 tw_collect+8: push %r13 0x004288ba tw_collect+10: push %r12 0x004288bc tw_collect+12: push %rbx 0x004288bd tw_collect+13: sub$0x2e8,%rsp 0x004288c4 tw_collect+20: mov%r9,-0x308(%rbp) 0x004288cb tw_collect+27: mov%r8,-0x300(%rbp) 0x004288d2 tw_collect+34: mov%rcx,-0x2f8(%rbp) 0x004288d9 tw_collect+41: mov%rdx,-0x2f0(%rbp) 0x004288e0 tw_collect+48: mov%esi,-0x2e8(%rbp) 0x004288e6 tw_collect+54: mov%edi,-0x2e4(%rbp) 0x004288ec tw_collect+60: movl $0x0,-0x1d4(%rbp) 0x004288f6 tw_collect+70: movaps 0x23115b(%rip),%xmm0 # 0x6 59a58 reslab+48 This is actually 0x659a58 reslab+48 movaps tried to load %xmm0 from the unaligned address, which is forbidden and causes #GP. I have no idea why clang generates unaligned loads. 0x004288fd tw_collect+77: lea-0x2(%rdi),%eax 0x00428900 tw_collect+80: mov%eax,-0x2e0(%rbp) 0x00428906 tw_collect+86: test %edi,%edi 0x00428908 tw_collect+88: movaps %xmm0,-0x210(%rbp) 0x0042890f tw_collect+95: sete %al ---Type return to continue, or q return to quit---q Quit (gdb) info line tw.parse.c:1308 Line 1308 of /usr/src/bin/csh/../../contrib/tcsh/tw.parse.c starts at address 0x4288f6 tw_collect+70 and ends at 0x4288fd tw_collect+77. Looks like it's crashing as soon as it tries to use the XMM registers. I'm not sure if all of the crashes I'm getting are like this one, but I was surprised to see FP registers in code like this. I'm using march=corei7 and -O2 for both world and kernel, but using march=nocona or just leaving out CPUTYPE has no effect (actual CPU is Nehalem Xeon 5520) Here's the relevant part of make.conf for completeness: .if !defined(CC) || ${CC} == cc CC=clang .endif .if !defined(CXX) || ${CXX} == c++ CXX=clang++ .endif .if !defined(CPP) || ${CPP} == cpp CPP=clang -E .endif NO_WERROR= WERROR= NO_FSCHG= CPUTYPE?=corei7 CFLAGS= -O2 -pipe COPTFLAGS= -O2 -pipe Any thoughts? Is there some simple fix for this I'm missing? Thanks, Jason ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org pgpLoLcvYvQtt.pgp Description: PGP signature
Re: Segfault in libthr.so on 9.0-BETA2 (with stunnel FWIW)
On Wed, Sep 14, 2011 at 02:36:07PM +0200, Jeremie Le Hen wrote: Hi list, I've recently migrated my services from a box running 8.1-STABLE to another one running 9.0-BETA2. I run stunnel 4.28 on 8.1-STABLE, and it has run flawlessly so far. I compiled manually this very version on 9.0-BETA2. But I get the following segfault: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 803008c00 (LWP 100496/stunnel)] 0x00080110d359 in gmtime_r () from /lib/libc.so.7 (gdb) thread [Current thread is 3 (Thread 803008c00 (LWP 100496/stunnel))] (gdb) bt #0 0x00080110d359 in gmtime_r () from /lib/libc.so.7 #1 0x00080110cdde in gmtime_r () from /lib/libc.so.7 #2 0x00080110dab4 in gmtime_r () from /lib/libc.so.7 #3 0x00080110dcc8 in gmtime_r () from /lib/libc.so.7 #4 0x000800e1d9e8 in pthread_once () from /lib/libthr.so.3 #5 0x00080110ca9f in timegm () from /lib/libc.so.7 #6 0x000805dff8d9 in OPENSSL_gmtime () from /usr/local/lib/libcrypto.so.7 #7 0x000805e74631 in ASN1_UTCTIME_adj () from /usr/local/lib/libcrypto.so.7 #8 0x000805e9462d in X509_time_adj_ex () from /usr/local/lib/libcrypto.so.7 #9 0x000805e9478c in X509_cmp_time () from /usr/local/lib/libcrypto.so.7 #10 0x000805e9496d in internal_verify () from /usr/local/lib/libcrypto.so.7 #11 0x000805e95f46 in X509_verify_cert () from /usr/local/lib/libcrypto.so.7 #12 0x000805b7f4c8 in ssl_verify_cert_chain () from /usr/local/lib/libssl.so.7 #13 0x000805b5d6e3 in ssl3_get_client_certificate () from /usr/local/lib/libssl.so.7 #14 0x000805b612bc in ssl3_accept () from /usr/local/lib/libssl.so.7 #15 0x00406f6e in init_ssl (c=0x803093000) at client.c:329 #16 0x004069a6 in do_client (c=0x803093000) at client.c:202 #17 0x0040676b in run_client (c=0x803093000) at client.c:150 #18 0x004066cf in client (arg=0x803093000) at client.c:123 #19 0x000800e18224 in pthread_getprio () from /lib/libthr.so.3 #20 0x in ?? () Note that I tried with the newest version of stunnel, it crashes at the same place. I also tried libssl.so both from the base system and from the ports, same thing. You need to compile both libc and libthr with debugging symbols and do a backtrace with such libraries. pgpvXUeoZClco.pgp Description: PGP signature