from:"Kostik Belousov"

Re: Does anyone try kib's Sandy Bridge PCID patch (pcid.2.patch)?

2012-01-30 Thread Kostik Belousov

On Mon, Jan 30, 2012 at 07:08:13PM +0800, Paul Ambrose wrote:
?? 2012??1??30?? 2:36??Kostik Belousov kostik...@gmail.com ??
On Mon, Jan 30, 2012 at 10:15:51AM +0800, Paul Ambrose wrote:
I have two boxes, one is AMD Athlon 610e 2.4G with FreeBSD-current
patched with pcid.2.patch? It works well without other issue and it
seem the pcid patch
does not affect other part of the kernel. The other one is Sandy
Athlons do not have PCID and probably will never implement it. They use
other tricks to get similar optimizations, transparently to the OS.

Just curious, is this AMD similar optimizations
Address Space Number (ASN) and Global flag
US Patent 6,604,187.
http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
This and the same-important next item 'The TLB Flush Filter' is what
I referred to.

I did not found anything about ASN in the AMD manual
It is a transparent optimization, which does not require any OS support.
Intel PCID is completely different, it shall be explicitely handled by OS.
It is some consequence of the nested pages support, AFAIU.

Bridge i5-2300 with FreeBSD 9 release patched with pcid.1.patch( the
pcid.2.patch seems
dependent on AVX and XSAVE stuffs which is available on -current). But
it hangs up just in a few minutes. I doubt the nvidia-driver which is
not recompiled with
patched kernel is the root, I will check this out later, but does
anyone meet similar problem?
There are two many variations compared to the config I did tested.
I do not see anything obvious in the changes between HEAD and stable/9
which could be blamed. Nvidia driver might be bigger suspect, but again,
I am not aware of anything wrong with it.

I have two question about the pcid.2.patch

Item 2 is clean, I fixed it.

For the item 1, I was only able to decipher the proposal to optimize
the global shootdown handler to restore the %cr3 with bit 64 set to not
invalidate current PCID. Is there some more changes ?

yes, that is what I meant. I was wondering using another way that each
process has different
pcid in each active processor, just as the freebsd mips and powerpc
uses. But obviously this way
is more friendly to non-pcid x86 processor.
Each vmspace (or pmap) has unique PCID with the patch, at least until
PCID space (12bit) is not exhausted. To really exhaust it, you need 4095
processes, so it is unlikely but possible event with the current settings.

pgpiiEGbdTChh.pgp
Description: PGP signature

Re: [ptrace] please review follow fork/exec changes

2012-01-30 Thread Kostik Belousov

On Mon, Jan 30, 2012 at 10:26:25AM -0800, Dmitry Mikulin wrote:
 
 
 On 01/28/2012 11:48 PM, Kostik Belousov wrote:
 On Fri, Jan 27, 2012 at 10:12:13AM -0800, Dmitry Mikulin wrote:
 Attached are 4 separate patches for each somewhat independent  portion of
 the kernel work related to the follow-fork implementation.
 Ok, as I said, I think that vfork-fork.patch is just wrong.
 
 Gdb needs to be able to read/write process memory between the time the 
 child is forked and exec is called (in the vfork case). Without the change 
 it causes a kernel panic when gdb tries to read/write process memory. Since 
 my understanding of the kernel is a bit limited, it was the best I could do 
 at the time. I will send more details about the panic once I get a working 
 fbsd system again. Maybe there's a better way to deal with the panic.
Please provide more details, I am looking forward for the panic
message and backtrace.

 
 Lets postpone discussion of the orphan.patch for later.
 
 OK.
 
 
 The follow-fork.patch and follow-exec.patch make me wonder, and I
 already expressed my doubts. IMO, all features, except one bug, are
 already presented in the stock src.
 
 More, suggested follow-{fork,exec} patches break the SCE/SCX tracers
 notification of fork and exec events, since TDB_FORK and TBD_EXEC flags
 are consumed before syscall returns (I also said this previously).
 
 
 
 
 
 Namely, if the process is being debugged, the successfull [f]execve()
 causes unconditional stop even. This makes PT_FOLLOW_EXEC unneccessary.
 
 Existing PT_FOLLOW_FORK implementation indeed has a bug, which was not
 revealed by my testing during the development, because I only tested
 SCE/SCX scenario. Namely, if PT_FOLLOW_FORK is requested, but the next
 stop is not SCX, then follow-fork notification is not sent. After this
 nit is fixed, PT_FOLLOW_FORK caller gets stops for the child creation.
 Child is put into stop state as needed to not loose it.
 
 I think this will fix only a part of the problem, the one that relates to 
 PT_CONTINUE.
 
 I still need the change that forces a stop in both child and parent on 
 fork(). Without my changes the notification is generated in the child but 
 not in the parent. I need to be able to have both processes stopped in gdb 
 in order to clean up and detach from the parent, and initialize and attach 
 to the child. The main reason I need both processes stopped is that gdb has 
 to be able to read/write into both processes address space and registers.  
 Ideally I would like to have a single event generated for fork() at a point 
 where both child and parent are stopped and available for ptrace read/write 
 requests.  Do you think it's possible?
The lack of the notification for parent is exactly what the patch I
posted fixes. More exactly, it is the lack of notification for parent
with PT_CONTINUE loop. I will commit this today.

Regarding a single notification. Currently, parent arrives at the
syscall return code only after the child is attached to the debugger.
See the cv_wait() in kern_fork.c:739. In other words, if you get the
PL_FLAG_FORK, the child is already attached (at last it shall be). My
scescx.c code illustrates this well, IMO.

You still get a separate stop from the child, but I do not see how is it
harmful.


pgpDwYkpcgypf.pgp
Description: PGP signature

Re: Does anyone try kib's Sandy Bridge PCID patch (pcid.2.patch)?

2012-01-29 Thread Kostik Belousov

On Mon, Jan 30, 2012 at 10:15:51AM +0800, Paul Ambrose wrote:
 I have two boxes, one is  AMD Athlon 610e 2.4G with FreeBSD-current
 patched with pcid.2.patch? It works well without other issue and it
 seem the pcid patch
 does not affect other part of the kernel. The other one is Sandy
Athlons do not have PCID and probably will never implement it. They use
other tricks to get similar optimizations, transparently to the OS.

 Bridge i5-2300 with FreeBSD 9 release patched with pcid.1.patch( the
 pcid.2.patch seems
 dependent on AVX and XSAVE stuffs which is available on -current). But
 it hangs up just in a few minutes. I doubt the nvidia-driver which is
 not recompiled with
 patched kernel is the root, I will check this out  later, but does
 anyone meet similar problem?
There are two many variations compared to the config I did tested.
I do not see anything obvious in the changes between HEAD and stable/9
which could be blamed. Nvidia driver might be bigger suspect, but again,
I am not aware of anything wrong with it.

 
 I have two question about the pcid.2.patch

Item 2 is clean, I fixed it.

For the item 1, I was only able to decipher the proposal to optimize
the global shootdown handler to restore the %cr3 with bit 64 set to not
invalidate current PCID. Is there some more changes ?

Thank you for looking at the patch.


pgpkG36NzMMe1.pgp
Description: PGP signature

Re: [ptrace] please review follow fork/exec changes

2012-01-28 Thread Kostik Belousov

On Fri, Jan 27, 2012 at 10:12:13AM -0800, Dmitry Mikulin wrote:
 Attached are 4 separate patches for each somewhat independent  portion of 
 the kernel work related to the follow-fork implementation.
Ok, as I said, I think that vfork-fork.patch is just wrong.
Lets postpone discussion of the orphan.patch for later.

The follow-fork.patch and follow-exec.patch make me wonder, and I
already expressed my doubts. IMO, all features, except one bug, are
already presented in the stock src.

More, suggested follow-{fork,exec} patches break the SCE/SCX tracers
notification of fork and exec events, since TDB_FORK and TBD_EXEC flags
are consumed before syscall returns (I also said this previously).

Namely, if the process is being debugged, the successfull [f]execve()
causes unconditional stop even. This makes PT_FOLLOW_EXEC unneccessary.

Existing PT_FOLLOW_FORK implementation indeed has a bug, which was not
revealed by my testing during the development, because I only tested
SCE/SCX scenario. Namely, if PT_FOLLOW_FORK is requested, but the next
stop is not SCX, then follow-fork notification is not sent. After this
nit is fixed, PT_FOLLOW_FORK caller gets stops for the child creation.
Child is put into stop state as needed to not loose it.

I updated the test program I use to test this functionality, see
http://people.freebsd.org/~kib/misc/scescx.c
The default or -s flag causes it to use SCE/SCX tracing, while -c flag
switches it to use PT_CONTINUE tracing, which should be the kind of loop
performed by normal debuggers. You can see the exec/fork events and
child auto-attach illustrated with this test.

Can you, please, provide the test-case which illustrates the omissions
in the existing API (with the patch below applied), if any ?

diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c
index bba4479..75328f6 100644
--- a/sys/kern/subr_syscall.c
+++ b/sys/kern/subr_syscall.c
@@ -212,7 +212,8 @@ syscallret(struct thread *td, int error, struct 
syscall_args *sa __unused)
 * executes.  If debugger requested tracing of syscall
 * returns, do it now too.
 */
-   if (traced  ((td-td_dbgflags  TDB_EXEC) != 0 ||
+   if (traced 
+   ((td-td_dbgflags  (TDB_FORK | TDB_EXEC)) != 0 ||
(p-p_stops  S_PT_SCX) != 0))
ptracestop(td, SIGTRAP);
td-td_dbgflags = ~(TDB_SCX | TDB_EXEC | TDB_FORK);


pgp116s22XAVu.pgp
Description: PGP signature

Re: knlist_empty locking fix

2012-01-27 Thread Kostik Belousov

On Thu, Jan 26, 2012 at 01:03:26PM -0800, Doug Ambrisko wrote:
 Ran into problems with running kqueue/aio with WITNESS etc.  Sometimes
 things are locked sometimes not.  knlist_remove is called telling it
 whether it is locked or not ie:
extern voidknlist_remove(struct knlist *knl, struct knote *kn, int 
 islocked);
 so I changed:
   extern int knlist_empty(struct knlist *knl);
 to:
   extern int knlist_empty(struct knlist *knl, int islocked);
 
 and then updated things to reflect that following what that state of the
 lock for knlist_remove.  If it is not locked, it gets a lock and 
 frees it after.
 
 This now fixes a panic when a process using kqueue/aio is killed on
 shutdown with WITNESS.
 
 It changes an API/ABI so it probably can't merged back.  If there are
 no objections then I'll commit it.
 
Change to knlist_init() does not make sense at all, the knlist shall
not be exposed to other consumers during initialization, so no need
to exclude the parallel access.

Regarding the knlist_empty(), I propose to keep it as is. Locking
the knlist inside knlist_empty() does not make sense, because lock
is immediately dropped afterward, and relocked for remove. This way,
the entry could be removed from the list meantime (can it, really ?).

I think that you should take a lock around the whole if() {} statement,
and call knlist_remove with locked == 1.


pgpCSJM8dw3vN.pgp
Description: PGP signature

Re: knlist_empty locking fix

2012-01-27 Thread Kostik Belousov

On Fri, Jan 27, 2012 at 09:42:58AM -0800, Doug Ambrisko wrote:
 Andrew Boyer writes:
 | On Jan 27, 2012, at 12:15 PM, Doug Ambrisko wrote:
 | 
 |  John Baldwin writes:
 |  | Agreed, I think the missing locking should just be added to the aio 
 code.
 |  
 |  Okay so then just:
 |  
 |  Index: vfs_aio.c
 |  ===
 |  RCS file: /usr/local/cvsroot/freebsd/src/sys/kern/vfs_aio.c,v
 |  retrieving revision 1.243.2.3.4.1
 |  diff -u -p -r1.243.2.3.4.1 vfs_aio.c
 |  --- vfs_aio.c 21 Dec 2010 17:09:25 -  1.243.2.3.4.1
 |  +++ vfs_aio.c 27 Jan 2012 17:07:11 -
 |  @@ -2509,9 +2509,12 @@ static void
 |  filt_aiodetach(struct knote *kn)
 |  {
 |struct aiocblist *aiocbe = kn-kn_ptr.p_aio;
 |  + struct knlist *knl = aiocbe-klist;
 |  
 |  - if (!knlist_empty(aiocbe-klist))
 |  - knlist_remove(aiocbe-klist, kn, 0);
 |  + knl-kl_lock(knl-kl_lockarg);
 |  + if (!knlist_empty(knl))
 |  + knlist_remove(knl, kn, 1);
 |  + knl-kl_unlock(knl-kl_lockarg);
 |  }
 |  
 |  /* kqueue filter function */
 |  
 |  I was trying to be consistant with knlist_remove but this is a much
 |  smaller change that can be merge to older branches.
 |  
 | Does filt_liodetach() need the same treatment?
 
 Yes, I had that in the original.  I updated that and optimized
 the knl to just get the structure needed.
 
 Index: vfs_aio.c
 ===
 RCS file: /usr/local/cvsroot/freebsd/src/sys/kern/vfs_aio.c,v
 retrieving revision 1.243.2.3.4.1
 diff -u -p -r1.243.2.3.4.1 vfs_aio.c
 --- vfs_aio.c 21 Dec 2010 17:09:25 -  1.243.2.3.4.1
 +++ vfs_aio.c 27 Jan 2012 17:35:47 -
 @@ -2508,10 +2508,12 @@ filt_aioattach(struct knote *kn)
  static void
  filt_aiodetach(struct knote *kn)
  {
 - struct aiocblist *aiocbe = kn-kn_ptr.p_aio;
 + struct knlist *knl = kn-kn_ptr.p_aio-klist;
  
 - if (!knlist_empty(aiocbe-klist))
 - knlist_remove(aiocbe-klist, kn, 0);
 + knl-kl_lock(knl-kl_lockarg);
 + if (!knlist_empty(knl))
 + knlist_remove(knl, kn, 1);
 + knl-kl_unlock(knl-kl_lockarg);
  }
  
  /* kqueue filter function */
 @@ -2553,10 +2555,12 @@ filt_lioattach(struct knote *kn)
  static void
  filt_liodetach(struct knote *kn)
  {
 - struct aioliojob * lj = kn-kn_ptr.p_lio;
 + struct knlist *knl = kn-kn_ptr.p_lio-klist;
It is easy to be style-compiant there and move initialization of knl
after the blank line.

Do you need two different functions now ? I think you can leave just one.

Otherwise looks fine.
  
 - if (!knlist_empty(lj-klist))
 - knlist_remove(lj-klist, kn, 0);
 + knl-kl_lock(knl-kl_lockarg);
 + if (!knlist_empty(knl))
 + knlist_remove(knl, kn, 1);
 + knl-kl_unlock(knl-kl_lockarg);
  }
  
  /* kqueue filter function */
 
 Thanks,
 
 Doug A.


pgpoFy8bRMgwv.pgp
Description: PGP signature

Re: [ptrace] please review follow fork/exec changes

2012-01-26 Thread Kostik Belousov

On Wed, Jan 25, 2012 at 03:48:04PM -0800, Dmitry Mikulin wrote:
 Thanks for taking a look at it. I'll try to explain the changes the best I 
 can: the work was done nearly 6 months ago...
 
 I would certainly appreciate some more words describing the changes.
 
 What is the goal of introducing the PT_FOLLOW_EXEC ? To not force
 the debugger to filter all syscall exits to see exec events ?
 
 PT_FOLLOW_EXEC was added to trace the exec() family of calls. When it's 
 enabled, the kernel will generate a trap to give debugger a chance to clean 
 up old process info and initialize its state with the new binary.
 
It was more a question, why PT_FLAG_EXEC is not enough.

 
 
 
 Why did you moved the stopevent/ptracestop for exec events from
 syscallret() to kern_execve() ? If moving, why the handling of TDB_EXEC
 is not removed from syscallret() ? I do not think that TDB_EXEC can be
 seen there after the patch. The same question about TDB_FORK.
 
 The reason I moved the event notifications from syscallret() is because the 
 debugger is interested successful completion of either fork or exec, not 
 just the fact that they have been called. If, say, a call to fork() failed, 
 as far as debugger is concerned, fork() might as well had never been 
 called. Having a ptracestop in syscallret triggered a trap on every return 
 from fork without telling the debugger whether a new process had been 
 created or not. Same goes for exec().
PT_FLAG_EXEC is only set if an exec-kind of syscall succeeded. The same
is true for PT_FLAG_FORKED, the flag is set only if a new child was
successfully created.

 
 Looking at the code now I think I should have removed TDB_EXEC/TDB_FORK 
 processing from syscallret(). I'll do that and verify that it works as 
 expected.
 
 
 If possible, I would greatly prefer to have fork changes separated.
 
 You mean separate fork changes into a separate patch from exec changes?
Yes. Even more, it seems that fork changes should be separated into
bug fixes and new functionality.

 
 
 I doubt that disallowing RFMEM while tracing is the right change. It may
 be to fix some (undescribed) bug, but RFMEM is documented behaviour used
 not only for vfork(2), and the change just breaks rfork(2) for debugged
 processes.
 
 Even vfork() should not be broken, since I believe there are programs
 that rely on the vfork() model, in particular, C shell. It will be
 broken if vfork() is substituted by fork(). The fact that it breaks only
 under debugger will make it esp. confusing.
 
 I need to dig up the details since I can't recall the exact reason for 
 forcing fork() in cases of user calls to vfork() under gdb. I believe it 
 had to do with when child process memory was available for writing in case 
 of vfork() and it wasn't early enough to complete the switch over from 
 parent to child in gdb. After consulting with our internal kernel experts 
 we decided that doing fork() instead of vfork() under gdb is a low impact 
 change.
 
 
 Why do we need to have TDB_FORK set for td2 ?
 
 The debugger needs to intercept fork() in both parent and child so it can 
 detach from the old process and attach to the new one. Maybe it'll make 
 more sense in the context of gdb changes. Should I send them too? Don't 
 think Marcel included that patch...
No, I think the kernel changes are enough for now.

 
 
 Does the orphan list change intended to not lost the child after fork ?
 But the child shall be traced, so debugger would get the SIGTRAP on
 the attach on fork returning to usermode. I remember that I explicitely
 tested this when adding followfork changes.
 
 Yes, the debugger gets SIGTRAPs. The problem arises when the real parent of 
 the forked process has the code to collect termination status. Since 
 attaching to a process changes the parent/child relationships, we need to 
 keep track of the children lost due to re-parenting so we can properly 
 attribute their exit status to the real parent.
 
I do not quite understand it. From the description it sounds as the problem
that is not related to follow fork changes at all, and shall be presented
regardless of the follow fork. If yes, I think this shall be separated
into a standalone patch.


pgphUacrPo7lj.pgp
Description: PGP signature

Re: panic: No NCF_TS

2012-01-25 Thread Kostik Belousov

On Mon, Jan 23, 2012 at 06:54:49AM -0800, John Baldwin wrote:
 On 1/22/12 7:05 PM, Kostik Belousov wrote:
  On Mon, Jan 23, 2012 at 05:36:42AM +0400, Yuri Pankov wrote:
  Seems to be reproducible here running r230467 as the NFS client and
  r230135 as NFS server. NFSv4 not enabled.
 
  # mount
  [...]
  sirius:/data/distfiles on /usr/ports/distfiles (nfs)
 
  # /usr/bin/env /usr/bin/fetch -AFpr -S 4682084 -o 
  /usr/ports/distfiles/sqlite-src-3071000.zip 
  http://www.sqlite.org/sqlite-src-3071000.zip
  /usr/ports/distfiles/sqlite-src-3071000.zip   100% of 4572 kB  379 kBps 
  00m00s
  # rm /usr/ports/distfiles/sqlite-src-3071000.zip
 
  immediately followed by:
 
  panic: No NCF_TS
  cpuid = 2
  KDB: enter: panic
  [ thread pid 1603 tid 100494 ]
  Stopped at kdb_enter+0x3e: movq$0,kdb_why
  db bt
  Tracing pid 1603 tid 100494 td 0xfe0089585460
  kdb_enter() at kdb_enter+0x3e
  panic() at panic+0x245
  cache_lookup_times() at cache_lookup_times+0x6b5
  nfs_lookup() at nfs_lookup+0x190
  VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x8b
  lookup() at lookup+0x7e9
  namei() at namei+0x74c
  kern_statat_vnhook() at kern_statat_vnhook+0x90
  sys_lstat() at sys_lstat+0x30
  amd64_syscall() at amd64_syscall+0x221
  Xfast_syscall() at Xfast_syscall+0xfb
  --- syscall (190, FreeBSD ELF64, sys_lstat), rip = 0x80093ff3c, rsp = 
  0x7fffd8d8, rbp = 0x7fffd979 ---
 
  
  Yes, my bad. I wrote the r230441 with the assumption that filesystems
  are consistent in their use of cache_enter_time(). And my net-booting
  test box did not catched this, which is at least interesting.
  
  Please try the change below. Actually, it should be enough to only apply
  the changes for fs/nfsclient (or nfsclient/ if you use old nfs). If this
  does not help, then please try the whole patch.
 
 I think we should have the existing assertion.  If cache_lookup_times()
 is called with a timestamp requested, then returning a name cache entry
 without a timestamp is just going to result in that name cache entry not
 being used.  Panic'ing in that case is correct.
 
 With regard to the NFS changes below, all of these are bugs that didn't
 really work right before.  Specifically, adding a positive entry without
 setting n_ctime previously would have just resulted in the name cache
 entry being purged on the next lookup anyway.  Keep in mind that the
 timestamp for a give name cache entry in NFS needs to match an actual
 timestamp returned by the NFS server as post-op attributes in some RPC.
 Using the timestamp from vfs_timestamp() is completely bogus.  Instead,
 the timestamp for a negative entry needs to be the mtime of the parent
 directory.  If we don't have that timestamp handy, then we should just
 not add a namecache entry at all.  Also, the vap-va_ctime used below
 for mknod() and create() is not a timestamp from the server, so it is
 also suspect.  I can look at this in more detail on Wednesday, but my
 best guess is that nearly all (if not all) of these cache_enter() calls
 will simply need to be removed.
 
 Note that other filesystems like UFS don't bother creating name cache
 entries for things like create or mknod.  It's debatable if the NFS
 client should even be creating any name cache entries outside of lookup
 and when handling a READDIR+ reply.

Ok, next try. I did removed the cache_enter calls for old nfs client,
but it seems that the calls can be kept for the new client.


diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c
index 2747191..709cd8d 100644
--- a/sys/fs/nfsclient/nfs_clvnops.c
+++ b/sys/fs/nfsclient/nfs_clvnops.c
@@ -1431,8 +1431,6 @@ nfs_mknodrpc(struct vnode *dvp, struct vnode **vpp, 
struct componentname *cnp,
}
}
if (!error) {
-   if ((cnp-cn_flags  MAKEENTRY))
-   cache_enter(dvp, newvp, cnp);
*vpp = newvp;
} else if (NFS_ISV4(dvp)) {
error = nfscl_maperr(cnp-cn_thread, error, vap-va_uid,
@@ -1591,7 +1589,7 @@ again:
}
if (!error) {
if (cnp-cn_flags  MAKEENTRY)
-   cache_enter(dvp, newvp, cnp);
+   cache_enter_time(dvp, newvp, cnp, vap-va_ctime);
*ap-a_vpp = newvp;
} else if (NFS_ISV4(dvp)) {
error = nfscl_maperr(cnp-cn_thread, error, vap-va_uid,
@@ -1966,8 +1964,9 @@ nfs_link(struct vop_link_args *ap)
 * must care about lookup caching hit rate, so...
 */
if (VFSTONFS(vp-v_mount)-nm_negnametimeo != 0 
-   (cnp-cn_flags  MAKEENTRY))
-   cache_enter(tdvp, vp, cnp);
+   (cnp-cn_flags  MAKEENTRY)  dattrflag) {
+   cache_enter_time(tdvp, vp, cnp, dnfsva.na_mtime);
+   }
if (error  NFS_ISV4(vp))
error = nfscl_maperr(cnp-cn_thread, error, (uid_t)0,
(gid_t)0);
@@ -2023,15 +2022,6 @@ nfs_symlink(struct vop_symlink_args *ap)
error

Re: nullfs broken on powerpc

2012-01-25 Thread Kostik Belousov

On Tue, Jan 24, 2012 at 06:31:52PM +0100, Milan Obuch wrote:
 Hi,
 
 as I succesfully installed FreeBSD on Mac Mini (older powerpc model
 with G4 CPU) I tried to use mount_nullfs to share some files for more
 systems. I use this method for years on i386 and amd64 platforms for
 years now to share sources and other files.
 
 Basically I want to have small system specific slice/partition and
 large shared data area. System sources are in shared area as well as
 some working area (think /usr/src and /usr/obj directories).
 
 This does not work with powerpc for me. With sources csup'ped this
 morning, full system rebuild with GENERIC kernel, it is enough for me
 to issue
 
 mount_nullfs /data/src10 /usr/src
 csup /usr/share/examples/cvsup/standard-supfile
 
 and system panic occurs, with following on system console:
 
 panic: mtx_lock() of spin mutex (null) @ /usr/src/sys/kern/vfs_subr.c:2670
 cpuid = 0
 KDB: enter: panic
 [ thread pid 1442 tid 100095 ]
 Stopped at 0x40f734: addi r0, r0, 0x0
 db
 
 At this point, I am able to interact with system, the question for me
 is what I want to get from it :) I tried bt with following result:
 
 Tracing pid 1442 tid 100095 td 0x2d6b000
 0xe22c26d0: at panic+0x274
 0xe22c2730: at _mtx_lock_flags+0xc4
 0xe22c2760: at vgonel+0x330
 0xe22c27b0: at vrecycle+0x54
 0xe22c27d0: at null_inactive+0x30
 0xe22c27f0: at VOP_INACTIVE_APV+0xdc
 0xe22c2810: at vinactive+0x98
 0xe22c2850: at vputx+0x344
 0xe22c28a0: at vput+0x18
 0xe22c28c0: at kern_statat_vnhook+0x108
 0xe22c29d0: at kern_statat+0x18
 0xe22c29f0: at kern_lstat+0x2c
 0xe22c2a10: at sys_lstat+0x30
 0xe22c2a90: at trap+0x388
 0xe22c2b60: at powerpc_interrupt+0x108
 0xe22c2b90: user SC trap by _end+0x40d88c70: srr1=0xd032
 r1=0xffaf9a70 cr=0x28004044 xer=0x2000 ctr=0x41a0ac40
 db
 
 Does this shed any light for someone with more knowledge here? My gut
 feeling is there is some endianness issue at play, the same nullfs
 usage works for me flawlessly on both i386 and amd64 systems, so it
 could not be 32 vs 64 bit issue at least.
 
 At line 2670 of /usr/src/sys/kern/vfs_subr.c I see end of function void
 vgonel(struct vnode *vp)
 
 VI_LOCK(vp);
 vp-v_vnlock = vp-v_lock;
 vp-v_op = dead_vnodeops;
 vp-v_tag = none;
 vp-v_type = VBAD;
 }
 
 so the question seems to be reduced to 'why is vp null?' or is my small
 attempt on analyse flawed...
I do not think that the vp is null. It more look like the *vp memory
was zeroed. This has very low chances of being related to endianess, and
more like a kernel memory corruption.

Take a dump and print the content of *vp.


pgpNGYFTFCBWD.pgp
Description: PGP signature

Re: nullfs broken on powerpc

2012-01-25 Thread Kostik Belousov

On Wed, Jan 25, 2012 at 08:50:41PM +0100, Milan Obuch wrote:
 On Wed, 25 Jan 2012 14:21:23 +0200
 Kostik Belousov kostik...@gmail.com wrote:
 
  On Tue, Jan 24, 2012 at 06:31:52PM +0100, Milan Obuch wrote:
   Hi,
   
 
 [ snip ]
 
   This does not work with powerpc for me. With sources csup'ped this
   morning, full system rebuild with GENERIC kernel, it is enough for
   me to issue
   
   mount_nullfs /data/src10 /usr/src
   csup /usr/share/examples/cvsup/standard-supfile
   
   and system panic occurs, with following on system console:
   
   panic: mtx_lock() of spin mutex (null)
   @ /usr/src/sys/kern/vfs_subr.c:2670 cpuid = 0
   KDB: enter: panic
   [ thread pid 1442 tid 100095 ]
   Stopped at 0x40f734: addi r0, r0, 0x0
   db
   
   At this point, I am able to interact with system, the question for
   me is what I want to get from it :) I tried bt with following
   result:
   
   Tracing pid 1442 tid 100095 td 0x2d6b000
   0xe22c26d0: at panic+0x274
   0xe22c2730: at _mtx_lock_flags+0xc4
   0xe22c2760: at vgonel+0x330
   0xe22c27b0: at vrecycle+0x54
   0xe22c27d0: at null_inactive+0x30
   0xe22c27f0: at VOP_INACTIVE_APV+0xdc
   0xe22c2810: at vinactive+0x98
   0xe22c2850: at vputx+0x344
   0xe22c28a0: at vput+0x18
   0xe22c28c0: at kern_statat_vnhook+0x108
   0xe22c29d0: at kern_statat+0x18
   0xe22c29f0: at kern_lstat+0x2c
   0xe22c2a10: at sys_lstat+0x30
   0xe22c2a90: at trap+0x388
   0xe22c2b60: at powerpc_interrupt+0x108
   0xe22c2b90: user SC trap by _end+0x40d88c70: srr1=0xd032
   r1=0xffaf9a70 cr=0x28004044 xer=0x2000
   ctr=0x41a0ac40
   db
   
   Does this shed any light for someone with more knowledge here? My
   gut feeling is there is some endianness issue at play, the same
   nullfs usage works for me flawlessly on both i386 and amd64
   systems, so it could not be 32 vs 64 bit issue at least.
   
   At line 2670 of /usr/src/sys/kern/vfs_subr.c I see end of function
   void vgonel(struct vnode *vp)
   
   VI_LOCK(vp);
   vp-v_vnlock = vp-v_lock;
   vp-v_op = dead_vnodeops;
   vp-v_tag = none;
   vp-v_type = VBAD;
   }
   
   so the question seems to be reduced to 'why is vp null?' or is my
   small attempt on analyse flawed...
 
  I do not think that the vp is null. It more look like the *vp memory
  was zeroed. This has very low chances of being related to endianess,
  and more like a kernel memory corruption.
  
  Take a dump and print the content of *vp.
 
 How could I look into memory? I found page
 http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-online-ddb.html
 and I can see registers (show reg), use x with absolute addresses, but
 something like 'x vp' tells just 'Symbol not known' - should I somehow
 load symbol table into memory? But backtrace shows function names... or
 should I somehow modify GENERIC kernel to include more debugging info?
 Kernel debugging is a bit new for me, even if I can write simple
 modification into kernel, but only in some special (and narrow) area of
 code...

You shall/could take a dump and then use kgdb to look at *vp.
If doing from ddb, you need to look at the disassembly of the
function to undestand where to find the vp (probably in some register).


pgp34H56yP2bJ.pgp
Description: PGP signature

Re: nullfs broken on powerpc

2012-01-25 Thread Kostik Belousov

On Wed, Jan 25, 2012 at 10:00:26PM +0100, Andreas Tobler wrote:
 On 25.01.12 21:29, Eitan Adler wrote:
 On Wed, Jan 25, 2012 at 2:50 PM, Milan Obuchfreebsd-curr...@dino.sk  
 wrote:
 On Wed, 25 Jan 2012 14:21:23 +0200
 Kostik Belousovkostik...@gmail.com  wrote:
 
 On Tue, Jan 24, 2012 at 06:31:52PM +0100, Milan Obuch wrote:
 Hi,
 
 
 [ snip ]
 
 This does not work with powerpc for me. With sources csup'ped this
 morning, full system rebuild with GENERIC kernel, it is enough for
 me to issue
 
 mount_nullfs /data/src10 /usr/src
 csup /usr/share/examples/cvsup/standard-supfile
 
 and system panic occurs, with following on system console:
 
 panic: mtx_lock() of spin mutex (null)
 @ /usr/src/sys/kern/vfs_subr.c:2670 cpuid = 0
 KDB: enter: panic
 [ thread pid 1442 tid 100095 ]
 Stopped at 0x40f734: addi r0, r0, 0x0
 db
 
 At this point, I am able to interact with system, the question for
 me is what I want to get from it :) I tried bt with following
 result:
 
 Tracing pid 1442 tid 100095 td 0x2d6b000
 0xe22c26d0: at panic+0x274
 0xe22c2730: at _mtx_lock_flags+0xc4
 0xe22c2760: at vgonel+0x330
 0xe22c27b0: at vrecycle+0x54
 0xe22c27d0: at null_inactive+0x30
 0xe22c27f0: at VOP_INACTIVE_APV+0xdc
 0xe22c2810: at vinactive+0x98
 0xe22c2850: at vputx+0x344
 0xe22c28a0: at vput+0x18
 0xe22c28c0: at kern_statat_vnhook+0x108
 0xe22c29d0: at kern_statat+0x18
 0xe22c29f0: at kern_lstat+0x2c
 0xe22c2a10: at sys_lstat+0x30
 0xe22c2a90: at trap+0x388
 0xe22c2b60: at powerpc_interrupt+0x108
 0xe22c2b90: user SC trap by _end+0x40d88c70: srr1=0xd032
  r1=0xffaf9a70 cr=0x28004044 xer=0x2000
 ctr=0x41a0ac40
 db
 
 Does this shed any light for someone with more knowledge here? My
 gut feeling is there is some endianness issue at play, the same
 nullfs usage works for me flawlessly on both i386 and amd64
 systems, so it could not be 32 vs 64 bit issue at least.
 
 At line 2670 of /usr/src/sys/kern/vfs_subr.c I see end of function
 void vgonel(struct vnode *vp)
 
  VI_LOCK(vp);
  vp-v_vnlock =vp-v_lock;
  vp-v_op =dead_vnodeops;
  vp-v_tag = none;
  vp-v_type = VBAD;
 }
 
 so the question seems to be reduced to 'why is vp null?' or is my
 small attempt on analyse flawed...
 
 I do not think that the vp is null. It more look like the *vp memory
 was zeroed. This has very low chances of being related to endianess,
 and more like a kernel memory corruption.
 
 Take a dump and print the content of *vp.
 
 How could I look into memory? I found page
 http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-online-ddb.html
 and I can see registers (show reg), use x with absolute addresses, but
 something like 'x vp' tells just 'Symbol not known' - should I somehow
 load symbol table into memory? But backtrace shows function names... or
 should I somehow modify GENERIC kernel to include more debugging info?
 Kernel debugging is a bit new for me, even if I can write simple
 modification into kernel, but only in some special (and narrow) area of
 code...
 
 From ddb write 'call doadump'. Provided you have a proper dump device
 set up in rc.conf it should work. You could then use kgdb from a
 running computer to analyze the dump in more detail.
 
 This only works if your target is booke, AIM (Apple based machines) do 
 not have the 'call doadump' implemented yet. It is somewhere on my long 
 todo list.
FWIW, it is not 'call doadump', it is just 'dump' for some time.
I think calling doadump does not work.


pgprHbTfgnpRg.pgp
Description: PGP signature

Re: [ptrace] please review follow fork/exec changes

2012-01-24 Thread Kostik Belousov

On Tue, Jan 24, 2012 at 01:36:55PM -0800, Marcel Moolenaar wrote:
 All,
 
 Please review the attached changes (done by Dmitry -- CC'd) to add support for
 PT_FOLLOW_EXEC and cleanup PT_FOLLOW_FORK.
 
 I'll commit this when there are no comments/objections.
 Thanks,

I would certainly appreciate some more words describing the changes.

What is the goal of introducing the PT_FOLLOW_EXEC ? To not force
the debugger to filter all syscall exits to see exec events ?

Why did you moved the stopevent/ptracestop for exec events from
syscallret() to kern_execve() ? If moving, why the handling of TDB_EXEC
is not removed from syscallret() ? I do not think that TDB_EXEC can be
seen there after the patch. The same question about TDB_FORK.

If possible, I would greatly prefer to have fork changes separated.

I doubt that disallowing RFMEM while tracing is the right change. It may
be to fix some (undescribed) bug, but RFMEM is documented behaviour used
not only for vfork(2), and the change just breaks rfork(2) for debugged
processes.

Even vfork() should not be broken, since I believe there are programs
that rely on the vfork() model, in particular, C shell. It will be
broken if vfork() is substituted by fork(). The fact that it breaks only
under debugger will make it esp. confusing.

Why do we need to have TDB_FORK set for td2 ?

Does the orphan list change intended to not lost the child after fork ?
But the child shall be traced, so debugger would get the SIGTRAP on
the attach on fork returning to usermode. I remember that I explicitely
tested this when adding followfork changes.


pgp3FXLhsTAQ8.pgp
Description: PGP signature

Re: panic: No NCF_TS

2012-01-22 Thread Kostik Belousov

On Mon, Jan 23, 2012 at 05:36:42AM +0400, Yuri Pankov wrote:
 Seems to be reproducible here running r230467 as the NFS client and
 r230135 as NFS server. NFSv4 not enabled.
 
 # mount
 [...]
 sirius:/data/distfiles on /usr/ports/distfiles (nfs)
 
 # /usr/bin/env /usr/bin/fetch -AFpr -S 4682084 -o 
 /usr/ports/distfiles/sqlite-src-3071000.zip 
 http://www.sqlite.org/sqlite-src-3071000.zip
 /usr/ports/distfiles/sqlite-src-3071000.zip   100% of 4572 kB  379 kBps 00m00s
 # rm /usr/ports/distfiles/sqlite-src-3071000.zip
 
 immediately followed by:
 
 panic: No NCF_TS
 cpuid = 2
 KDB: enter: panic
 [ thread pid 1603 tid 100494 ]
 Stopped atkdb_enter+0x3e: movq$0,kdb_why
 db bt
 Tracing pid 1603 tid 100494 td 0xfe0089585460
 kdb_enter() at kdb_enter+0x3e
 panic() at panic+0x245
 cache_lookup_times() at cache_lookup_times+0x6b5
 nfs_lookup() at nfs_lookup+0x190
 VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x8b
 lookup() at lookup+0x7e9
 namei() at namei+0x74c
 kern_statat_vnhook() at kern_statat_vnhook+0x90
 sys_lstat() at sys_lstat+0x30
 amd64_syscall() at amd64_syscall+0x221
 Xfast_syscall() at Xfast_syscall+0xfb
 --- syscall (190, FreeBSD ELF64, sys_lstat), rip = 0x80093ff3c, rsp = 
 0x7fffd8d8, rbp = 0x7fffd979 ---
 

Yes, my bad. I wrote the r230441 with the assumption that filesystems
are consistent in their use of cache_enter_time(). And my net-booting
test box did not catched this, which is at least interesting.

Please try the change below. Actually, it should be enough to only apply
the changes for fs/nfsclient (or nfsclient/ if you use old nfs). If this
does not help, then please try the whole patch.

Even if not needed for correctness, I think vfs_cache.c change could
be made useful by adding KASSERT to it.

commit 9771eb4739170d014fcebfbad07bfed4076c6d85
Author: Konstantin Belousov kos...@pooma.home
Date:   Mon Jan 23 04:36:08 2012 +0200

Make NCF_TS mismatch not fatal.
Consistently use cache_enter_time in nfs clients.

diff --git a/sys/fs/nfsclient/nfs_clvnops.c b/sys/fs/nfsclient/nfs_clvnops.c
index 2747191..2841647 100644
--- a/sys/fs/nfsclient/nfs_clvnops.c
+++ b/sys/fs/nfsclient/nfs_clvnops.c
@@ -1432,7 +1432,7 @@ nfs_mknodrpc(struct vnode *dvp, struct vnode **vpp, 
struct componentname *cnp,
}
if (!error) {
if ((cnp-cn_flags  MAKEENTRY))
-   cache_enter(dvp, newvp, cnp);
+   cache_enter_time(dvp, newvp, cnp, vattr.va_ctime);
*vpp = newvp;
} else if (NFS_ISV4(dvp)) {
error = nfscl_maperr(cnp-cn_thread, error, vap-va_uid,
@@ -1591,7 +1591,7 @@ again:
}
if (!error) {
if (cnp-cn_flags  MAKEENTRY)
-   cache_enter(dvp, newvp, cnp);
+   cache_enter_time(dvp, newvp, cnp, vap-va_ctime);
*ap-a_vpp = newvp;
} else if (NFS_ISV4(dvp)) {
error = nfscl_maperr(cnp-cn_thread, error, vap-va_uid,
@@ -1923,6 +1923,7 @@ nfs_link(struct vop_link_args *ap)
struct componentname *cnp = ap-a_cnp;
struct nfsnode *np, *tdnp;
struct nfsvattr nfsva, dnfsva;
+   struct timespec ts;
int error = 0, attrflag, dattrflag;
 
if (vp-v_mount != tdvp-v_mount) {
@@ -1966,8 +1967,10 @@ nfs_link(struct vop_link_args *ap)
 * must care about lookup caching hit rate, so...
 */
if (VFSTONFS(vp-v_mount)-nm_negnametimeo != 0 
-   (cnp-cn_flags  MAKEENTRY))
-   cache_enter(tdvp, vp, cnp);
+   (cnp-cn_flags  MAKEENTRY)) {
+   vfs_timestamp(ts);
+   cache_enter_time(tdvp, vp, cnp, ts);
+   }
if (error  NFS_ISV4(vp))
error = nfscl_maperr(cnp-cn_thread, error, (uid_t)0,
(gid_t)0);
@@ -1987,6 +1990,7 @@ nfs_symlink(struct vop_symlink_args *ap)
struct nfsfh *nfhp;
struct nfsnode *np = NULL, *dnp;
struct vnode *newvp = NULL;
+   struct timespec ts;
int error = 0, attrflag, dattrflag, ret;
 
vap-va_type = VLNK;
@@ -2030,8 +2034,10 @@ nfs_symlink(struct vop_symlink_args *ap)
 * must care about lookup caching hit rate, so...
 */
if (VFSTONFS(dvp-v_mount)-nm_negnametimeo != 0 
-   (cnp-cn_flags  MAKEENTRY))
-   cache_enter(dvp, newvp, cnp);
+   (cnp-cn_flags  MAKEENTRY)) {
+   vfs_timestamp(ts);
+   cache_enter_time(dvp, newvp, cnp, ts);
+   }
*ap-a_vpp = newvp;
}
 
@@ -2063,6 +2069,7 @@ nfs_mkdir(struct vop_mkdir_args *ap)
struct vattr vattr;
struct nfsfh *nfhp;
struct nfsvattr nfsva, dnfsva;
+   struct timespec ts;
int error = 0, attrflag, dattrflag, ret;
 
if ((error = VOP_GETATTR(dvp, vattr, cnp-cn_cred)) != 0)
@@ -2116,8 +2123,10 @@ nfs_mkdir(struct vop_mkdir_args *ap)
 *

Re: FreeBSD 9 recompile ports

2012-01-13 Thread Kostik Belousov

On Fri, Jan 13, 2012 at 04:11:22PM +0200, Andriy Gapon wrote:
 on 13/01/2012 14:57 George Kontostanos said the following:
  Still the question remains regarding COMPAT_FREEBSD8 and how does this
  affects ports/misc/compat8x/
 
 Looks like all the previous hints have not been clear enough.
 There is no direct relation between COMPAT_FREEBSD8 and misc/compat8x.
 COMPAT_FREEBSDX options are only needed when going from release X to 
 release X+1
 there was a change to an existing system call at the kernel-userland boundary.
 A side note: kernel options affect only what's in the kernel, quite obviously.
 misc/compatXx contains versions of shared libraries from release X that are 
 no
 longer present in X+1.

Additional twist is that not every change at the kernel/usermode boundary
is covered with backward-compatibility shims. Recent example is the CAM
ABI change, which makes libcam.so.5 from the compat8x useless.


pgpuaqFjafo46.pgp
Description: PGP signature

Re: CD9660/md(4)/UFS22 silly behaviour

2012-01-08 Thread Kostik Belousov

On Sun, Jan 08, 2012 at 10:21:50PM +, Poul-Henning Kamp wrote:
 
 I'm doing som data-mining on a pile of ISO images right now.
 
 I stuck the ISOs on a UFS2 on a flash-disk for speed, and mdconfig(8)'d
 them so I could mount them.
 
 The traffic pattern his interesting:
 
 dT: 1.003s  w: 1.000s
  L(q)  ops/sr/s   kBps   ms/rw/s   kBps   ms/w   %busy Name
 [...]
 1733733   14661.3  0  00.0   98.2| md39
 1733733  234491.3  0  00.0   93.2| da0
 
 Notice the 1:16 ratio on kBps but 1:1 ratio on ops/s ?
 
 da0's UFS2 has 32k block-size:
 
   magic   19540119 (UFS2) timeWed Jan  4 16:41:47 2012
   superblock location 65536   id  [ 4f046cf5 c30697ee ]
   ncg 104 size19537685blocks  19228156
   bsize   32768   shift   15  mask0x8000
   fsize   4096shift   12  mask0xf000
   [...]
 
 It looks like every 2k read from CD9660 turns into a 32k block
 read in the UFS filesystem, without any beneficial caching happening.
 
 Less than optimal I'd say...
 
What is the access patern ? Is it random access, or sequential read
(from the cd9660 POV) ?


pgpj4AQGx2y89.pgp
Description: PGP signature

Re: CD9660/md(4)/UFS22 silly behaviour

2012-01-08 Thread Kostik Belousov

On Sun, Jan 08, 2012 at 10:31:06PM +, Poul-Henning Kamp wrote:
 In message 20120108222720.gn31...@deviant.kiev.zoral.com.ua, Kostik 
 Belousov 
 writes:
 
 What is the access patern ? Is it random access, or sequential read
 (from the cd9660 POV) ?
 
 Random access to files in the CD9660 filesystem, which stores files
 in sequential 2K blocks.

Then it is reasonable. UFS reads full blocks. If you want/plan to use
UFS volume for small reads exclusively, you can newfs it with much
smaller block size, e.g. 8KB or even 4KB.


pgp2u3DjXRzwg.pgp
Description: PGP signature

Re: CD9660/md(4)/UFS22 silly behaviour

2012-01-08 Thread Kostik Belousov

On Sun, Jan 08, 2012 at 04:09:00PM -0800, Don Lewis wrote:
 On  8 Jan, Poul-Henning Kamp wrote:
  
  I'm doing som data-mining on a pile of ISO images right now.
  
  I stuck the ISOs on a UFS2 on a flash-disk for speed, and mdconfig(8)'d
  them so I could mount them.
  
  The traffic pattern his interesting:
  
  dT: 1.003s  w: 1.000s
   L(q)  ops/sr/s   kBps   ms/rw/s   kBps   ms/w   %busy Name
  [...]
  1733733   14661.3  0  00.0   98.2| md39
  1733733  234491.3  0  00.0   93.2| da0
  
  Notice the 1:16 ratio on kBps but 1:1 ratio on ops/s ?
  
  da0's UFS2 has 32k block-size:
  
  magic   19540119 (UFS2) timeWed Jan  4 16:41:47 2012
  superblock location 65536   id  [ 4f046cf5 c30697ee ]
  ncg 104 size19537685blocks  19228156
  bsize   32768   shift   15  mask0x8000
  fsize   4096shift   12  mask0xf000
  [...]
  
  It looks like every 2k read from CD9660 turns into a 32k block
  read in the UFS filesystem, without any beneficial caching happening.
 
 Probably some confusion about which filesystem layer owns the cached
 data.  It would probably be inefficient to cache the data in both
 places.  The best fix would probably be for CD9660 to think that the
 underlying device has 32Kb sectors.
I discussed the issue with phk further. The reason for discarding
the 30K of the read 32K block is that md(4) supplies IO_DIRECT flag
for VOP_READ, and FFS avoids putting the read data into any cache.

Most likely, we can implement a sysctl that would disable direct
reads, at the cost of double-buffering the data. For obvious reasons,
it is impossible to disable caching from the filesystem living on
top of md(4) volume.


pgpuHe11Cxjyv.pgp
Description: PGP signature

Re: dogfooding over in clusteradm land

2012-01-03 Thread Kostik Belousov

On Tue, Jan 03, 2012 at 12:02:22AM -0800, Don Lewis wrote:
 On  2 Jan, Don Lewis wrote:
  On  2 Jan, Don Lewis wrote:
  On  2 Jan, Florian Smeets wrote:
  
  This does not make a difference. I tried on 32K/4K with/without journal
  and on 16K/2K all exhibit the same problem. At some point during the
  cvs2svn conversion the sycer starts to use 100% CPU. The whole process
  hangs at that point sometimes for hours, from time to time it does
  continue doing some work, but really really slow. It's usually between
  revision 21 and 22, when the resulting svn file gets bigger than
  about 11-12Gb. At that point an ls in the target dir hangs in state ufs.
  
  I broke into ddb and ran all commands which i thought could be useful.
  The output is at http://tb.smeets.im/~flo/giant-ape_syncer.txt
  
  Tracing command syncer pid 9 tid 100183 td 0xfe00120e9000
  cpustop_handler() at cpustop_handler+0x2b
  ipi_nmi_handler() at ipi_nmi_handler+0x50
  trap() at trap+0x1a8
  nmi_calltrap() at nmi_calltrap+0x8
  --- trap 0x13, rip = 0x8082ba43, rsp = 0xff8000270fe0, rbp = 
  0xff88c97829a0 ---
  _mtx_assert() at _mtx_assert+0x13
  pmap_remove_write() at pmap_remove_write+0x38
  vm_object_page_remove_write() at vm_object_page_remove_write+0x1f
  vm_object_page_clean() at vm_object_page_clean+0x14d
  vfs_msync() at vfs_msync+0xf1
  sync_fsync() at sync_fsync+0x12a
  sync_vnode() at sync_vnode+0x157
  sched_sync() at sched_sync+0x1d1
  fork_exit() at fork_exit+0x135
  fork_trampoline() at fork_trampoline+0xe
  --- trap 0, rip = 0, rsp = 0xff88c9782d00, rbp = 0 ---
  
  I thinks this explains why the r228838 patch seems to help the problem.
  Instead of an application call to msync(), you're getting bitten by the
  syncer doing the equivalent.  I don't know why the syncer is CPU bound,
  though.  From my understanding of the patch it only optimizes the I/O.
  Without the patch, I would expect that the syncer would just spend a lot
  of time waiting on I/O.  My guess is that this is actually a vm problem.
  There are nested loops in vm_object_page_clean() and
  vm_object_page_remove_write(), so you could be doing something that's
  causing lots of looping in that code.
  
  Does the machine recover if you suspend cvs2svn?  I think what is
  happening is that cvs2svn is continuing to dirty pages while the syncer
  is trying to sync the file.  From my limited understanding of this code,
  it looks to me like every time cvs2svn dirties a page, it will trigger a
  call to vm_object_set_writeable_dirty(), which will increment
  object-generation.  Whenever vm_object_page_clean() detects a change in
  the generation count, it restarts its scan of the pages associated with
  the object.  This is probably not optimal ...
 
 Since the syncer is only trying to flush out pages that have been dirty
 for the last 30 seconds, I think that vm_object_set_writeable_dirty()
 should just make one pass through the object, ignoring generation, and
 then return when it is called from the syncer.  That should keep
 vm_object_set_writeable_dirty() from looping over the object again and
 again if another process is actively dirtying the object.
 
This sounds very plausible. I think that there is no sense in restarting
the scan if it is requested in async mode at all. See below.

Would be thrilled if this finally solves the svn2cvs issues.

commit 41aaafe5e3be5387949f303b8766da64ee4a521f
Author: Kostik Belousov kostik@sirion
Date:   Tue Jan 3 11:16:30 2012 +0200

Do not restart the scan in vm_object_page_clean() if requested
mode is async.

Proposed by:truckman

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 716916f..52fc08b 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -841,7 +841,8 @@ rescan:
if (p-valid == 0)
continue;
if (vm_page_sleep_if_busy(p, TRUE, vpcwai)) {
-   if (object-generation != curgeneration)
+   if ((flags  OBJPC_SYNC) != 0 
+   object-generation != curgeneration)
goto rescan;
np = vm_page_find_least(object, pi);
continue;
@@ -851,7 +852,8 @@ rescan:
 
n = vm_object_page_collect_flush(object, p, pagerflags,
flags, clearobjflags);
-   if (object-generation != curgeneration)
+   if ((flags  OBJPC_SYNC) != 0 
+   object-generation != curgeneration)
goto rescan;
 
/*


pgpCaQxKBZews.pgp
Description: PGP signature

Re: dogfooding over in clusteradm land

2012-01-03 Thread Kostik Belousov

On Tue, Jan 03, 2012 at 01:45:26AM -0800, Don Lewis wrote:
 On  3 Jan, Kostik Belousov wrote:
 
  This sounds very plausible. I think that there is no sense in restarting
  the scan if it is requested in async mode at all. See below.
  
  Would be thrilled if this finally solves the svn2cvs issues.
  
  commit 41aaafe5e3be5387949f303b8766da64ee4a521f
  Author: Kostik Belousov kostik@sirion
  Date:   Tue Jan 3 11:16:30 2012 +0200
  
  Do not restart the scan in vm_object_page_clean() if requested
  mode is async.
  
  Proposed by:truckman
  
  diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
  index 716916f..52fc08b 100644
  --- a/sys/vm/vm_object.c
  +++ b/sys/vm/vm_object.c
  @@ -841,7 +841,8 @@ rescan:
  if (p-valid == 0)
  continue;
  if (vm_page_sleep_if_busy(p, TRUE, vpcwai)) {
  -   if (object-generation != curgeneration)
  +   if ((flags  OBJPC_SYNC) != 0 
  +   object-generation != curgeneration)
  goto rescan;
  np = vm_page_find_least(object, pi);
  continue;
 
 I wonder if it would make more sense to just skip the busy pages in
 async mode instead of sleeping ...
 
It would be too much weakening the guarantee of the vfs_msync(MNT_NOWAIT)
to not write such pages, IMO. Busy state indeed means that the page most
likely undergoing the i/o, but in case it is not, we would not write it
at all.

Lets see whether the change alone helps. Do you agree ?


pgpsejHYZyDCu.pgp
Description: PGP signature

Re: dogfooding over in clusteradm land

2012-01-03 Thread Kostik Belousov

On Tue, Jan 03, 2012 at 02:57:17AM -0800, Don Lewis wrote:
 On  3 Jan, Kostik Belousov wrote:
  On Tue, Jan 03, 2012 at 01:45:26AM -0800, Don Lewis wrote:
  On  3 Jan, Kostik Belousov wrote:
  
   This sounds very plausible. I think that there is no sense in restarting
   the scan if it is requested in async mode at all. See below.
   
   Would be thrilled if this finally solves the svn2cvs issues.
   
   commit 41aaafe5e3be5387949f303b8766da64ee4a521f
   Author: Kostik Belousov kostik@sirion
   Date:   Tue Jan 3 11:16:30 2012 +0200
   
   Do not restart the scan in vm_object_page_clean() if requested
   mode is async.
   
   Proposed by: truckman
   
   diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
   index 716916f..52fc08b 100644
   --- a/sys/vm/vm_object.c
   +++ b/sys/vm/vm_object.c
   @@ -841,7 +841,8 @@ rescan:
if (p-valid == 0)
continue;
if (vm_page_sleep_if_busy(p, TRUE, vpcwai)) {
   -if (object-generation != curgeneration)
   +if ((flags  OBJPC_SYNC) != 0 
   +object-generation != curgeneration)
goto rescan;
np = vm_page_find_least(object, pi);
continue;
  
  I wonder if it would make more sense to just skip the busy pages in
  async mode instead of sleeping ...
  
  It would be too much weakening the guarantee of the vfs_msync(MNT_NOWAIT)
  to not write such pages, IMO. Busy state indeed means that the page most
  likely undergoing the i/o, but in case it is not, we would not write it
  at all.
 
 If the original code detects a busy page, it sleeps and then continues
 with the next page if generation hasn't changed.  If generation has
 changed, then it restarts the scan.
 
 With your change above, the code will skip the busy page after sleeping
 if it is running in async mode.  It won't make another attempt to write
 this page because it no longer attempts to rescan.
Why would it skip it ? Please note the call to vm_page_find_least()
with the pindex of the busy page right after the check for
generation. If a page with the pindex is still present in the object,
vm_page_find_least() should return it, and vm_object_page_clean() should
make another attempt at processing it.

Am I missing something ?
 
 My suggestion just omits the sleep in this particular case.
 
 The syncer should write the page the next time it runs, unless we're
 particularly unlucky ...
 
  Lets see whether the change alone helps. Do you agree ?
 
 Your patch is definitely worth trying as-is.  My latest suggestion is
 probably a minor additional optimization.


pgpe8aTZBd2Ul.pgp
Description: PGP signature

Re: dogfooding over in clusteradm land

2012-01-02 Thread Kostik Belousov

On Mon, Jan 02, 2012 at 12:47:03PM -0800, Don Lewis wrote:
 On  2 Jan, Florian Smeets wrote:
  On 29.12.11 01:04, Kirk McKusick wrote:
  Rather than changing BKVASIZE, I would try running the cvs2svn
  conversion on a 16K/2K filesystem and see if that sorts out the
  problem. If it does, it tells us that doubling the main block
  size and reducing the number of buffers by half is the problem.
  If that is the problem, then we will have to increase the KVM
  allocated to the buffer cache.
  
  
  This does not make a difference. I tried on 32K/4K with/without journal
  and on 16K/2K all exhibit the same problem. At some point during the
  cvs2svn conversion the sycer starts to use 100% CPU. The whole process
  hangs at that point sometimes for hours, from time to time it does
  continue doing some work, but really really slow. It's usually between
  revision 21 and 22, when the resulting svn file gets bigger than
  about 11-12Gb. At that point an ls in the target dir hangs in state ufs.
  
  I broke into ddb and ran all commands which i thought could be useful.
  The output is at http://tb.smeets.im/~flo/giant-ape_syncer.txt
 
 Tracing command syncer pid 9 tid 100183 td 0xfe00120e9000
 cpustop_handler() at cpustop_handler+0x2b
 ipi_nmi_handler() at ipi_nmi_handler+0x50
 trap() at trap+0x1a8
 nmi_calltrap() at nmi_calltrap+0x8
 --- trap 0x13, rip = 0x8082ba43, rsp = 0xff8000270fe0, rbp = 
 0xff88c97829a0 ---
 _mtx_assert() at _mtx_assert+0x13
 pmap_remove_write() at pmap_remove_write+0x38
 vm_object_page_remove_write() at vm_object_page_remove_write+0x1f
 vm_object_page_clean() at vm_object_page_clean+0x14d
 vfs_msync() at vfs_msync+0xf1
 sync_fsync() at sync_fsync+0x12a
 sync_vnode() at sync_vnode+0x157
 sched_sync() at sched_sync+0x1d1
 fork_exit() at fork_exit+0x135
 fork_trampoline() at fork_trampoline+0xe
 --- trap 0, rip = 0, rsp = 0xff88c9782d00, rbp = 0 ---
 
 I thinks this explains why the r228838 patch seems to help the problem.
 Instead of an application call to msync(), you're getting bitten by the
 syncer doing the equivalent.  I don't know why the syncer is CPU bound,
 though.  From my understanding of the patch it only optimizes the I/O.
 Without the patch, I would expect that the syncer would just spend a lot
 of time waiting on I/O.  My guess is that this is actually a vm problem.
 There are nested loops in vm_object_page_clean() and
 vm_object_page_remove_write(), so you could be doing something that's
 causing lots of looping in that code.
r228838 allows the system to skip 50-70% of the code when initiating a
write of the UFS file page, due to async clustering. The system has to
maintain 75% less amount of writes in progress.

 I think that ls is hanging because it's stumbling across the vnode that
 the syncer has locked.
This is the only reasonable explanation.

Low-tech profile is to periodically break out into ddb and do backtrace
for the syncer thread. More advanced techniques is to use dtrace or normal
profiling.


pgpqweDffT9HY.pgp
Description: PGP signature

Re: lang/lua: /usr/bin/ld: lapi.o: relocation R_X86_64_32 against `luaO_nilobject_' can not be used when making a shared object; recompile with -fPIC

2011-12-29 Thread Kostik Belousov

On Thu, Dec 29, 2011 at 12:19:40PM +0100, O. Hartmann wrote:
 Am 12/29/11 11:48, schrieb Rainer Hurling:
  On 28.12.2011 19:31 (UTC+1), Kostik Belousov wrote:
  On Wed, Dec 28, 2011 at 07:21:00PM +0100, O. Hartmann wrote:
  Am 12/28/11 19:10, schrieb Ed Schouten:
  * Rainer Hurlingrhur...@gwdg.de, 20111228 17:31:
  error: macro _Static_assert passed 3 arguments, but takes just 2
  In file included from
  /usr/ports/lang/gcc46/work/gcc-4.6-20111209/libstdc++-v3/include/precompiled/stdc++.h:103:0:
 
 
  Hmmm... This seems to apply to my changes. I will look into this
  tomorrow. Thanks for the report!
 
 
 
  Be aware that the error produced by the linker I mentioned in the
  initial post occurs on FreeBSD 10 as well as FreeBSD 9.0.
 
  I already filed a PR about the problem of a non compiling lang/gcc46
  today (ports/163672: lang/gcc46: make failed for lang/gcc46). For test
  puproses, I rebuild gcc46 on our FreeBSD 9.0 boxes - without any
  problem.
 
  I guess, the commit r228902 has been done to FreeBSD 10.0 and not 9.0.
 
  Obviously, linker error during the compilation of third-party software
  has nothing to do with compiler error occuring when building gcc.
 
  Do people ever read the texts of the messages ?
  
  Kostik, probably you are right. I had read the messages, but there are
  some strange errors with gcc46 on head for two days now, which leaded me
  in the wrong direction. So sorry for erroneously 'hijacking' this thread
  with another problem most certain only existing in head.
  
  
  I found another trail, which hopefully is more usefull for solving the
  problem Oliver described.
  
  Whe I try to build lang/lua I get this error:
  
  [..snip..]
  cc -o liblua.so -O2 -fno-strict-aliasing -pipe -msse3  -Wall
  -DLUA_USE_LINUX   -shared -Wl,-soname=liblua-5.1.so.1 lapi.o lcode.o
  ldebug.o ldo.o ldump.o lfunc.o lgc.o llex.o lmem.o lobject.o lopcodes.o
  lparser.o lstate.o lstring.o ltable.o ltm.o lundump.o lvm.o lzio.o
  lauxlib.o lbaselib.o ldblib.o liolib.o lmathlib.o loslib.o ltablib.o
  lstrlib.o loadlib.o linit.o
  ar rcu liblua.a lapi.o lcode.o ldebug.o ldo.o ldump.o lfunc.o lgc.o
  llex.o lmem.o lobject.o lopcodes.o lparser.o lstate.o lstring.o ltable.o
  ltm.o lundump.o lvm.o lzio.o lauxlib.o lbaselib.o ldblib.o liolib.o
  lmathlib.o loslib.o ltablib.o lstrlib.o loadlib.o linit.o
  ranlib liblua.a
  cc -o lua  lua.o liblua.a -lm -Wl,-E -lreadline
  cc -o luac  luac.o print.o liblua.a -lm -Wl,-E -lreadline
  /usr/bin/ld: lapi.o: relocation R_X86_64_32 against `luaO_nilobject_'
  can not be used when making a shared object; recompile with -fPIC
  lapi.o: could not read symbols: Bad value
  *** Error code 1
  
  
  It also gives a linker error, almost the same relocation is named. This
  does only happen with option '-msse3' enabled in /etc/make.conf:
  
  CFLAGS= -O2 -fno-strict-aliasing -pipe -msse3
  
  Using CLFAGS without -msse3 (default) works well:
  
  CFLAGS= -O2 -fno-strict-aliasing -pipe
  
  
  The systems processor, were this happens, is a
  
  CPU: AMD Phenom(tm) II X6 1090T Processor (3214.32-MHz K8-class CPU)
Origin = AuthenticAMD  Id = 0x100fa0  Family = 10  Model = a
  Stepping = 0
  Features=0x178bfbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT
  
Features2=0x802009SSE3,MON,CX16,POPCNT
AMD
  Features=0xee500800SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!
AMD
  Features2=0x37ffLAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT
  
TSC: P-state invariant, performance statistics
  
  FreeBSD 10-CURRENT (amd64) r228920
  
  In hope of a more belonging posting,
  Rainer
  ___
 
 
 WOW!
 
 I tried lang/lua on FreeBSD 9.0-PRE and see exactly this error:
 
 clang -O2  -fno-strict-aliasing -pipe -march=core2 -Wall -DLUA_USE_LINUX
 -c print.c
 clang -o liblua.so -O2  -fno-strict-aliasing -pipe -march=core2 -Wall
 -DLUA_USE_LINUX   -shared -Wl,-soname=liblua-5.1.so.1 lapi.o lcode.o
 ldebug.o ldo.o ldump.o lfunc.o lgc.o llex.o lmem.o lobject.o lopcodes.o
 lparser.o lstate.o lstring.o ltable.o ltm.o lundump.o lvm.o lzio.o
 lauxlib.o lbaselib.o ldblib.o liolib.o lmathlib.o loslib.o ltablib.o
 lstrlib.o loadlib.o linit.o
 /usr/bin/ld: lapi.o: relocation R_X86_64_32 against `luaO_nilobject_'
 can not be used when making a shared object; recompile with -fPIC
 lapi.o: could not read symbols: Bad value
 clang: error: linker command failed with exit code 1 (use -v to see
 invocation)
 *** Error code 1
 ar rcu liblua.a lapi.o lcode.o ldebug.o ldo.o ldump.o lfunc.o lgc.o
 llex.o lmem.o lobject.o lopcodes.o lparser.o lstate.o lstring.o ltable.o
 ltm.o lundump.o lvm.o lzio.o lauxlib.o lbaselib.o ldblib.o liolib.o
 lmathlib.o loslib.o ltablib.o lstrlib.o loadlib.o linit.o
 ranlib liblua.a
 1 error
 *** Error code 2
 1 error
 *** Error code 2
 1 error
 *** Error code 1
 
 Stop in /usr/ports/lang/lua.
 
 === make failed for lang/lua

Re: /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a: could not read symbols: Bad value

2011-12-28 Thread Kostik Belousov

On Wed, Dec 28, 2011 at 11:48:28AM +0100, O. Hartmann wrote:
 Hello out here.
 
 I run into a problem since one of the last portupdates and I do not know
 whether this has to do with binutils or gcc46 or even FreeBSD 9.0/10.0
 AMD64.
 
 Background:
 We use a scientific graphical toolset for planetary research called
 ISIS3, which is provided by the USGS. We patched ISIS3 to run on FreeBSD
 8/9/10 so far and it ran well with FreeBSD 8.2-STABLE and 9.0-PRE a
 couple of weeks ago.
 On all of my boxes, I do frequently a portupgrade, so I saw binutils got
 bumped up and gcc 4.6 is also getting really frequently changed these days.
 After a some portupdates within the last weeks, I just decided to
 compile ISIS3 again to keep it fresh and on track, but it won't
 compile anymore.
 
 On all FreeBSD 9.0-PRERELEASE and FreeBSD 10.0-CURRENT (all AMD64 and
 CLANG built) I receive in some subfolders containing sources the
 follwoing error:
 
 [...]
 Adding API object [UniqueIOCachingAlgorithm]
 Adding API object [UniversalGroundMap]
 Adding API object [UserInterface]
 Adding API object [VariableLineScanCameraDetectorMap]
 Adding API object [VecFilter]
 Adding API object [WorldMapper]
 Adding API object [iException]
 Adding API object [iString]
 Adding API object [iTime]
   Working on Package [mex] (11:30:15)
 Adding API object [HrscCamera]
 /usr/local/bin/ld:
 /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a(functexcept.o):
 relocation R_X86_64_32 against `std::bad_exception::~bad_exception()'
 can not be used when making a shared object; recompile with -fPIC
 /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a:
 could not read symbols: Bad value
 collect2: ld returned 1 exit status
 gmake[5]: *** [plugin] Error 1
 cp: libHrscCamera.so: No such file or directory
 gmake[4]: *** [install] Error 1
The error is completely clear as it is: the build tries to link static
library libstdc++.so into shared object. This is not supported.


pgpqYjyBtTIwA.pgp
Description: PGP signature

Re: /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a: could not read symbols: Bad value

2011-12-28 Thread Kostik Belousov

On Wed, Dec 28, 2011 at 03:10:12PM +0100, O. Hartmann wrote:
Am 12/28/11 14:58, schrieb Kostik Belousov:
On Wed, Dec 28, 2011 at 11:48:28AM +0100, O. Hartmann wrote:
Hello out here.

I run into a problem since one of the last portupdates and I do not know
whether this has to do with binutils or gcc46 or even FreeBSD 9.0/10.0
AMD64.

Background:
We use a scientific graphical toolset for planetary research called
ISIS3, which is provided by the USGS. We patched ISIS3 to run on FreeBSD
8/9/10 so far and it ran well with FreeBSD 8.2-STABLE and 9.0-PRE a
couple of weeks ago.
On all of my boxes, I do frequently a portupgrade, so I saw binutils got
bumped up and gcc 4.6 is also getting really frequently changed these days.
After a some portupdates within the last weeks, I just decided to
compile ISIS3 again to keep it fresh and on track, but it won't
compile anymore.

On all FreeBSD 9.0-PRERELEASE and FreeBSD 10.0-CURRENT (all AMD64 and
CLANG built) I receive in some subfolders containing sources the
follwoing error:

[...]
Adding API object [UniqueIOCachingAlgorithm]
Adding API object [UniversalGroundMap]
Adding API object [UserInterface]
Adding API object [VariableLineScanCameraDetectorMap]
Adding API object [VecFilter]
Adding API object [WorldMapper]
Adding API object [iException]
Adding API object [iString]
Adding API object [iTime]
Working on Package [mex] (11:30:15)
Adding API object [HrscCamera]
/usr/local/bin/ld:
/usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a(functexcept.o):
relocation R_X86_64_32 against `std::bad_exception::~bad_exception()'
can not be used when making a shared object; recompile with -fPIC
/usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a:
could not read symbols: Bad value
collect2: ld returned 1 exit status
gmake[5]: *** [plugin] Error 1
cp: libHrscCamera.so: No such file or directory
gmake[4]: *** [install] Error 1
The error is completely clear as it is: the build tries to link static
library libstdc++.so into shared object. This is not supported.

Thanks, Kostik, for the fast response.
The error isn't so clear to me, sorry. I thought libstdc++.a is the
static library and it is taken to be referenced/compiled into a shared
Linked in.

object created by the application I try to compile.
Right, and this is not supported. Code linked into shared object must
be compiled PIC. An .a library usually does not contain objects compiled
by PIC, ld just dutifully reported back.

I'm much more confused now, since I thought the last time I compiled
that piece of software, I never got any error like that. Well, clang
fails with some obscure errors on the code itself and I'm unwilling to
correct them, I'll try the legacy gcc 4.2.1 and will report what's
happening.

It might have worked by accident (because libstdc++.a objects referenced
during the link did not carried unsupported relocations), or, much more
likely, the build system has changed and started doing stupid things.
It must not link static libraries into shared objects.

You should examine why it does this, and fix it. Changing compilers is
just wasting a time.

pgpkHA6aHIGk1.pgp
Description: PGP signature

Re: /usr/local/lib/gcc46/gcc/x86_64-portbld-freebsd9.0/4.6.3/../../../libstdc++.a: could not read symbols: Bad value

2011-12-28 Thread Kostik Belousov

On Wed, Dec 28, 2011 at 07:21:00PM +0100, O. Hartmann wrote:
 Am 12/28/11 19:10, schrieb Ed Schouten:
  * Rainer Hurling rhur...@gwdg.de, 20111228 17:31:
  error: macro _Static_assert passed 3 arguments, but takes just 2
  In file included from 
  /usr/ports/lang/gcc46/work/gcc-4.6-20111209/libstdc++-v3/include/precompiled/stdc++.h:103:0:
  
  Hmmm... This seems to apply to my changes. I will look into this
  tomorrow. Thanks for the report!
  
 
 
 Be aware that the error produced by the linker I mentioned in the
 initial post occurs on FreeBSD 10 as well as FreeBSD 9.0.
 
 I already filed a PR about the problem of a non compiling lang/gcc46
 today (ports/163672: lang/gcc46: make failed for lang/gcc46). For test
 puproses, I rebuild gcc46 on our FreeBSD 9.0 boxes - without any problem.
 
 I guess, the commit r228902 has been done to FreeBSD 10.0 and not 9.0.

Obviously, linker error during the compilation of third-party software
has nothing to do with compiler error occuring when building gcc.

Do people ever read the texts of the messages ?


pgpXZQ7ftuzPn.pgp
Description: PGP signature

Re: SU+J systems do not fsck themselves

2011-12-28 Thread Kostik Belousov

On Wed, Dec 28, 2011 at 11:14:19AM -0800, m...@freebsd.org wrote:
 SU doesn't care about write ordering, as long as everything before a
 BIO_FLUSH is really flushed by the time the BIO_FLUSH is acknowledged.
No.

SU and SU+J only require that write completed notification is issued
when geom/driver layer guarantees that the block content is written
to the stable storage.

SU does not depend on non-reordering of writes in any way, as well as
it does not issue BIO_FLUSH.


pgpkZdtAOt9Lt.pgp
Description: PGP signature

Re: [patch] Cleaning up amd64 kernel optimization options

2011-12-23 Thread Kostik Belousov

On Fri, Dec 23, 2011 at 06:03:42PM +0100, Dimitry Andric wrote:
 On 2011-12-23 17:00, Alexander Best wrote:
 ...
 Back in the 7.x days, I ran into some code that wasn't easily to debug 
 because the compiler optimized things out with -O2 by inlining and
 otherwise shifting around code, so setting breakpoints in gdb became 
 difficult. So from that point on I've gotten into the habit of doing -O
 explicitly in make.conf if DEBUG_FLAGS was specified. Just a thought..
 
 I still leave -O2 in, but what I do is this:
 
make DEBUG_FLAGS=-g -fno-inline
 
 would making -O2 -fno-inline the default flags introduce any major 
 slowdown?
 
 Not major, but a minor slowdown is quite possible, especially on arches
 like x86, where call overhead is relatively high, and code caches are
 relatively large.
 
 Anyway, I'd rather get the thread back to my original patch instead of
 messing around with the default flags for release, which have been -O2
 for a long time now.  If people want to override those for specific
 reasons, they can always set COPTFLAGS or DEBUG_FLAGS manually, like
 John shows.
 
 The only thing my patch makes sure of, is that amd64 does the same thing
 as all other arches, e.g.: compile with a low optimization settings for
 debug (-O, which is equivalent to -O1), compile with arch-specific high
 optimization settings for release (-O2 plus whatever is required for the
 arch, or lower if optimization breaks things).

Release is built with -g for long time, this is where the symbol files
in /boot/kernel comes from.


pgp5KdmN3MSqu.pgp
Description: PGP signature

Re: [patch] Cleaning up amd64 kernel optimization options

2011-12-23 Thread Kostik Belousov

On Fri, Dec 23, 2011 at 08:04:32PM +0100, Dimitry Andric wrote:
 On 2011-12-23 18:55, Kostik Belousov wrote:
 On Fri, Dec 23, 2011 at 06:03:42PM +0100, Dimitry Andric wrote:
 ...
 The only thing my patch makes sure of, is that amd64 does the same thing
 as all other arches, e.g.: compile with a low optimization settings for
 debug (-O, which is equivalent to -O1), compile with arch-specific high
 optimization settings for release (-O2 plus whatever is required for the
 arch, or lower if optimization breaks things).
 
 Release is built with -g for long time, this is where the symbol files
 in /boot/kernel comes from.
 
 Ah, that is done via 'makeoptions DEBUG=-g' in the kernel configuration
 file, right?  I didn't realize that was kept in for a release.  But even
 in that case, amd64 is somehow different from the other arches, which
 all get compiled with -O instead.
Yes.

 
 If people prefer that to stay as it is, I'll change the diff so only
 -frename-registers gets removed when clang is used, as clang does not
 support this flag.

This question cannot be answered without measurement. I think that even
the 'default' benchmark of buildworld over -O and -O2 kernels can be useful
to continue the discussion.


pgpCNEdA1st9j.pgp
Description: PGP signature

Re: extattr_set_*() return type

2011-12-21 Thread Kostik Belousov

On Wed, Dec 21, 2011 at 10:31:11AM -0500, John Baldwin wrote:
 On Tuesday, December 20, 2011 5:18:58 pm m...@freebsd.org wrote:
  On Tue, Dec 20, 2011 at 1:49 PM, John Baldwin j...@freebsd.org wrote:
   Hmm, if these functions are expected to operate like 'write(2)' and are
   supposed to return the number of bytes written, shouldn't their return 
   value
   be 'ssize_t' instead of 'int'?  It looks like the system calls themselves
   already do the right thing in setting td_retval[] (they assign a ssize_t 
   to it
   and td_retval[0] can hold a ssize_t on all of our current platforms).  It
   would seem that the only change would be to the header and probably
   syscalls.master.  I guess this would require a symver bump to fix though.
  
  An extended attribute larger than 2GB is a programming abuse, though.
  Technically int may not be 32 bits but it is on all supported
  platforms now.
 
 Today it is an abuse.  In the 90's a 64-bit off_t was considered an abuse by
 some. :)
 
 The type should match the documented behavior.  On OS X the set operation
 doesn't return a size but instead returns a simple success/failure (0 or -1)
 for which an int is appropriate.  However, the FreeBSD API documents that it
 operates like write and consumes the buffer.   Note that the size of the
 buffer passed to the 'set' and 'get' operations is a size_t, not an int, and
 the 'get' operations already return a ssize_t, not an int.

Note that read(2)/write(2) do return int. I still have WIP patch to fix
this, but after some conversations with Bruce I am not sure it is worth
finishing.


pgpGVG18c7Kk7.pgp
Description: PGP signature

Re: extattr_set_*() return type

2011-12-21 Thread Kostik Belousov

On Wed, Dec 21, 2011 at 12:25:18PM -0500, John Baldwin wrote:
 On Wednesday, December 21, 2011 11:13:10 am Kostik Belousov wrote:
  On Wed, Dec 21, 2011 at 10:31:11AM -0500, John Baldwin wrote:
   On Tuesday, December 20, 2011 5:18:58 pm m...@freebsd.org wrote:
On Tue, Dec 20, 2011 at 1:49 PM, John Baldwin j...@freebsd.org wrote:
 Hmm, if these functions are expected to operate like 'write(2)' and 
 are
 supposed to return the number of bytes written, shouldn't their 
 return value
 be 'ssize_t' instead of 'int'?  It looks like the system calls 
 themselves
 already do the right thing in setting td_retval[] (they assign a 
 ssize_t to it
 and td_retval[0] can hold a ssize_t on all of our current platforms). 
  It
 would seem that the only change would be to the header and probably
 syscalls.master.  I guess this would require a symver bump to fix 
 though.

An extended attribute larger than 2GB is a programming abuse, though.
Technically int may not be 32 bits but it is on all supported
platforms now.
   
   Today it is an abuse.  In the 90's a 64-bit off_t was considered an abuse 
   by
   some. :)
   
   The type should match the documented behavior.  On OS X the set operation
   doesn't return a size but instead returns a simple success/failure (0 or 
   -1)
   for which an int is appropriate.  However, the FreeBSD API documents that 
   it
   operates like write and consumes the buffer.   Note that the size of the
   buffer passed to the 'set' and 'get' operations is a size_t, not an int, 
   and
   the 'get' operations already return a ssize_t, not an int.
  
  Note that read(2)/write(2) do return int. I still have WIP patch to fix
  this, but after some conversations with Bruce I am not sure it is worth
  finishing.
 
 The manpages and /usr/include/unistd.h claim they return ssize_t.  Is this
 related to the changes to make uio_resid a size_t (I thought that went into
 the tree)?  If the problem is that the values read/write return may fall into
 the range of only an int even on 64-bit platforms, that is different from the
 return type which is part of the ABI.
Yes, it is related. The type change for uio was done in advance.

Take a look at the first statement of sys_read() and sys_write():
if (uap-nbyte  INT_MAX)
return (EINVAL);
and at the copyinio(), which is used by scatter/gather versions of i/o
syscalls to copy in uiovec:
if (iov-iov_len  INT_MAX - uio-uio_resid) {
free(uio, M_IOV);
return (EINVAL);



pgp0QFlBvfWRz.pgp
Description: PGP signature

Re: calling all fs experts

2011-12-11 Thread Kostik Belousov

On Sat, Dec 10, 2011 at 05:42:01PM -0800, Maksim Yevmenkin wrote:
 Hello,
 
 i have a question for fs wizards.
 
 suppose i can persuade modern spinning disk to do large reads (say
 512K to 1M) at a time. also, suppose file system on such modern
 spinning drive is used to store large files (tens to hundreds of
 megabytes). is there any way i can tweak the file system parameters
 (block size, layout, etc) to help it to get as close to disk's
 sequential read rate as possible. I understand that i will not be
 able to get 100MB/sec single client sequential read rate, but, can i
 get it into sustained 40-50MB/sec rate? also, can i reduce performance
 impact caused by small reads such as directory access etc.

If you wanted to get responses from experts only, sorry in advance.

The fs (AKA UFS) uses clustering provided by the block cache. The clustering
code, mainly located in the kern/vfs_cluster.c, coalesces sequence of
reads or writes that are targeting the consequtive blocks, into single
physical read or write of the maximal size of MAXPHYS. Current definition
of MAXPHYS is 128KB.

Clustering allows filesystem to improve the layout of the files by calling
VOP_REALLOCBLKS() to redo the allocation to make the writing sequence of
blocks sequential if it is not.

Even if file is not layed out ideally, or the i/o pattern is random, most
writes scheduled are asynchronous, and for reads, the system tries to
schedule read-aheads for some limited number of blocks. This allows the
lower layers, i.e. geom and disk drivers, to optimize the i/o queue
to coalesce requests that are consequitive on disk, but not on the queue.

BTW, some time ago I was interested in the effect on the fragmentation
on UFS, due to some semi-abandoned patch, which could make the
fragmentation worse. I wrote the tool that calculated the percentage
of non-consequtive spots in the whole filesystem. Apparently, even
under the hard load consisting of writing a lot of files under the
megabytes in size, UFS managed to keep the number of spots under 2-3% on
sufficiently free volume.


pgpg2apEuMeNy.pgp
Description: PGP signature

Re: Adding bool, true, and false to the kernel

2011-12-09 Thread Kostik Belousov

On Thu, Dec 08, 2011 at 04:36:58PM -0800, Matthew Fleming wrote:
 This code should be MFC-able to stable/9 after 9.0 is released.
Yes, please make sure to merge it to stable/9.



pgpJdscbva0LS.pgp
Description: PGP signature

Re: running old binaries on -current

2011-12-04 Thread Kostik Belousov

On Sun, Dec 04, 2011 at 08:54:27PM +0100, Yamagi Burmeister wrote:
 On Fri, 2 Jul 2010 21:44:26 +0300
 Kostik Belousov kostik...@gmail.com wrote:
 
  On Fri, Jul 02, 2010 at 11:12:13AM -0700, Julian Elischer wrote:
   every now and then, for fun I run up a chroot of freebsd 1.1. or 1.0 
   under a chroot.  Usually hillarity ensues with teh 15 second kernel 
   compile and the 4 minute make world.
   
   in -current I can't do that any more.. any binary just exits with 'Abort'.
   
   I think I last tried it in 7.0 or there abouts.
   
   I have options COMPAT_AOUT and COMPAT_FREEBSD4 through COMPAT_FREEBSD7
   
   does anyone else have any ideas as to what may be needed?
   
   I vaguely remember another option but I am not seeing it at the moment.
   
   For those of you who do not remember, 1.0 had a.out static binaries only.
   
  
  Can you ktrace/kdump the run attempt, I suspect that abort is
  sent by image activator.
  
  Also, please put some static binary somewhere to download.
 
 Hi,
 digging around I found one of the first programs ever written. It's an
 static aout binary from 1994 or so, running it on FreeBSD 8.2 shows
 exactly this problem. Some more extensive tests narrowed this problem
 down to FreeBSD 1.x binaries, they're all broken. FreeBSD 2 aout
 binaries are still working. 
 Since Google found this rather old thread I decided to provide the data
 your requested, but I do not insist on a fix. I guess that nobody will
 try to run FreeBSD 1.x binaries these day... So if you want to spend
 some work on this I'll test patches but if you decide to do nothing
 it's okay.
 
 This is the /bin/sh of FreeBSD 1.0 (taken from the official
 installation cd image f?r i386):
 
 % file sh
 sh: VAX demand paged pure executable
 
 % ./sh
 Abort
 
 The ktrace / kdump:
   1065 ktrace   RET   ktrace 0
   1065 ktrace   CALL  execve(0xbfbfee1b,0xbfbfecf8,0xbfbfed00)
   1065 ktrace   NAMI  ./sh
 
 I've uploaded the binary to:
 http://deponie.yamagi.org/freebsd/tmp/sh_freebsd1
Try to set sysctl security.bsd.map_at_zero to 1.


pgpg2AX8U5A4g.pgp
Description: PGP signature

Re: r227487 breaks C++ programs that use __isthreaded

2011-12-02 Thread Kostik Belousov

On Thu, Dec 01, 2011 at 04:23:11PM -0500, David Schultz wrote:
 On Thu, Dec 01, 2011, George Liaskos wrote:
  Hello
  
  One example is Google's tcmalloc [1], is this behaviour intended?
  
  [1] 
  http://code.google.com/p/google-perftools/source/browse/trunk/src/maybe_threads.cc
 
 This code uses an unportable workaround for a bug that I believe
 was fixed in r227999.  Using internal names starting with a double
 underscore isn't supported.
 
 Separately, I'm still hoping that the namespace polution
 introduced in r227487 gets fixed...
I handled the pthread_once mess in libunwind without relying on
__isthreaded, but I indeed only needed once working.

http://git.savannah.gnu.org/cgit/libunwind.git/commit/?id=08077a4962c4e606598f9f0e54b515b3c882be10


pgpg3qjpoFMCo.pgp
Description: PGP signature

Re: [patch] turning devctl into a multiple openable device

2011-12-02 Thread Kostik Belousov

On Fri, Dec 02, 2011 at 12:17:21AM +0100, Baptiste Daroussin wrote:
 On Wed, Nov 30, 2011 at 05:20:17PM +0100, Olivier Houchard wrote:
  On Wed, Nov 30, 2011 at 06:04:50PM +0200, Kostik Belousov wrote:
 I wonder why the waiting_threads stuff is needed at all. The cv could
 be woken up unconditionally everytime. What is the reason for the 
 cv_wait
 call in cdevpriv data destructor ? You cannot have a thread doing e.g.
 read on the file descriptor while destructor is run.
 

What will prevent you from having a thread stuck in read(), while an 
another 
one close() the fd ?

   Nothing, but file reference count goes to zero only after the thread
   stuck in read is unstuck. Cdevpriv destructor is run only when file
   reference count becomes zero, i.e. there can be no any accessing threads,
   and new accesses are impossible since file descriptors also own references
   on the file.
  
  Right, I was a bit confused, this part can be removed.
  
  Regards,
  
  Olivier
 
 Here is a new version of the patch mostly reworked by Olivier,
 It doesn't duplicate anymore the devq, and fix all that have been
 spotted here previously.
 
 http://people.freebsd.org/~bapt/devctl_multi_open.diff
 
 bonus, it removes the needless giant lock
I do not see a need in the use of refcount KPI, since it seems that
all manipulations of n-refcount happen under global_devctl.mtx.
Am I missed the place ?

Releasing refcount on dev_event_info before using it in devread()
allows other thread to free the memory while current thread still
uses it.

Stylish notes:
mtx member in struct global_devctl has weird indentation;
please move declaration of n2 into the declaration block of devdtor();
devread():should_free is initialized at the declaration site;
devctl_queue_data_f() has stray empty line ater the while { loop line,
while need blank line at the start of devctl_queue_data() is removed.


pgp5tdbw3yh7H.pgp
Description: PGP signature

Re: Incorrect tag= in /usr/src/share/examples/cvsup/sta(ble|ndard) in 9.0-PRERELEASE?

2011-12-01 Thread Kostik Belousov

On Thu, Dec 01, 2011 at 12:12:18PM +0300, Sergey Kandaurov wrote:
 On 1 December 2011 10:20, Milan Obuch freebsd-curr...@dino.sk wrote:
  On Tue, 29 Nov 2011 19:22:39 +0300
  Sergey Kandaurov pluk...@gmail.com wrote:
 
  On 29 November 2011 20:16, Maxim Khitrov m...@mxcrypt.com wrote:
   On Tue, Nov 29, 2011 at 10:30 AM, Sergey Kandaurov
   pluk...@gmail.com wrote:
   On 26 November 2011 11:44, Milan Obuch freebsd-curr...@dino.sk
   wrote:
   Hi,
  
   I am playing a bit with 9.0-PRERELEASE compiling it from source
   updated via csup. In both example files there is line specifying
   what to csup
  
   *default release=cvs tag=RELENG_8
  
   which is incorrect, I think. It is convenient for me to issue just
  
   csup -h cvsup.freebsd.sk /usr/share/examples/cvsup/stable-supfile
  
   to update full sources without need to create any cvsup config
   file, however in system installed from 9.0 snapshot (maybe two
   weeks old) this file points to version 8 files, so I need to
   correct it for 9.0-PRERELEASE to not accidentally download older
   version sources.
  
   The same is also true after upgrade from source - make
   installworld install example files pointing to older version...
  
   Is it something I do not know about or is it an oversight? I
   think this line should already be changed to new tag...
  
   *default release=cvs tag=RELENG_9
  
   Hi.
  
   Fixed. Thanks for your report.
   Now cvs tag points to RELENG_9 in 9.x sources.
  
   Should standard-supfile also be updated to point to RELENG_9_0? I'm
   using csup with tag=RELENG_9_0 and standard-supfile still points
   to HEAD.
 
  Yep, sure.
  I just sent a request to the Release Engineering Team.
 
 
  It works for me now as expected, thanks.
 
  Anyway, there is a question what the difference between stable-supfile
  and standard-supfile should be. I looked in my local csupped sources,
  they are the same in 6-STABLE (OK, some history here), 7-STABLE,
  8-STABLE and 9-STABLE. Are they expected to be used differently?
 
 In STABLE branches standard-supfile and stable-supfile are used to have
 the same cvs tag. FYI, compare how it is done in RELEASE branches.
 
  And, second one - what about CURRENT? In stable-supfile I see
  tag=RELENG_9 which is not quite clear, but just for some pedantry... I
  use standard-supfile for CURRENT, so this is not an issue for me either.
 
 To my knowledge, in CURRENT a standard-supfile's cvs tag should be
 read as the latest (i.e. the most recently created) stable branch.
Could the supfiles be generated from some value in newvers.sh ?


pgpWdpj5GwCx0.pgp
Description: PGP signature

Re: [patch] turning devctl into a multiple openable device

2011-11-30 Thread Kostik Belousov

On Wed, Nov 30, 2011 at 10:05:11AM -0500, John Baldwin wrote:
 On Wednesday, November 30, 2011 7:43:20 am Baptiste Daroussin wrote:
  Hi all,
  
  With the help of cognet, I wrote a patch to turn devctl into a multiple 
  openable
  device, that mean that it will allow to open /dev/devctl in multiple 
  programs,
  for example hald and everythings that want to receive notification from the
  device won't need to depend on haveing devd running.
  
  here is the patch: 
  http://people.freebsd.org/~bapt/devctl_multi_open.diff
 
 Shouldn't devctl_queue_data_f() use the requested malloc() flags instead of
 hardcoding M_NOWAIT?
This is an obvious fallback of holding mutex around the call to
per_devctl_queue_data_f(), which caused the author a trouble to use
M_WAITOK.

Having n readers causes the patch to queue each message n times, that looks
like a waste.

I wonder why the waiting_threads stuff is needed at all. The cv could
be woken up unconditionally everytime. What is the reason for the cv_wait
call in cdevpriv data destructor ? You cannot have a thread doing e.g.
read on the file descriptor while destructor is run.

 
 Also, I know that it was an intentional design decisison by Warner to have
 the multiplexing of devctl data done in userland via devd rather than in the
 kernel.  (I think he envisioned devd providing a UNIX domain socket or some
 such for other daemons to use to listen to events.)  Have you asked him about
 this change?
And I fully agree that doing multiplexing in user mode is the right approach.
Not least because you could apply some advanced access control and provide
filtering for the consumers.


pgpL1GaOrgCj6.pgp
Description: PGP signature

Re: [patch] turning devctl into a multiple openable device

2011-11-30 Thread Kostik Belousov

On Wed, Nov 30, 2011 at 04:55:21PM +0100, Olivier Houchard wrote:
 On Wed, Nov 30, 2011 at 05:46:36PM +0200, Kostik Belousov wrote:
  On Wed, Nov 30, 2011 at 10:05:11AM -0500, John Baldwin wrote:
   On Wednesday, November 30, 2011 7:43:20 am Baptiste Daroussin wrote:
Hi all,

With the help of cognet, I wrote a patch to turn devctl into a multiple 
openable
device, that mean that it will allow to open /dev/devctl in multiple 
programs,
for example hald and everythings that want to receive notification from 
the
device won't need to depend on haveing devd running.

here is the patch: 
http://people.freebsd.org/~bapt/devctl_multi_open.diff
   
   Shouldn't devctl_queue_data_f() use the requested malloc() flags instead 
   of
   hardcoding M_NOWAIT?
  This is an obvious fallback of holding mutex around the call to
  per_devctl_queue_data_f(), which caused the author a trouble to use
  M_WAITOK.
  
  Having n readers causes the patch to queue each message n times, that looks
  like a waste.
  
 
 Queuing the message only one time would require to somehow keep a state, to
 know which thread read which message, and figuring out when to free a message
 can be an headache. Given I don't think they'll be a lot of readers, I'm not
 sure it's worth the trouble.
 
  I wonder why the waiting_threads stuff is needed at all. The cv could
  be woken up unconditionally everytime. What is the reason for the cv_wait
  call in cdevpriv data destructor ? You cannot have a thread doing e.g.
  read on the file descriptor while destructor is run.
  
 
 What will prevent you from having a thread stuck in read(), while an another 
 one close() the fd ?
 
Nothing, but file reference count goes to zero only after the thread
stuck in read is unstuck. Cdevpriv destructor is run only when file
reference count becomes zero, i.e. there can be no any accessing threads,
and new accesses are impossible since file descriptors also own references
on the file.


pgpc9tuCURiWe.pgp
Description: PGP signature

Re: panic: handle_written_inodeblock: live inodedep

2011-11-26 Thread Kostik Belousov

On Fri, Nov 25, 2011 at 09:08:27PM +0400, Gleb Smirnoff wrote:
   Hi!
 
   Running HEAD with KMS patches from kib@, a 15th November
 snapshot build from kms branch from ~kib/deviant2 repository.
 The panic message and backtrace look unrelated to KMS, so,
 I think my kernel can be considered a head from 15-th of November.
 
 Panic happened a couple of seconds after system finished booting
 up to login prompt. I didn't login yet, and X server hadn't been
 started.
 
 Backtrace from gdb:
 
 #9  0x8049f5d0 in panic (fmt=Variable fmt is not available.
 )
 at /usr/home/glebius/src/deviant2/sys/kern/kern_shutdown.c:600
 #10 0x806a3b19 in softdep_disk_write_complete (bp=0xff80eff76500)
 at /usr/home/glebius/src/deviant2/sys/ufs/ffs/ffs_softdep.c:11020
 #11 0x8051d97c in bufdone_finish (bp=0xff80eff76500) at buf.h:418
 ---Type return to continue, or q return to quit---
 #12 0x8051dcb8 in bufdone (bp=0xff80eff76500)
 at /usr/home/glebius/src/deviant2/sys/kern/vfs_bio.c:3328
 #13 0x8043ac96 in g_io_schedule_up (tp=Variable tp is not available.
 )
 at /usr/home/glebius/src/deviant2/sys/geom/geom_io.c:679
 #14 0x8043b21c in g_up_procbody (arg=Variable arg is not available.
 )
 at /usr/home/glebius/src/deviant2/sys/geom/geom_kern.c:97
 #15 0x80472d5f in fork_exit (
 callout=0x8043b1c0 g_up_procbody, arg=0x0, 
 frame=0xff8000264c50)
 at /usr/home/glebius/src/deviant2/sys/kern/kern_fork.c:995
 #16 0x806f1a9e in fork_trampoline ()
 at /usr/home/glebius/src/deviant2/sys/amd64/amd64/exception.S:602
 
 I'm running SUJ on all partitions. On the next boot after panic fsck
 failed to run w/o -y option.
 I can provide any additional information or share core file.

Did you have i/o errors before the panic ?


pgpy6ZMCZQDZk.pgp
Description: PGP signature

Re: RLIMIT_DATA and malloc(3) use of mmap(2)

2011-11-24 Thread Kostik Belousov

On Thu, Nov 24, 2011 at 09:38:19PM +0400, Maxim Konovalov wrote:
 Hi Kostik,
 
 On Wed, 23 Nov 2011, 11:22+0400, Maxim Konovalov wrote:
 
  [...]
   Anyway, the patch needs testers before I will push it forward.
  
  [igor's email was corrected]
 
  We will test it in out environment and let you know.
 
 It seems I don't understand how it works.
 
 Here is a test program: http://maxim.int.ru/stuff/malloc_test.c.txt
 
 It allocates successfully 64x1MB chunks by malloc(), reallocf() and
 realloc() with the following command line:
 
 MALLOC_OPTIONS=JM limits -d 10m ./malloc_test
 
 When we add L flag to the MALLOC_OPTIONS it starts to work strange:
 
 It allocates 64MB via malloc() but fails to allocate more than 3MB by
 realloc() and reallocf().
The non-failing 64MB allocation is easily explained by a bug.
I forgot to account for the case when existing map entry was expanded,
instead of new entry created due to mmap(2).

The other part, in particular, the failure after 3MB, is in fact the
correct behaviour. Jemalloc() caches allocation of the pages, and it
allocates more then asked in the request. ktrace(1) shows the whole
history.

Malloc() first allocated 4MB for the needs of libc etc. Then, it allocated
4MB chunk which was used for satisfying the requests of 1-3MB. When the
4MB request came in, the allocation for 8MB was attempted, and failed,
since 4MB (libc etc) + 8MB = 12MB  data limit.
 
 More funny, the result varies from time to time:
I have no explanation for this, and cannot reproduce the issue.
Look at the ktrace.

Overall, the test is quite curious but absolutely unrealistic.
Fixed patch is available at
http://people.freebsd.org/~kib/misc/map_datalimit.2.patch


pgpKGquyhMOov.pgp
Description: PGP signature

Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers

2011-11-22 Thread Kostik Belousov

On Mon, Nov 21, 2011 at 06:39:26PM +0100, Robert Millan wrote:
 (replying with Debian hat this time)
 
 2011/11/21 Kostik Belousov kostik...@gmail.com:
  There are some implementations that
  use FreeBSD kernel, and which could potentially benefit from providing
  its own value for __FreeBSD_kernel.
 
 Actually, we wouldn't be able to provide a different value for the
 macro (whatever its name). Our compiler simply doesn't know which
 version of the kernel it is building for. Only the headers do, but if
 we define it in the headers we'd just use the FreeBSD definitions.
 
 Our compiler defines __FreeBSD_kernel__ as an empty macro, I don't
 expect this will change because unlike with FreeBSD, on Debian there
 are strong technical limitations to making it a number.
 
 If __FreeBSD_kernel__ is to be defined in FreeBSD land, may I suggest
 that it is defined as an empty macro as well? This covers the vast
 majority of cases (e.g. like the ones in my initial patch which
 started this discussion), and it doesn't preclude the possibility that
 this macro becomes a number without breaking backward compatibility.
 
 OTOH once you define it as a number, it becomes relevant whether you
 did it with #ifndef or with #undef, and so you have to commit to it.
 
 Just to make it clear: It's no problem to me if it's defined as a
 number, but it doesn't help much either. At least from Debian POV it's
 not worth making a big argument about. It could be a good idea from
 FreeBSD POV, but please only do this if it's useful to FreeBSD.
 
I am fine with __FreeBSD_kernel being empty, please submit the patch.


pgpwSRj8FEROM.pgp
Description: PGP signature

RLIMIT_DATA and malloc(3) use of mmap(2)

2011-11-22 Thread Kostik Belousov

I was reminded about the patch I wrote for Igor Sysoev some time ago.
The issue the patch tries to handle is that jemalloc uses mmap() instead
of sbrk() for pages allocation, and thus RLIMIT_DATA limit is no longer
effective to put the bounds on the process heap. Since reverting to sbrk
for such purpose is worse then the issue itself, I proposed a solution
of 'self-restricting malloc'.

The patch adds a flag MAP_DATALIMIT to the flags argument of mmap(2),
which instructs the system to account the mapping in the RLIMIT_DATA
resource count. The malloc(3) also gets new option 'L' to enable
passing MAP_DATALIMIT to mmap() when allocating pages. By default,
the 'L' option is not activated.

Now, if user wants to ensure that process heap is restricted by the
ulimit -d and still use mmap() for jemalloc, he supplies the option
using any mechanism. The behaviour is voluntaristic, to prevent the
trashing use RLIMIT_SWAP.

Do people consider the facility useful ? Any comments for the patch itself ?

http://people.freebsd.org/~kib/misc/map_datalimit.1.patch



pgpX3zgzJhYr5.pgp
Description: PGP signature

Re: RLIMIT_DATA and malloc(3) use of mmap(2)

2011-11-22 Thread Kostik Belousov

On Tue, Nov 22, 2011 at 07:43:57PM +0400, Maxim Dounin wrote:
 Hello!
 
 On Tue, Nov 22, 2011 at 02:44:10PM +0200, Kostik Belousov wrote:
 
  I was reminded about the patch I wrote for Igor Sysoev some time ago.
  The issue the patch tries to handle is that jemalloc uses mmap() instead
  of sbrk() for pages allocation, and thus RLIMIT_DATA limit is no longer
  effective to put the bounds on the process heap. Since reverting to sbrk
  for such purpose is worse then the issue itself, I proposed a solution
  of 'self-restricting malloc'.
 
 Just a little clarification for others: currently, there is no way 
 to *safely* limit memory usage of a process while using jemalloc 
 with mmap().
 
 The only thing available is RLIMIT_VMEM, but it's not safe as it 
 may be reached on stack grow (leaving no possibility for an 
 application to handle this).
 
  The patch adds a flag MAP_DATALIMIT to the flags argument of mmap(2),
  which instructs the system to account the mapping in the RLIMIT_DATA
  resource count. The malloc(3) also gets new option 'L' to enable
  passing MAP_DATALIMIT to mmap() when allocating pages. By default,
  the 'L' option is not activated.
  
  Now, if user wants to ensure that process heap is restricted by the
  ulimit -d and still use mmap() for jemalloc, he supplies the option
  using any mechanism. The behaviour is voluntaristic, to prevent the
  trashing use RLIMIT_SWAP.
  
  Do people consider the facility useful ?
 
 Yes, at least some way to safely limit process memory usage is 
 certainly needed.
 
 It's a bit sad this isn't enabled by default, but it's probably 
 too late for this.  RLIMIT_DATA was (almost) a noop for too long 
 and making it work again to limit all memory allocations will 
 break POLA.
 
  Any comments for the patch itself ?
  
  http://people.freebsd.org/~kib/misc/map_datalimit.1.patch
 
 Patch looks ok for me (though I'm not a VM expert).
 
 Another possible aproach would be to introduce separate anonymous 
 (private?) mmap limit, this will allow to do essentially the same 
 thing in a bit more consistent (IMHO) manner.

This is already done in some form as the per-user swap limit. Converting
swap limit to the per-process limit raises the architectural questions,
due to shadow chains of the backing vm objects. In particular, we have
to account the anonymous memory used for backing the changed pages
from the writeable private mapping of the files. Similar issues appear
due to fork().

Anyway, the patch needs testers before I will push it forward.


pgpJsHif0AqVL.pgp
Description: PGP signature

Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers

2011-11-21 Thread Kostik Belousov

On Mon, Nov 21, 2011 at 01:45:29PM +1100, Bruce Evans wrote:
 On Sun, 20 Nov 2011, Kostik Belousov wrote:
 
 On Sun, Nov 20, 2011 at 12:40:42PM +0100, Robert Millan wrote:
 On Sat, Nov 19, 2011 at 07:56:20PM +0200, Kostik Belousov wrote:
 I fully agree with an idea that compiler is not an authorative source
 of the knowledge of the FreeBSD version. Even more, I argue that we shall
 not rely on compiler for this at all. Ideally, we should be able to
 build FreeBSD using the stock compilers without local modifications.
 Thus relying on the symbols defined by compiler, and not the source
 is the thing to avoid and consistently remove.
 
 We must do this to be able to use third-party tooldchain for FreeBSD 
 builds.
 
 That said, why not define __FreeBSD_kernel as equal to __FreeBSD_version 
 ?
 And then make more strong wording about other systems that use the macro,
 e.g. remove 'may' from the kFreeBSD example.
 Also, please remove the smile from comment.
 
 Ok. New patch attached.
 
 And the last, question, why not do
 #ifndef __FreeBSD_kernel__
 #define __FreeBSD_kernel__ __FreeBSD_version
 #endif
 ?
 
 #undef is too big tools tool apply there, IMO.
 
 #ifndef is too big to apply here, IMO :-).  __FreeBSD_kernel__ is in the
 implementation namespace, so any previous definition of it is a bug.  The
 #ifndef breaks the warning for this bug.
FreeBSD does not need it at all. There are some implementations that
use FreeBSD kernel, and which could potentially benefit from providing
its own value for __FreeBSD_kernel. I was only tried to be nice for
the out-of-tree implementations, it boils down to kFreeBSD project
preferences.

 
 And why not use FreeBSD style?  In KNF, the fields are separated by
 tabs, not spaces.  In FreeBSD style, trailing underscores are not used
 for names in the implementation namespace, since they have no effect
 on namespaces.  The name __FreeBSD_version is an example of this.  Does
 existing practice require using the name with the trailing underscores?


pgpM8lL9ZnvAT.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-20 Thread Kostik Belousov

On Sun, Nov 20, 2011 at 05:37:33PM +0100, Attilio Rao wrote:
 2011/11/18 Attilio Rao atti...@freebsd.org:
  Please consider:
  http://www.freebsd.org/~attilio/mutexfileline2.patch
 
 This is now committed as r227758,227759, you can update your patch now.
Here is it.

diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index d592ac0..74e5126 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2843,6 +2843,34 @@ vm_page_test_dirty(vm_page_t m)
vm_page_dirty(m);
 }
 
+void
+vm_page_lock_KBI(vm_page_t m, const char *file, int line)
+{
+
+   mtx_lock_flags_(vm_page_lockptr(m), 0, file, line);
+}
+
+void
+vm_page_unlock_KBI(vm_page_t m, const char *file, int line)
+{
+
+   mtx_unlock_flags_(vm_page_lockptr(m), 0, file, line);
+}
+
+int
+vm_page_trylock_KBI(vm_page_t m, const char *file, int line)
+{
+
+   return (mtx_trylock_flags_(vm_page_lockptr(m), 0, file, line));
+}
+
+void
+vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line)
+{
+
+   mtx_assert_(vm_page_lockptr(m), a, file, line);
+}
+
 int so_zerocp_fullpage = 0;
 
 /*
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 151710d..fe0295b 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -218,11 +218,23 @@ extern struct vpglocks pa_lock[];
 
 #definePA_LOCK_ASSERT(pa, a)   mtx_assert(PA_LOCKPTR(pa), (a))
 
+#ifdef KLD_MODULE
+#definevm_page_lock(m) vm_page_lock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#definevm_page_unlock(m)   vm_page_unlock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#definevm_page_trylock(m)  vm_page_trylock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#ifdef INVARIANTS
+#definevm_page_lock_assert(m, a)   \
+vm_page_lock_assert_KBI((m), (a), LOCK_FILE, LOCK_LINE)
+#else
+#definevm_page_lock_assert(m, a)
+#endif
+#else  /* KLD_MODULE */
 #definevm_page_lockptr(m)  (PA_LOCKPTR(VM_PAGE_TO_PHYS((m
 #definevm_page_lock(m) mtx_lock(vm_page_lockptr((m)))
 #definevm_page_unlock(m)   mtx_unlock(vm_page_lockptr((m)))
 #definevm_page_trylock(m)  mtx_trylock(vm_page_lockptr((m)))
 #definevm_page_lock_assert(m, a)   
mtx_assert(vm_page_lockptr((m)), (a))
+#endif
 
 #definevm_page_queue_free_mtx  vm_page_queue_free_lock.data
 /*
@@ -405,6 +417,11 @@ void vm_page_cowfault (vm_page_t);
 int vm_page_cowsetup(vm_page_t);
 void vm_page_cowclear (vm_page_t);
 
+void vm_page_lock_KBI(vm_page_t m, const char *file, int line);
+void vm_page_unlock_KBI(vm_page_t m, const char *file, int line);
+int vm_page_trylock_KBI(vm_page_t m, const char *file, int line);
+void vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line);
+
 #ifdef INVARIANTS
 void vm_page_object_lock_assert(vm_page_t m);
 #defineVM_PAGE_OBJECT_LOCK_ASSERT(m)   vm_page_object_lock_assert(m)


pgpyNMDth73tZ.pgp
Description: PGP signature

Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers

2011-11-20 Thread Kostik Belousov

On Sun, Nov 20, 2011 at 12:40:42PM +0100, Robert Millan wrote:
 On Sat, Nov 19, 2011 at 07:56:20PM +0200, Kostik Belousov wrote:
  I fully agree with an idea that compiler is not an authorative source
  of the knowledge of the FreeBSD version. Even more, I argue that we shall
  not rely on compiler for this at all. Ideally, we should be able to
  build FreeBSD using the stock compilers without local modifications.
  Thus relying on the symbols defined by compiler, and not the source
  is the thing to avoid and consistently remove.
  
  We must do this to be able to use third-party tooldchain for FreeBSD builds.
  
  That said, why not define __FreeBSD_kernel as equal to __FreeBSD_version ?
  And then make more strong wording about other systems that use the macro,
  e.g. remove 'may' from the kFreeBSD example.
  Also, please remove the smile from comment.
 
 Ok. New patch attached.

And the last, question, why not do
#ifndef __FreeBSD_kernel__
#define __FreeBSD_kernel__ __FreeBSD_version
#endif
?

#undef is too big tools tool apply there, IMO.


pgp1sbRkFtD8N.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-20 Thread Kostik Belousov

On Sun, Nov 20, 2011 at 07:02:14PM +0100, Attilio Rao wrote:
 2011/11/20 Kostik Belousov kostik...@gmail.com:
  +#define        vm_page_lock_assert(m, a)       \
  +    vm_page_lock_assert_KBI((m), (a), LOCK_FILE, LOCK_LINE)
 
 I think you should put the \ in the last tab and also, for
 consistency, you may want to use __FILE__ and __LINE__ for assert (or
 maybe I should also switch mutex.h to use LOCK_FILE and LOCK_LINE at
 some point?).
I never saw the requirement for the backslash. I am consistent with
PA_UNLOCK_COND() several lines above.

Changed assert to use __FILE/LINE__.

 
  +#else
  +#define        vm_page_lock_assert(m, a)
  +#endif
  +#else  /* KLD_MODULE */
 
 This should be /* !KLD_MODULE */, I guess?
Changed.

 
   #define        vm_page_lockptr(m)
 
 This is not defined for the KLD_MODULE case?
Yes, explicitely. This was discussed.
http://lists.freebsd.org/pipermail/freebsd-current/2011-November/029009.html

diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index d592ac0..74e5126 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2843,6 +2843,34 @@ vm_page_test_dirty(vm_page_t m)
vm_page_dirty(m);
 }
 
+void
+vm_page_lock_KBI(vm_page_t m, const char *file, int line)
+{
+
+   mtx_lock_flags_(vm_page_lockptr(m), 0, file, line);
+}
+
+void
+vm_page_unlock_KBI(vm_page_t m, const char *file, int line)
+{
+
+   mtx_unlock_flags_(vm_page_lockptr(m), 0, file, line);
+}
+
+int
+vm_page_trylock_KBI(vm_page_t m, const char *file, int line)
+{
+
+   return (mtx_trylock_flags_(vm_page_lockptr(m), 0, file, line));
+}
+
+void
+vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line)
+{
+
+   mtx_assert_(vm_page_lockptr(m), a, file, line);
+}
+
 int so_zerocp_fullpage = 0;
 
 /*
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 151710d..1fab735 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -218,11 +218,23 @@ extern struct vpglocks pa_lock[];
 
 #definePA_LOCK_ASSERT(pa, a)   mtx_assert(PA_LOCKPTR(pa), (a))
 
+#ifdef KLD_MODULE
+#definevm_page_lock(m) vm_page_lock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#definevm_page_unlock(m)   vm_page_unlock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#definevm_page_trylock(m)  vm_page_trylock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#ifdef INVARIANTS
+#definevm_page_lock_assert(m, a)   \
+vm_page_lock_assert_KBI((m), (a), __FILE__, __LINE__)
+#else
+#definevm_page_lock_assert(m, a)
+#endif
+#else  /* !KLD_MODULE */
 #definevm_page_lockptr(m)  (PA_LOCKPTR(VM_PAGE_TO_PHYS((m
 #definevm_page_lock(m) mtx_lock(vm_page_lockptr((m)))
 #definevm_page_unlock(m)   mtx_unlock(vm_page_lockptr((m)))
 #definevm_page_trylock(m)  mtx_trylock(vm_page_lockptr((m)))
 #definevm_page_lock_assert(m, a)   
mtx_assert(vm_page_lockptr((m)), (a))
+#endif
 
 #definevm_page_queue_free_mtx  vm_page_queue_free_lock.data
 /*
@@ -405,6 +417,11 @@ void vm_page_cowfault (vm_page_t);
 int vm_page_cowsetup(vm_page_t);
 void vm_page_cowclear (vm_page_t);
 
+void vm_page_lock_KBI(vm_page_t m, const char *file, int line);
+void vm_page_unlock_KBI(vm_page_t m, const char *file, int line);
+int vm_page_trylock_KBI(vm_page_t m, const char *file, int line);
+void vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line);
+
 #ifdef INVARIANTS
 void vm_page_object_lock_assert(vm_page_t m);
 #defineVM_PAGE_OBJECT_LOCK_ASSERT(m)   vm_page_object_lock_assert(m)


pgpRqyzyLnGBT.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-20 Thread Kostik Belousov

On Sun, Nov 20, 2011 at 08:04:21PM +0100, Attilio Rao wrote:
 This other patch converts sx to a similar interface which cleans up vm_map.c:
 http://www.freebsd.org/~attilio/sxfileline.patch
 
 What do you think about it?

This one only changes the KBI ? Note that _sx suffix is not reserved.


pgpCdjEU3nGge.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-20 Thread Kostik Belousov

On Sun, Nov 20, 2011 at 08:22:38PM +0100, Attilio Rao wrote:
 2011/11/20 Kostik Belousov kostik...@gmail.com:
  On Sun, Nov 20, 2011 at 08:04:21PM +0100, Attilio Rao wrote:
  This other patch converts sx to a similar interface which cleans up 
  vm_map.c:
  http://www.freebsd.org/~attilio/sxfileline.patch
 
  What do you think about it?
 
  This one only changes the KBI ? Note that _sx suffix is not reserved.
 
 In which sense?
 If you want to keep the shim support for KLD (thus the hard path) you
 will always need to keep an hard function and thus you still need a
 macro acting as a gate between the 'hard function' (or KLD version, if
 you prefer) and the fast case, that is where the _ suffix came from.

As I see, right now kernel exports e.g. _sx_try_slock() for the hard path.
After the patch, it will export sx_try_slock_() for the same purpose.
The old modules, which call _sx_try_slock(), cannot be loaded into
the patched kernel. Am I reading the patch wrong ?


pgperrxtbJjmU.pgp
Description: PGP signature

Re: [PATCH] Detect GNU/kFreeBSD in user-visible kernel headers

2011-11-19 Thread Kostik Belousov

On Sat, Nov 19, 2011 at 10:32:50AM +0100, Robert Millan wrote:
 2011/11/18 Robert Millan r...@freebsd.org:
  2011/11/17 John Baldwin j...@freebsd.org:
  Hmm, I wonder if it's better to use the #ifndef approach rather than #undef
  so that when compilers are updated to DTRT we will honor their settings?
 
  Well, the compiler is supposed to know better than sys/param.h,
 
 I gave this a bit more thought
 
 The compiler knows about the system it was intended to build for, but
 sys/param.h *is* part of the system we're building for.  It's
 impossible for sys/param.h to have the wrong information unless it's
 out of sync with the rest of the headers.
 
 As for the compiler, on FreeBSD it's hard for the compiler to be out
 of sync, because the system (in comparison with others) is tightly
 packaed and upgrade procedures are well defined.
 
 But if you take Debian, for example, we use the same compiler to build
 8-STABLE, 9-STABLE and 10-CURRENT kernels.  The compiler has no idea
 which version of FreeBSD the sources it is building come from.
 
 I wouldn't put compilers in general in a position where they're forced
 to know the FreeBSD major version beforehand because sys/param.h will
 give preference to the information coming from compiler.
 
 I propose this alternate patch which derives the major number from
 __FreeBSD_version instead.

I fully agree with an idea that compiler is not an authorative source
of the knowledge of the FreeBSD version. Even more, I argue that we shall
not rely on compiler for this at all. Ideally, we should be able to
build FreeBSD using the stock compilers without local modifications.
Thus relying on the symbols defined by compiler, and not the source
is the thing to avoid and consistently remove.

We must do this to be able to use third-party tooldchain for FreeBSD builds.

That said, why not define __FreeBSD_kernel as equal to __FreeBSD_version ?
And then make more strong wording about other systems that use the macro,
e.g. remove 'may' from the kFreeBSD example.
Also, please remove the smile from comment.




pgpkjFIJ5nrvj.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-18 Thread Kostik Belousov

On Fri, Nov 18, 2011 at 11:40:28AM +0100, Attilio Rao wrote:
 2011/11/16 Kostik Belousov kostik...@gmail.com:
  On Tue, Nov 15, 2011 at 07:15:01PM +0100, Attilio Rao wrote:
  2011/11/7 Kostik Belousov kostik...@gmail.com:
   On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote:
   Ok.  I'll offer one final suggestion.  Please consider an alternative
   suffix to func.  Perhaps, kbi or KBI.  In other words, something
   that hints at the function's reason for existing.
  
   Sure. Below is the extraction of only vm_page_lock() bits, together
   with the suggested rename. When Attilio provides the promised 
   simplification
   of the mutex KPI, this can be reduced.
 
  My tentative patch is here:
  http://www.freebsd.org/~attilio/mutexfileline.patch
 
  I need to make more compile testing later, but it already compiles
  GENERIC + modules fine on HEAD.
 
  The patch provides a common entrypoint, option independent, for both
  fast case and debug/compat case.
  Additively, it almost entirely fixes the standard violation of the
  reserved namespace, as you described (the notable exception being the
  macro used in the fast path, that I want to fix as well, but in a
  separate commit).
 
  Now the file/line couplet can be passed to the _ suffix variant of
  the flag functions.
  Yes, this is exactly KPI that I would use when available for the
  vm_page_lock() patch.
 
 
  eadler@ reviewed the mutex.h comment.
 
  Please let me know what you think about it, as long as we agree on the
  patch I'll commit it.
  But I also agree with John that imposing large churn due to the elimination
  of the '__' prefix is too late now. At least it will make the change
  non-MFCable. Besides, we already lived with the names for 10+ years.
 
  I will be happy to have the part of the patch that exports the mtx_XXX_(mtx,
  file, line) defines which can be used without taking care of LOCK_DEBUG
  or MUTEX_NOINLINE in the consumer code.
 
 Ok, this patch should just add the compat stub:
 http://www.freebsd.org/~attilio/mutexfileline2.patch
Am I right that I would use mtx_lock_(mtx, file, line) etc ?
If yes, I am fine with it.

 
 I'll make more test-compiling later in the day, if you agree on it I
 will commit the patch tomorrow.
 
 Attilio
 
 
 -- 
 Peace can only be achieved by understanding - A. Einstein


pgp0D7fXhQgea.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-18 Thread Kostik Belousov

On Fri, Nov 18, 2011 at 11:56:55AM +0100, Attilio Rao wrote:
 2011/11/18 Kostik Belousov kostik...@gmail.com:
  On Fri, Nov 18, 2011 at 11:40:28AM +0100, Attilio Rao wrote:
  2011/11/16 Kostik Belousov kostik...@gmail.com:
   On Tue, Nov 15, 2011 at 07:15:01PM +0100, Attilio Rao wrote:
   2011/11/7 Kostik Belousov kostik...@gmail.com:
On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote:
Ok.  I'll offer one final suggestion.  Please consider an alternative
suffix to func.  Perhaps, kbi or KBI.  In other words, 
something
that hints at the function's reason for existing.
   
Sure. Below is the extraction of only vm_page_lock() bits, together
with the suggested rename. When Attilio provides the promised 
simplification
of the mutex KPI, this can be reduced.
  
   My tentative patch is here:
   http://www.freebsd.org/~attilio/mutexfileline.patch
  
   I need to make more compile testing later, but it already compiles
   GENERIC + modules fine on HEAD.
  
   The patch provides a common entrypoint, option independent, for both
   fast case and debug/compat case.
   Additively, it almost entirely fixes the standard violation of the
   reserved namespace, as you described (the notable exception being the
   macro used in the fast path, that I want to fix as well, but in a
   separate commit).
  
   Now the file/line couplet can be passed to the _ suffix variant of
   the flag functions.
   Yes, this is exactly KPI that I would use when available for the
   vm_page_lock() patch.
  
  
   eadler@ reviewed the mutex.h comment.
  
   Please let me know what you think about it, as long as we agree on the
   patch I'll commit it.
   But I also agree with John that imposing large churn due to the 
   elimination
   of the '__' prefix is too late now. At least it will make the change
   non-MFCable. Besides, we already lived with the names for 10+ years.
  
   I will be happy to have the part of the patch that exports the 
   mtx_XXX_(mtx,
   file, line) defines which can be used without taking care of LOCK_DEBUG
   or MUTEX_NOINLINE in the consumer code.
 
  Ok, this patch should just add the compat stub:
  http://www.freebsd.org/~attilio/mutexfileline2.patch
  Am I right that I would use mtx_lock_(mtx, file, line) etc ?
  If yes, I am fine with it.
 
 Yes that is correct.
 
 However, I'm a bit confused on one aspect: would you mind using
 _mtx_lock_flags() instead?
 If you don't mind the underscore namespace violation I think I can
 make a much smaller patch against HEAD for it.
_mtx_lock_flags() is fine. The reserved names start with __ or _[A-Z].

 
 Otherwise, the one now posted should be ok.
 
 Attilio
 
 
 -- 
 Peace can only be achieved by understanding - A. Einstein


pgpMbNbc9HZKI.pgp
Description: PGP signature

Re: Stop scheduler on panic

2011-11-17 Thread Kostik Belousov

On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote:
 On 17.11.2011 00:21, Andriy Gapon wrote:
 on 16/11/2011 21:27 Fabian Keil said the following:
 Kostik Belousovkostik...@gmail.com  wrote:
 
 I was tricked into finishing the work by Andrey Gapon, who developed
 the patch to reliably stop other processors on panic.  The patch
 greatly improves the chances of getting dump on panic on SMP host.
 
 I tested the patch trying to get a dump (from the debugger) for
 kern/162036, which currently results in the double fault reported in:
 http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html
 
 It didn't help, but also didn't make anything worse.
 
 Fabian
 
 The mi_switch recursion looks very familiar to me:
 mi_switch() at mi_switch+0x270
 critical_exit() at critical_exit+0x9b
 spinlock_exit() at spinlock_exit+0x17
 mi_switch() at mi_switch+0x275
 critical_exit() at critical_exit+0x9b
 spinlock_exit() at spinlock_exit+0x17
 [several pages of the previous three lines skipped]
 mi_switch() at mi_switch+0x275
 critical_exit() at critical_exit+0x9b
 spinlock_exit() at spinlock_exit+0x17
 intr_even_schedule_thread() at intr_event_schedule_thread+0xbb
 ahci_end_transaction() at ahci_end_transaction+0x398
 ahci_ch_intr() at ahci_ch_intr+0x2b5
 ahcipoll() at ahcipoll+0x15
 xpt_polled_action() at xpt_polled_action+0xf7
 
 In fact I once discussed with jhb this recursion triggered from a different
 place.  To quote myself:
 avgspinlock_exit -  critical_exit -  mi_switch -  kdb_switch -
 thread_unlock -  spinlock_exit -  critical_exit -  mi_switch -  ...
 avgin the kdb context
 avgthis issue seems to be triggered by td_owepreempt being true at 
 the time
 kdb is entered
 avgand there of course has to be an initial spinlock_exit call 
 somewhere
 avgin my case it's because of usb keyboard
 avgI wonder if it would make sense to clear td_owepreempt right 
 before
 calling kdb_switch in mi_switch
 avginstead of in sched_switch()
 avgclearing td_owepreempt seems like a scheduler-independent 
 operation to me
 avgor is it better to just skip locking in usb when kdb_active is set
 avg?
 
 The workaround described above should work in this case.
 Another possibility is to pessimize mtx_unlock_spin() implementations to 
 check
 SCHEDULER_STOPPED() and to bypass any further actions in that case.  But 
 that
 would add unnecessary overhead to the sunny day code paths.
 
 Going further up the stack one can come up with the following proposals:
 - check SCHEDULER_STOPPED() swi_sched() and return early
 - do not call swi_sched() from xpt_done() if we somehow know that we are 
 in a
 polling mode
 
 There is no flag in CAM now to indicate polling mode, but if needed, it 
 should not be difficult to add one and not call swi_sched().

I have the following change for eons on my test boxes. Without it,
I simply cannot get _any_ dump.

diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c
index 10b89c7..a38e42f 100644
--- a/sys/cam/cam_xpt.c
+++ b/sys/cam/cam_xpt.c
@@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb)
TAILQ_INSERT_TAIL(cam_simq, sim, links);
mtx_unlock(cam_simq_lock);
sim-flags |= CAM_SIM_ON_DONEQ;
-   if (first)
+   if (first  panicstr == NULL)
swi_sched(cambio_ih, 0);
}
}


pgp3ESGBc0NMm.pgp
Description: PGP signature

Re: Stop scheduler on panic

2011-11-17 Thread Kostik Belousov

On Thu, Nov 17, 2011 at 10:40:58AM +0200, Alexander Motin wrote:
 On 11/17/11 10:15, Kostik Belousov wrote:
  On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote:
  On 17.11.2011 00:21, Andriy Gapon wrote:
  on 16/11/2011 21:27 Fabian Keil said the following:
  Kostik Belousovkostik...@gmail.com  wrote:
 
  I was tricked into finishing the work by Andrey Gapon, who developed
  the patch to reliably stop other processors on panic.  The patch
  greatly improves the chances of getting dump on panic on SMP host.
 
  I tested the patch trying to get a dump (from the debugger) for
  kern/162036, which currently results in the double fault reported in:
  http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html
 
  It didn't help, but also didn't make anything worse.
 
  Fabian
 
  The mi_switch recursion looks very familiar to me:
  mi_switch() at mi_switch+0x270
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  mi_switch() at mi_switch+0x275
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  [several pages of the previous three lines skipped]
  mi_switch() at mi_switch+0x275
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  intr_even_schedule_thread() at intr_event_schedule_thread+0xbb
  ahci_end_transaction() at ahci_end_transaction+0x398
  ahci_ch_intr() at ahci_ch_intr+0x2b5
  ahcipoll() at ahcipoll+0x15
  xpt_polled_action() at xpt_polled_action+0xf7
 
  In fact I once discussed with jhb this recursion triggered from a 
  different
  place.  To quote myself:
  avgspinlock_exit -  critical_exit -  mi_switch -  kdb_switch -
  thread_unlock -  spinlock_exit -  critical_exit -  mi_switch -  ...
  avgin the kdb context
  avgthis issue seems to be triggered by td_owepreempt being true at 
  the time
  kdb is entered
  avgand there of course has to be an initial spinlock_exit call 
  somewhere
  avgin my case it's because of usb keyboard
  avgI wonder if it would make sense to clear td_owepreempt right 
  before
  calling kdb_switch in mi_switch
  avginstead of in sched_switch()
  avgclearing td_owepreempt seems like a scheduler-independent 
  operation to me
  avgor is it better to just skip locking in usb when kdb_active is 
  set
  avg?
 
  The workaround described above should work in this case.
  Another possibility is to pessimize mtx_unlock_spin() implementations to 
  check
  SCHEDULER_STOPPED() and to bypass any further actions in that case.  But 
  that
  would add unnecessary overhead to the sunny day code paths.
 
  Going further up the stack one can come up with the following proposals:
  - check SCHEDULER_STOPPED() swi_sched() and return early
  - do not call swi_sched() from xpt_done() if we somehow know that we are 
  in a
  polling mode
 
  There is no flag in CAM now to indicate polling mode, but if needed, it 
  should not be difficult to add one and not call swi_sched().
  
  I have the following change for eons on my test boxes. Without it,
  I simply cannot get _any_ dump.
  
  diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c
  index 10b89c7..a38e42f 100644
  --- a/sys/cam/cam_xpt.c
  +++ b/sys/cam/cam_xpt.c
  @@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb)
  TAILQ_INSERT_TAIL(cam_simq, sim, links);
  mtx_unlock(cam_simq_lock);
  sim-flags |= CAM_SIM_ON_DONEQ;
  -   if (first)
  +   if (first  panicstr == NULL)
  swi_sched(cambio_ih, 0);
  }
  }
 
 That should be OK for kernel dumping. I was thinking about CAM abusing
 polling not only for dumping. But looking on cases where it does it now,
 may be it is better to rewrite them instead.

So, should I interpret your response as 'Reviewed by ?


pgplu0xBQRD1h.pgp
Description: PGP signature

Re: Stop scheduler on panic

2011-11-17 Thread Kostik Belousov

On Thu, Nov 17, 2011 at 11:29:02AM +0200, Alexander Motin wrote:
 On 11/17/11 11:06, Kostik Belousov wrote:
  On Thu, Nov 17, 2011 at 10:40:58AM +0200, Alexander Motin wrote:
  On 11/17/11 10:15, Kostik Belousov wrote:
  On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote:
  On 17.11.2011 00:21, Andriy Gapon wrote:
  on 16/11/2011 21:27 Fabian Keil said the following:
  Kostik Belousovkostik...@gmail.com  wrote:
 
  I was tricked into finishing the work by Andrey Gapon, who developed
  the patch to reliably stop other processors on panic.  The patch
  greatly improves the chances of getting dump on panic on SMP host.
 
  I tested the patch trying to get a dump (from the debugger) for
  kern/162036, which currently results in the double fault reported in:
  http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html
 
  It didn't help, but also didn't make anything worse.
 
  Fabian
 
  The mi_switch recursion looks very familiar to me:
  mi_switch() at mi_switch+0x270
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  mi_switch() at mi_switch+0x275
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  [several pages of the previous three lines skipped]
  mi_switch() at mi_switch+0x275
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  intr_even_schedule_thread() at intr_event_schedule_thread+0xbb
  ahci_end_transaction() at ahci_end_transaction+0x398
  ahci_ch_intr() at ahci_ch_intr+0x2b5
  ahcipoll() at ahcipoll+0x15
  xpt_polled_action() at xpt_polled_action+0xf7
 
  In fact I once discussed with jhb this recursion triggered from a 
  different
  place.  To quote myself:
  avgspinlock_exit -  critical_exit -  mi_switch -  kdb_switch -
  thread_unlock -  spinlock_exit -  critical_exit -  mi_switch -  ...
  avgin the kdb context
  avgthis issue seems to be triggered by td_owepreempt being true 
  at 
  the time
  kdb is entered
  avgand there of course has to be an initial spinlock_exit call 
  somewhere
  avgin my case it's because of usb keyboard
  avgI wonder if it would make sense to clear td_owepreempt right 
  before
  calling kdb_switch in mi_switch
  avginstead of in sched_switch()
  avgclearing td_owepreempt seems like a scheduler-independent 
  operation to me
  avgor is it better to just skip locking in usb when kdb_active is 
  set
  avg?
 
  The workaround described above should work in this case.
  Another possibility is to pessimize mtx_unlock_spin() implementations 
  to 
  check
  SCHEDULER_STOPPED() and to bypass any further actions in that case.  
  But 
  that
  would add unnecessary overhead to the sunny day code paths.
 
  Going further up the stack one can come up with the following proposals:
  - check SCHEDULER_STOPPED() swi_sched() and return early
  - do not call swi_sched() from xpt_done() if we somehow know that we 
  are 
  in a
  polling mode
 
  There is no flag in CAM now to indicate polling mode, but if needed, it 
  should not be difficult to add one and not call swi_sched().
 
  I have the following change for eons on my test boxes. Without it,
  I simply cannot get _any_ dump.
 
  diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c
  index 10b89c7..a38e42f 100644
  --- a/sys/cam/cam_xpt.c
  +++ b/sys/cam/cam_xpt.c
  @@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb)
TAILQ_INSERT_TAIL(cam_simq, sim, links);
mtx_unlock(cam_simq_lock);
sim-flags |= CAM_SIM_ON_DONEQ;
  - if (first)
  + if (first  panicstr == NULL)
swi_sched(cambio_ih, 0);
}
}
 
  That should be OK for kernel dumping. I was thinking about CAM abusing
  polling not only for dumping. But looking on cases where it does it now,
  may be it is better to rewrite them instead.
  
  So, should I interpret your response as 'Reviewed by ?
 
 It feels somehow dirty to me. I don't like these global variables. If
 you consider it is fine, proceed, I see no much harm. But if not, I can
 add polling flag to the CAM. Flip a coin for me. :)
You promised to add the polling at summer' meeting in Kiev. Will you do
it now ?


pgpEdTAw2tcfa.pgp
Description: PGP signature

Re: Stop scheduler on panic

2011-11-17 Thread Kostik Belousov

On Thu, Nov 17, 2011 at 02:40:45PM +0200, Alexander Motin wrote:
 On 11/17/11 13:14, Kostik Belousov wrote:
  On Thu, Nov 17, 2011 at 11:29:02AM +0200, Alexander Motin wrote:
  On 11/17/11 11:06, Kostik Belousov wrote:
  On Thu, Nov 17, 2011 at 10:40:58AM +0200, Alexander Motin wrote:
  On 11/17/11 10:15, Kostik Belousov wrote:
  On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote:
  On 17.11.2011 00:21, Andriy Gapon wrote:
  on 16/11/2011 21:27 Fabian Keil said the following:
  Kostik Belousovkostik...@gmail.com  wrote:
 
  I was tricked into finishing the work by Andrey Gapon, who developed
  the patch to reliably stop other processors on panic.  The patch
  greatly improves the chances of getting dump on panic on SMP host.
 
  I tested the patch trying to get a dump (from the debugger) for
  kern/162036, which currently results in the double fault reported in:
  http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html
 
  It didn't help, but also didn't make anything worse.
 
  Fabian
 
  The mi_switch recursion looks very familiar to me:
  mi_switch() at mi_switch+0x270
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  mi_switch() at mi_switch+0x275
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  [several pages of the previous three lines skipped]
  mi_switch() at mi_switch+0x275
  critical_exit() at critical_exit+0x9b
  spinlock_exit() at spinlock_exit+0x17
  intr_even_schedule_thread() at intr_event_schedule_thread+0xbb
  ahci_end_transaction() at ahci_end_transaction+0x398
  ahci_ch_intr() at ahci_ch_intr+0x2b5
  ahcipoll() at ahcipoll+0x15
  xpt_polled_action() at xpt_polled_action+0xf7
 
  In fact I once discussed with jhb this recursion triggered from a 
  different
  place.  To quote myself:
  avgspinlock_exit -  critical_exit -  mi_switch -  kdb_switch 
  -
  thread_unlock -  spinlock_exit -  critical_exit -  mi_switch -  
  ...
  avgin the kdb context
  avgthis issue seems to be triggered by td_owepreempt being true 
  at 
  the time
  kdb is entered
  avgand there of course has to be an initial spinlock_exit call 
  somewhere
  avgin my case it's because of usb keyboard
  avgI wonder if it would make sense to clear td_owepreempt right 
  before
  calling kdb_switch in mi_switch
  avginstead of in sched_switch()
  avgclearing td_owepreempt seems like a scheduler-independent 
  operation to me
  avgor is it better to just skip locking in usb when kdb_active 
  is set
  avg?
 
  The workaround described above should work in this case.
  Another possibility is to pessimize mtx_unlock_spin() implementations 
  to 
  check
  SCHEDULER_STOPPED() and to bypass any further actions in that case.  
  But 
  that
  would add unnecessary overhead to the sunny day code paths.
 
  Going further up the stack one can come up with the following 
  proposals:
  - check SCHEDULER_STOPPED() swi_sched() and return early
  - do not call swi_sched() from xpt_done() if we somehow know that we 
  are 
  in a
  polling mode
 
  There is no flag in CAM now to indicate polling mode, but if needed, 
  it 
  should not be difficult to add one and not call swi_sched().
 
  I have the following change for eons on my test boxes. Without it,
  I simply cannot get _any_ dump.
 
  diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c
  index 10b89c7..a38e42f 100644
  --- a/sys/cam/cam_xpt.c
  +++ b/sys/cam/cam_xpt.c
  @@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb)
  TAILQ_INSERT_TAIL(cam_simq, sim, links);
  mtx_unlock(cam_simq_lock);
  sim-flags |= CAM_SIM_ON_DONEQ;
  -   if (first)
  +   if (first  panicstr == NULL)
  swi_sched(cambio_ih, 0);
  }
  }
 
  That should be OK for kernel dumping. I was thinking about CAM abusing
  polling not only for dumping. But looking on cases where it does it now,
  may be it is better to rewrite them instead.
 
  So, should I interpret your response as 'Reviewed by ?
 
  It feels somehow dirty to me. I don't like these global variables. If
  you consider it is fine, proceed, I see no much harm. But if not, I can
  add polling flag to the CAM. Flip a coin for me. :)
  You promised to add the polling at summer' meeting in Kiev. Will you do
  it now ?
 
 Sorry, I've probably forgot. The patch is attached.
 
 -- 
 Alexander Motin

 Index: cam_sim.h
 ===
 --- cam_sim.h (revision 227580)
 +++ cam_sim.h (working copy)
 @@ -104,7 +104,8 @@
   u_int32_t   flags;
  #define  CAM_SIM_REL_TIMEOUT_PENDING 0x01
  #define  CAM_SIM_MPSAFE  0x02
 -#define CAM_SIM_ON_DONEQ 0x04
 +#define  CAM_SIM_ON_DONEQ0x04
 +#define  CAM_SIM_POLLED  0x08

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-16 Thread Kostik Belousov

On Tue, Nov 15, 2011 at 07:15:01PM +0100, Attilio Rao wrote:
 2011/11/7 Kostik Belousov kostik...@gmail.com:
  On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote:
  Ok.  I'll offer one final suggestion.  Please consider an alternative
  suffix to func.  Perhaps, kbi or KBI.  In other words, something
  that hints at the function's reason for existing.
 
  Sure. Below is the extraction of only vm_page_lock() bits, together
  with the suggested rename. When Attilio provides the promised simplification
  of the mutex KPI, this can be reduced.
 
 My tentative patch is here:
 http://www.freebsd.org/~attilio/mutexfileline.patch
 
 I need to make more compile testing later, but it already compiles
 GENERIC + modules fine on HEAD.
 
 The patch provides a common entrypoint, option independent, for both
 fast case and debug/compat case.
 Additively, it almost entirely fixes the standard violation of the
 reserved namespace, as you described (the notable exception being the
 macro used in the fast path, that I want to fix as well, but in a
 separate commit).
 
 Now the file/line couplet can be passed to the _ suffix variant of
 the flag functions.
Yes, this is exactly KPI that I would use when available for the
vm_page_lock() patch.

 
 eadler@ reviewed the mutex.h comment.
 
 Please let me know what you think about it, as long as we agree on the
 patch I'll commit it.
But I also agree with John that imposing large churn due to the elimination
of the '__' prefix is too late now. At least it will make the change
non-MFCable. Besides, we already lived with the names for 10+ years.

I will be happy to have the part of the patch that exports the mtx_XXX_(mtx,
file, line) defines which can be used without taking care of LOCK_DEBUG
or MUTEX_NOINLINE in the consumer code.


pgpOwIEH4H9gF.pgp
Description: PGP signature

Re: [RFC] Enable nxstack by default

2011-11-16 Thread Kostik Belousov

On Wed, Nov 16, 2011 at 01:09:18AM +0100, Oliver Pinter wrote:
 On 11/15/11, Jeremie Le Hen jere...@le-hen.org wrote:
  Hi,
 
  On Wed, Oct 19, 2011 at 12:37:44AM +0200, Oliver Pinter wrote:
  In NetBSD has been some PaX feature [0] implemented. (ASLR, W^X
  (~nxstack), mprotect restriction, veriexec, mmap randomization[2]...)
 
  [0] http://pax.grsecurity.net/docs/index.html
  [1] http://www.netbsd.org/~elad/recent/man/security.8.html
  [2] http://people.freebsd.org/~ssouhlal/testing/stackgap-20050527.diff
 
  Suleiman actually wrought two patches, one randomizing the stack (the
  one you pointed out) and another one randomizing non-fixed mmap(2)
  calls:
 
  http://people.freebsd.org/~ssouhlal/testing/mmap_random-20050528.diff
 
 
  FYI, they do not apply cleanly on recent source trees (the patches were
  made in 2005), but they can be applied with little fiddling.  I'm
  running multiple 8.x production machines with them without any problem.
 
 Yeah, I use thins patch in 7-STABLE and 9-STABLE too.
 Patch for 9-STABLE has attached.
One immediate issue, which is definitely not critical, is that the size
of the stack of main thread becomes chopped by the random amount of
bytes. This is not an issue for single-threaded process, because typical
default stack size is around 64M. For the threaded process, libthr cuts
the stack, see thr_init.c:init_main_thread(). There, the size of the
stack is 2 or 4MB, and 64KB might be more significant part of it.

Missed bit from the patch is some randomization at the load address
for the PIE (which is the main feature of ASLR, I suspect). See
imgact_elf.c:exec(), et_dyn_addr calculation.

Another missed bit is the similar modification for
freebsd32_copyout_strings().

The upper limit for the random offset for mmap() should be configurable
in the same way as stack gap, instead of the dump enable/disable knob.

There are numerous style violations in the patch, or rather, the patch
fully violates the style.

 
 
 
 
  I've always wanted them to be committed as opt-in knobs, but I can't
  remember why they hadn't at the time.
 
  Cheers,
  --
  Jeremie Le Hen
 
  Men are born free and equal.  Later on, they're on their own.
  Jean Yanne
 




pgp0wWrKbOxRT.pgp
Description: PGP signature

Re: [head tinderbox] failure on amd64/amd64

2011-11-16 Thread Kostik Belousov

On Wed, Nov 16, 2011 at 06:18:59PM +, FreeBSD Tinderbox wrote:
 TB --- 2011-11-16 15:30:00 - tinderbox 2.8 running on 
 freebsd-current.sentex.ca
 TB --- 2011-11-16 15:30:00 - starting HEAD tinderbox run for amd64/amd64
 TB --- 2011-11-16 15:30:00 - cleaning the object tree
 TB --- 2011-11-16 15:31:04 - cvsupping the source tree
 TB --- 2011-11-16 15:31:04 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup.sentex.ca 
 /tinderbox/HEAD/amd64/amd64/supfile
 TB --- 2011-11-16 15:31:16 - building world
 TB --- 2011-11-16 15:31:16 - CROSS_BUILD_TESTING=YES
 TB --- 2011-11-16 15:31:16 - MAKEOBJDIRPREFIX=/obj
 TB --- 2011-11-16 15:31:16 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
 TB --- 2011-11-16 15:31:16 - SRCCONF=/dev/null
 TB --- 2011-11-16 15:31:16 - TARGET=amd64
 TB --- 2011-11-16 15:31:16 - TARGET_ARCH=amd64
 TB --- 2011-11-16 15:31:16 - TZ=UTC
 TB --- 2011-11-16 15:31:16 - __MAKE_CONF=/dev/null
 TB --- 2011-11-16 15:31:16 - cd /src
 TB --- 2011-11-16 15:31:16 - /usr/bin/make -B buildworld
  World build started on Wed Nov 16 15:31:17 UTC 2011
  Rebuilding the temporary build tree
  stage 1.1: legacy release compatibility shims
  stage 1.2: bootstrap tools
  stage 2.1: cleaning up the object tree
  stage 2.2: rebuilding the object tree
  stage 2.3: build tools
  stage 3: cross tools
  stage 4.1: building includes
  stage 4.2: building libraries
  stage 4.3: make dependencies
  stage 4.4: building everything
  stage 5.1: building 32 bit shim libraries
  World build completed on Wed Nov 16 18:08:28 UTC 2011
 TB --- 2011-11-16 18:08:28 - generating LINT kernel config
 TB --- 2011-11-16 18:08:28 - cd /src/sys/amd64/conf
 TB --- 2011-11-16 18:08:28 - /usr/bin/make -B LINT
 TB --- 2011-11-16 18:08:28 - cd /src/sys/amd64/conf
 TB --- 2011-11-16 18:08:28 - /usr/sbin/config -m LINT-NOINET
 TB --- 2011-11-16 18:08:28 - building LINT-NOINET kernel
 TB --- 2011-11-16 18:08:28 - CROSS_BUILD_TESTING=YES
 TB --- 2011-11-16 18:08:28 - MAKEOBJDIRPREFIX=/obj
 TB --- 2011-11-16 18:08:28 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
 TB --- 2011-11-16 18:08:28 - SRCCONF=/dev/null
 TB --- 2011-11-16 18:08:28 - TARGET=amd64
 TB --- 2011-11-16 18:08:28 - TARGET_ARCH=amd64
 TB --- 2011-11-16 18:08:28 - TZ=UTC
 TB --- 2011-11-16 18:08:28 - __MAKE_CONF=/dev/null
 TB --- 2011-11-16 18:08:28 - cd /src
 TB --- 2011-11-16 18:08:28 - /usr/bin/make -B buildkernel KERNCONF=LINT-NOINET
  Kernel build for LINT-NOINET started on Wed Nov 16 18:08:28 UTC 2011
  stage 1: configuring the kernel
  stage 2.1: cleaning up the object tree
  stage 2.2: rebuilding the object tree
  stage 2.3: build tools
  stage 3.1: making dependencies
  stage 3.2: building everything
 [...]
 cc -c -O2 -frename-registers -pipe -fno-strict-aliasing  -std=c99  -Wall 
 -Wredundant-decls -Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes 
 -Wpointer-arith -Winline -Wcast-qual  -Wundef -Wno-pointer-sign 
 -fformat-extensions  -Wmissing-include-dirs -fdiagnostics-show-option 
 -nostdinc  -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL 
 -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common 
 -finline-limit=8000 --param inline-unit-growth=100 --param 
 large-function-growth=1000 -DGPROF -falign-functions=16 -DGPROF4 -DGUPROF 
 -fno-builtin -fno-omit-frame-pointer -mno-sse -mcmodel=kernel -mno-red-zone 
 -mno-mmx -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding 
 -fstack-protector -Werror -pg -mprofiler-epilogue 
 /src/sys/fs/procfs/procfs_type.c
 cc -c -O2 -frename-registers -pipe -fno-strict-aliasing  -std=c99  -Wall 
 -Wredundant-decls -Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes 
 -Wpointer-arith -Winline -Wcast-qual  -Wundef -Wno-pointer-sign 
 -fformat-extensions  -Wmissing-include-dirs -fdiagnostics-show-option 
 -nostdinc  -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL 
 -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common 
 -finline-limit=8000 --param inline-unit-growth=100 --param 
 large-function-growth=1000 -DGPROF -falign-functions=16 -DGPROF4 -DGUPROF 
 -fno-builtin -fno-omit-frame-pointer -mno-sse -mcmodel=kernel -mno-red-zone 
 -mno-mmx -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding 
 -fstack-protector -Werror -pg -mprofiler-epilogue 
 /src/sys/fs/pseudofs/pseudofs.c
 cc -c -O2 -frename-registers -pipe -fno-strict-aliasing  -std=c99  -Wall 
 -Wredundant-decls -Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes 
 -Wpointer-arith -Winline -Wcast-qual  -Wundef -Wno-pointer-sign 
 -fformat-extensions  -Wmissing-include-dirs -fdiagnostics-show-option 
 -nostdinc  -I. -I/src/sys -I/src/sys/contrib/altq -D_KERNEL 
 -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common 
 -finline-limit=8000 --param inline-unit-growth=100 --param 
 large-function-growth=1000 -DGPROF -falign-functions=16 -DGPROF4 -DGUPROF 
 -fno-builtin -fno-omit-frame-pointer -mno-sse -mcmodel=kernel -mno-red-zone 
 -mno-mmx -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding 
 -fstack-protector -Werror -pg -mprofiler-epilogue

Re: Stop scheduler on panic

2011-11-14 Thread Kostik Belousov

On Mon, Nov 14, 2011 at 12:06:48PM +0200, Andriy Gapon wrote:
 on 13/11/2011 10:32 Kostik Belousov said the following:
  I was tricked into finishing the work by Andrey Gapon, who developed the
  patch to reliably stop other processors on panic.  The patch greatly
  improves the chances of getting dump on panic on SMP host. Several people
  already saw the patchset, and I remember that Andrey posted it to some
  lists.
  
  The change stops other (*) processors early upon the panic.  This way, no
  parallel manipulation of the kernel memory is performed by CPUs. In
  particular, the kernel memory map is static.  Patch prevents the panic
  thread from blocking and switching out.
  
  * - in the context of the description, other means not current.
  
  Since other threads are not run anymore, lock owner cannot release a lock
  which is required by panic thread.  Due to this, we need to fake a lock
  acquisition after the panic, which adds minimal overhead to the locking
  cost. The patch tries to not add any overhead on the fast path of the lock
  acquire.  The check for the after-panic condition was reduced to single
  memory access, done only when the quick cas lock attempt failed, and braced
  with __unlikely compiler hint.
  
  For now, the new mode of operation is disabled by default, since some 
  further USB changes are needed to make USB keyboard usable in that 
  environment.
  
  With the patch, getting a dump from the machine without debugger compiled
  in is much more realistic.  Please comment, I will commit the change in 2
  weeks unless strong reasons not to are given.
  
  http://people.freebsd.org/~kib/misc/stop_cpus_on_panic.1.patch
  
 
 
 On a more serious note:
 - some code in my latest version of the patch was contributed by or was based
 on the code or ideas contributed by jhb and mdf (so that attributions are not
 lost)
Please provide me with proper attribution for contributors and testers.

 - there was a concern about how sync-on-panic would work
 
 About the latter, I have never really tested it. mdf has suggested to
 move the sync-on-panic code to a place after we ensure that there is
 only one CPU in panic(), but before we stop other CPUs.
sync_on_panic is incompatible with the patch. I argue that it provides
non-zero chance of damaging good filesystems even if panic was unrelated
to the fs/bio/device layer. As an example, consider the case when other
CPU was modifying in-memory representation of the metadata, and panic
happen on this CPU. If you write half-changed block back, you make more
damage to the filesystem then if you do not. The half-backed sync spoils
any journaling or SU consistency guarantees.

The issues of multithreading nature of our storage subsystem are secondary.

The user who sets either tunable shall know what he does.

 I think that I've already sent you a list of the early testers for
 various WIP versions of the patch.
I do not have the list.

BTW, if you want, feel free to handle the commit youself. You definitely
spent much more efforts on the stuff and deserve the credit.

I was promised in private that a review will be provided during this
weekend.


pgpyIteKouhHY.pgp
Description: PGP signature

Stop scheduler on panic

2011-11-13 Thread Kostik Belousov

I was tricked into finishing the work by Andrey Gapon, who developed
the patch to reliably stop other processors on panic.  The patch
greatly improves the chances of getting dump on panic on SMP host.
Several people already saw the patchset, and I remember that Andrey
posted it to some lists.

The change stops other (*) processors early upon the panic.  This way,
no parallel manipulation of the kernel memory is performed by CPUs.
In particular, the kernel memory map is static.  Patch prevents the
panic thread from blocking and switching out.

* - in the context of the description, other means not current.

Since other threads are not run anymore, lock owner cannot release a
lock which is required by panic thread.  Due to this, we need to fake
a lock acquisition after the panic, which adds minimal overhead to the
locking cost. The patch tries to not add any overhead on the fast path
of the lock acquire.  The check for the after-panic condition was
reduced to single memory access, done only when the quick cas lock
attempt failed, and braced with __unlikely compiler hint.

For now, the new mode of operation is disabled by default, since some
further USB changes are needed to make USB keyboard usable in that
environment.

With the patch, getting a dump from the machine without debugger
compiled in is much more realistic.  Please comment, I will commit the
change in 2 weeks unless strong reasons not to are given.

http://people.freebsd.org/~kib/misc/stop_cpus_on_panic.1.patch



pgpCmKhLFDhgT.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-07 Thread Kostik Belousov

On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote:
 Ok.  I'll offer one final suggestion.  Please consider an alternative 
 suffix to func.  Perhaps, kbi or KBI.  In other words, something 
 that hints at the function's reason for existing.

Sure. Below is the extraction of only vm_page_lock() bits, together
with the suggested rename. When Attilio provides the promised simplification
of the mutex KPI, this can be reduced.

diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index 389aea5..ea4ea34 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2677,6 +2677,44 @@ vm_page_test_dirty(vm_page_t m)
vm_page_dirty(m);
 }
 
+void
+vm_page_lock_KBI(vm_page_t m, const char *file, int line)
+{
+
+#if LOCK_DEBUG  0 || defined(MUTEX_NOINLINE)
+   _mtx_lock_flags(vm_page_lockptr(m), 0, file, line);
+#else
+   __mtx_lock(vm_page_lockptr(m), 0, file, line);
+#endif
+}
+
+void
+vm_page_unlock_KBI(vm_page_t m, const char *file, int line)
+{
+
+#if LOCK_DEBUG  0 || defined(MUTEX_NOINLINE)
+   _mtx_unlock_flags(vm_page_lockptr(m), 0, file, line);
+#else
+   __mtx_unlock(vm_page_lockptr(m), curthread, 0, file, line);
+#endif
+}
+
+int
+vm_page_trylock_KBI(vm_page_t m, const char *file, int line)
+{
+
+   return (_mtx_trylock(vm_page_lockptr(m), 0, file, line));
+}
+
+void
+vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line)
+{
+
+#ifdef INVARIANTS
+   _mtx_assert(vm_page_lockptr(m), a, file, line);
+#endif
+}
+
 int so_zerocp_fullpage = 0;
 
 /*
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 7099b70..a5604b7 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -218,11 +218,23 @@ extern struct vpglocks pa_lock[];
 
 #definePA_LOCK_ASSERT(pa, a)   mtx_assert(PA_LOCKPTR(pa), (a))
 
+#ifdef KLD_MODULE
+#definevm_page_lock(m) vm_page_lock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#definevm_page_unlock(m)   vm_page_unlock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#definevm_page_trylock(m)  vm_page_trylock_KBI((m), LOCK_FILE, 
LOCK_LINE)
+#ifdef INVARIANTS
+#definevm_page_lock_assert(m, a)   \
+vm_page_lock_assert_KBI((m), (a), LOCK_FILE, LOCK_LINE)
+#else
+#definevm_page_lock_assert(m, a)
+#endif
+#else  /* KLD_MODULE */
 #definevm_page_lockptr(m)  (PA_LOCKPTR(VM_PAGE_TO_PHYS((m
 #definevm_page_lock(m) mtx_lock(vm_page_lockptr((m)))
 #definevm_page_unlock(m)   mtx_unlock(vm_page_lockptr((m)))
 #definevm_page_trylock(m)  mtx_trylock(vm_page_lockptr((m)))
 #definevm_page_lock_assert(m, a)   
mtx_assert(vm_page_lockptr((m)), (a))
+#endif
 
 #definevm_page_queue_free_mtx  vm_page_queue_free_lock.data
 /*
@@ -403,6 +415,11 @@ void vm_page_cowfault (vm_page_t);
 int vm_page_cowsetup(vm_page_t);
 void vm_page_cowclear (vm_page_t);
 
+void vm_page_lock_KBI(vm_page_t m, const char *file, int line);
+void vm_page_unlock_KBI(vm_page_t m, const char *file, int line);
+int vm_page_trylock_KBI(vm_page_t m, const char *file, int line);
+void vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line);
+
 #ifdef INVARIANTS
 void vm_page_object_lock_assert(vm_page_t m);
 #defineVM_PAGE_OBJECT_LOCK_ASSERT(m)   vm_page_object_lock_assert(m)


pgpxFZVFLBvYF.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-07 Thread Kostik Belousov

On Mon, Nov 07, 2011 at 11:47:59AM -0800, m...@freebsd.org wrote:
 On Mon, Nov 7, 2011 at 11:35 AM, Kostik Belousov kostik...@gmail.com wrote:
  On Mon, Nov 07, 2011 at 11:45:38AM -0600, Alan Cox wrote:
  Ok.  I'll offer one final suggestion.  Please consider an alternative
  suffix to func.  Perhaps, kbi or KBI.  In other words, something
  that hints at the function's reason for existing.
 
  Sure. Below is the extraction of only vm_page_lock() bits, together
  with the suggested rename. When Attilio provides the promised simplification
  of the mutex KPI, this can be reduced.
 
  diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
  index 389aea5..ea4ea34 100644
  --- a/sys/vm/vm_page.c
  +++ b/sys/vm/vm_page.c
  @@ -2677,6 +2677,44 @@ vm_page_test_dirty(vm_page_t m)
                 vm_page_dirty(m);
   }
 
  +void
  +vm_page_lock_KBI(vm_page_t m, const char *file, int line)
  +{
  +
  +#if LOCK_DEBUG  0 || defined(MUTEX_NOINLINE)
  +       _mtx_lock_flags(vm_page_lockptr(m), 0, file, line);
  +#else
  +       __mtx_lock(vm_page_lockptr(m), 0, file, line);
  +#endif
  +}
  +
  +void
  +vm_page_unlock_KBI(vm_page_t m, const char *file, int line)
  +{
  +
  +#if LOCK_DEBUG  0 || defined(MUTEX_NOINLINE)
  +       _mtx_unlock_flags(vm_page_lockptr(m), 0, file, line);
  +#else
  +       __mtx_unlock(vm_page_lockptr(m), curthread, 0, file, line);
  +#endif
  +}
  +
  +int
  +vm_page_trylock_KBI(vm_page_t m, const char *file, int line)
  +{
  +
  +       return (_mtx_trylock(vm_page_lockptr(m), 0, file, line));
  +}
  +
  +void
  +vm_page_lock_assert_KBI(vm_page_t m, int a, const char *file, int line)
  +{
  +
  +#ifdef INVARIANTS
  +       _mtx_assert(vm_page_lockptr(m), a, file, line);
  +#endif
  +}
  +
   int so_zerocp_fullpage = 0;
 
   /*
  diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
  index 7099b70..a5604b7 100644
  --- a/sys/vm/vm_page.h
  +++ b/sys/vm/vm_page.h
  @@ -218,11 +218,23 @@ extern struct vpglocks pa_lock[];
 
   #define        PA_LOCK_ASSERT(pa, a)   mtx_assert(PA_LOCKPTR(pa), (a))
 
  +#ifdef KLD_MODULE
  +#define        vm_page_lock(m)         vm_page_lock_KBI((m), LOCK_FILE, 
  LOCK_LINE)
  +#define        vm_page_unlock(m)       vm_page_unlock_KBI((m), LOCK_FILE, 
  LOCK_LINE)
  +#define        vm_page_trylock(m)      vm_page_trylock_KBI((m), LOCK_FILE, 
  LOCK_LINE)
  +#ifdef INVARIANTS
  +#define        vm_page_lock_assert(m, a)       \
  +    vm_page_lock_assert_KBI((m), (a), LOCK_FILE, LOCK_LINE)
  +#else
  +#define        vm_page_lock_assert(m, a)
  +#endif
  +#else  /* KLD_MODULE */
   #define        vm_page_lockptr(m)      (PA_LOCKPTR(VM_PAGE_TO_PHYS((m
 
 Is it not possible to have vm_page_lockptr() be a function for the
 KLD_MODULE case?  Because then the vm_page_lock() functions and
 friends would all just use mtx_lock, etc., in both the KLD and non-KLD
 case.
 
 Or am I missing something?
It is possible, but I tried to avoid exposing lockptr.
IMHO vm_page_lockptr() is an implementation detail.

Please also see my other response to Alan regarding the relocking macros.


pgpYgFHehhwCS.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-06 Thread Kostik Belousov

On Sat, Nov 05, 2011 at 03:00:58PM -0500, Alan Cox wrote:
 On 11/05/2011 10:15, Kostik Belousov wrote:
 On Sat, Nov 05, 2011 at 07:37:48AM -0700, m...@freebsd.org wrote:
 On Sat, Nov 5, 2011 at 7:13 AM, Kostik Belousovkostik...@gmail.com  
 wrote:
 On Fri, Nov 04, 2011 at 06:03:39PM +0200, Kostik Belousov wrote:
 
 Below is the KBI patch after vm_page_bits_t merge is done.
 Again, I did not spent time converting all in-tree consumers
 from the (potentially) loadable modules to the new KPI until it
 is agreed upon.
 I like my bikeshed orange...
 
 I would think a more canonical name would be get/set rather than
 read/write, especially since these operations aren't reading and
 writing the page (neither are they getting the page, but at least set
 is pretty unambiguous).
 Look at the vm_page.h:385. vm_page_set_valid() is already taken.
 
 I don't feel good about creating an interface under which we have 
 functions named vm_page_set_valid() and vm_page_write_valid().  I think 
 that we should take a step back and look at the whole of set of 
 functions that exist for manipulating the page's valid and dirty field 
 and see if we can come up with a logical schema for naming them.  I 
 wouldn't then be surprised if this results in renaming some of the 
 existing functions.
 
 However, this should not delay the changes to address the vm_page_lock 
 problem.  I had two questions about that part of your patch.  First, is 
 there any reason that you didn't include vm_page_lockptr()?  If we 
 created the page locking macros that you, jhb@, and I were talking about 
 last week, then vm_page_lockptr() would be required.  Second, there 
 seems to be precedent for naming the locking functions _vm_page_lock() 
 instead of vm_page_lock_func(), for example, the mutex code, i.e., 
 sys/mutex.h, and the vm map locking functions.
I think vm_page_lockptr() should be included when some kind of
reloc-iterating macros are actually introduced into the tree. And,
I have a hope that they can be wrapped around a function with the
signature like
void vm_page_relock(vm_page_t locked_page, vm_page_t unlocked_page);
which moves the lock from locked_page to unlocked_page.

Regarding the _vm_page_lock() vs. vm_page_lock_func(), the mutex.h has
a lot of violations in regard of the namespaces, IMO. The __* namespace
is reserved for the language implementation, so our freestanding program
(kernel) ignores the requirements of the C standard with the names like
__mtx_lock_spin(). Using the name _vm_page_lock() is valid, but makes
it not unreasonable for other developers to introduce reserved names.
So I decided to use the suffixes. vm_map.h locking is free of these
violations.



pgpr5pwKbvDyo.pgp
Description: PGP signature

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-06 Thread Kostik Belousov

On Sun, Nov 06, 2011 at 07:22:51AM -0800, m...@freebsd.org wrote:
 On Sun, Nov 6, 2011 at 4:43 AM, Kostik Belousov kostik...@gmail.com wrote:
  Regarding the _vm_page_lock() vs. vm_page_lock_func(), the mutex.h has
  a lot of violations in regard of the namespaces, IMO. The __* namespace
  is reserved for the language implementation, so our freestanding program
  (kernel) ignores the requirements of the C standard with the names like
  __mtx_lock_spin(). Using the name _vm_page_lock() is valid, but makes
  it not unreasonable for other developers to introduce reserved names.
  So I decided to use the suffixes. vm_map.h locking is free of these
  violations.
 
 I'm pretty sure that when the C standard says, the implementation,
 they're referring to the compiler and OS it runs on.  Which makes the
 FreeBSD kernel part of the implementation, which is precisely why so
 many headers have defines that start with __ and then, if certain
 posix defines are set, also uses non-__ versions of the name.

For libc providing parts, required by standard, you are right.
But our kernel is a freestanding program using a compiler, so in-kernel
uses of the reserved namespace is a violation.


pgp4CGtNzn8lW.pgp
Description: PGP signature

Re: 9.0/i386 build failure

2011-11-05 Thread Kostik Belousov

On Sat, Nov 05, 2011 at 02:20:03PM +0100, Dimitry Andric wrote:
 On 2011-11-04 20:09, Michael W. Lucas wrote:
  I suspect I'm building on a system that's too old, but it's worth
  asking.
  
  FreeBSD eyeball.lodden.com 9.0-CURRENT FreeBSD 9.0-CURRENT #0: Sat Aug 29 
  00:31:14 EDT 2009 
  mwlu...@stretchlimo.blackhelicopters.org:/usr/obj/usr/src/sys/GENERIC  i386
  
  csup today.  no /etc/src.conf, /etc/make.conf only contains a perl
  definition.
  
  ...
  c++ -O2 -pipe -I/usr/src/usr.bin/clang/tblgen/../../../contrib/llvm/include 
  -I/usr/src/usr.bin/clang/tblgen/../../../contrib/llvm/tools/clang/include 
  -I/usr/src/usr.bin/clang/tblgen/../../../contrib/llvm/utils/TableGen -I. 
  -I/usr/src/usr.bin/clang/tblgen/../../../contrib/llvm/../../lib/clang/include
   -DLLVM_ON_UNIX -DLLVM_ON_FREEBSD -D__STDC_LIMIT_MACROS 
  -D__STDC_CONSTANT_MACROS -DLLVM_HOSTTRIPLE=\i386-unknown-freebsd9.0\ 
  -I/usr/obj/usr/src/tmp/legacy/usr/include  -static 
  -L/usr/obj/usr/src/tmp/legacy/usr/lib -o tblgen ARMDecoderEmitter.o 
  AsmMatcherEmitter.o AsmWriterEmitter.o AsmWriterInst.o CallingConvEmitter.o 
  CodeEmitterGen.o CodeGenDAGPatterns.o CodeGenInstruction.o 
  CodeGenRegisters.o CodeGenTarget.o DAGISelEmitter.o DAGISelMatcher.o 
  DAGISelMatcherEmitter.o DAGISelMatcherGen.o DAGISelMatcherOpt.o 
  DisassemblerEmitter.o EDEmitter.o FastISelEmitter.o 
  FixedLenDecoderEmitter.o InstrEnumEmitter.o InstrInfoEmitter.o 
  IntrinsicEmitter.o PseudoLoweringEmitter.o RegisterInfoEmi
 tter.o SetTheory.o StringMatcher.o SubtargetEmitter.o TGValueTypes.o 
 TableGen.o X86DisassemblerTables.o X86RecognizableInstr.o 
 /usr/obj/usr/src/tmp/usr/src/usr.bin/clang/tblgen/../../../lib/clang/libllvmtablegen/libllvmtablegen.a
  
 /usr/obj/usr/src/tmp/usr/src/usr.bin/clang/tblgen/../../../lib/clang/libllvmsupport/libllvmsupport.a
  -legacy
  /usr/obj/usr/src/tmp/usr/src/usr.bin/clang/tblgen/../../../lib/clang/libllvmsupport/libllvmsupport.a(Atomic.o)(.text+0x25):
   In function `llvm::sys::AtomicIncrement(unsigned int volatile*)':
  : undefined reference to `__sync_add_and_fetch_4'
 
 Hm, the only way I can imagine this happening, is then you compile your
 system with -march=i386.  In that case the atomic primitives, such as
 __sync_add_and_fetch, are called as external functions, which would lead
 to a link error like the above.
 
 I tried building 9-STABLE from a 9-CURRENT box that was not updated
 since April 2011, and from a fairly recent 8-STABLE, but both did not
 exhibit this issue.
 
 Are you using any special other settings, e.g. in your environment, or
 on the make command line?  In case, it would be handy if you post the
 full buildlog somewhere, so we can see which flags have been used to
 compile the llvmsupport library.
The system gcc was changed to assume march=486 some time ago.
I suppose that the current-9 system is before the change, see r198344.


pgpZInM9LLhcY.pgp
Description: PGP signature

vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-05 Thread Kostik Belousov

On Fri, Nov 04, 2011 at 06:03:39PM +0200, Kostik Belousov wrote:

Below is the KBI patch after vm_page_bits_t merge is done.
Again, I did not spent time converting all in-tree consumers
from the (potentially) loadable modules to the new KPI until it
is agreed upon.

diff --git a/sys/nfsclient/nfs_bio.c b/sys/nfsclient/nfs_bio.c
index 305c189..7264cd1 100644
--- a/sys/nfsclient/nfs_bio.c
+++ b/sys/nfsclient/nfs_bio.c
@@ -128,7 +128,7 @@ nfs_getpages(struct vop_getpages_args *ap)
 * can only occur at the file EOF.
 */
VM_OBJECT_LOCK(object);
-   if (pages[ap-a_reqpage]-valid != 0) {
+   if (vm_page_read_valid(pages[ap-a_reqpage]) != 0) {
for (i = 0; i  npages; ++i) {
if (i != ap-a_reqpage) {
vm_page_lock(pages[i]);
@@ -198,16 +198,16 @@ nfs_getpages(struct vop_getpages_args *ap)
/*
 * Read operation filled an entire page
 */
-   m-valid = VM_PAGE_BITS_ALL;
-   KASSERT(m-dirty == 0,
+   vm_page_write_valid(m, VM_PAGE_BITS_ALL);
+   KASSERT(vm_page_read_dirty(m) == 0,
(nfs_getpages: page %p is dirty, m));
} else if (size  toff) {
/*
 * Read operation filled a partial page.
 */
-   m-valid = 0;
+   vm_page_write_valid(m, 0);
vm_page_set_valid(m, 0, size - toff);
-   KASSERT(m-dirty == 0,
+   KASSERT(vm_page_read_dirty(m) == 0,
(nfs_getpages: page %p is dirty, m));
} else {
/*
diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index 389aea5..2f41e70 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2677,6 +2677,66 @@ vm_page_test_dirty(vm_page_t m)
vm_page_dirty(m);
 }
 
+void
+vm_page_lock_func(vm_page_t m, const char *file, int line)
+{
+
+#if LOCK_DEBUG  0 || defined(MUTEX_NOINLINE)
+   _mtx_lock_flags(vm_page_lockptr(m), 0, file, line);
+#else
+   __mtx_lock(vm_page_lockptr(m), 0, file, line);
+#endif
+}
+
+void
+vm_page_unlock_func(vm_page_t m, const char *file, int line)
+{
+
+#if LOCK_DEBUG  0 || defined(MUTEX_NOINLINE)
+   _mtx_unlock_flags(vm_page_lockptr(m), 0, file, line);
+#else
+   __mtx_unlock(vm_page_lockptr(m), curthread, 0, file, line);
+#endif
+}
+
+int
+vm_page_trylock_func(vm_page_t m, const char *file, int line)
+{
+
+   return (_mtx_trylock(vm_page_lockptr(m), 0, file, line));
+}
+
+void
+vm_page_lock_assert_func(vm_page_t m, int a, const char *file, int line)
+{
+
+#ifdef INVARIANTS
+   _mtx_assert(vm_page_lockptr(m), a, file, line);
+#endif
+}
+
+vm_page_bits_t
+vm_page_read_dirty_func(vm_page_t m)
+{
+
+   return (m-dirty);
+}
+
+vm_page_bits_t
+vm_page_read_valid_func(vm_page_t m)
+{
+
+   return (m-valid);
+}
+
+void
+vm_page_write_valid_func(vm_page_t m, vm_page_bits_t v)
+{
+
+   m-valid = v;
+}
+
+
 int so_zerocp_fullpage = 0;
 
 /*
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 7099b70..4f8f71e 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -218,12 +218,50 @@ extern struct vpglocks pa_lock[];
 
 #definePA_LOCK_ASSERT(pa, a)   mtx_assert(PA_LOCKPTR(pa), (a))
 
+#ifdef KLD_MODULE
+#definevm_page_lock(m) vm_page_lock_func((m), LOCK_FILE, 
LOCK_LINE)
+#definevm_page_unlock(m)   vm_page_unlock_func((m), LOCK_FILE, 
LOCK_LINE)
+#definevm_page_trylock(m)  vm_page_trylock_func((m), LOCK_FILE, 
LOCK_LINE)
+#ifdef INVARIANTS
+#definevm_page_lock_assert(m, a)   \
+vm_page_lock_assert_func((m), (a), LOCK_FILE, LOCK_LINE)
+#else
+#definevm_page_lock_assert(m, a)
+#endif
+
+#definevm_page_read_dirty(m)   vm_page_read_dirty_func((m))
+#definevm_page_read_valid(m)   vm_page_read_valid_func((m))
+#definevm_page_write_valid(m, v)   vm_page_write_valid_func((m), 
(v))
+
+#else  /* KLD_MODULE */
 #definevm_page_lockptr(m)  (PA_LOCKPTR(VM_PAGE_TO_PHYS((m
 #definevm_page_lock(m) mtx_lock(vm_page_lockptr((m)))
 #definevm_page_unlock(m)   mtx_unlock(vm_page_lockptr((m)))
 #definevm_page_trylock(m)  mtx_trylock(vm_page_lockptr((m)))
 #definevm_page_lock_assert(m, a)   
mtx_assert(vm_page_lockptr((m)), (a))
 
+static inline vm_page_bits_t
+vm_page_read_dirty(vm_page_t m)
+{
+
+   return (m-dirty);
+}
+
+static inline vm_page_bits_t
+vm_page_read_valid(vm_page_t m)
+{
+
+   return (m-valid);
+}
+
+static inline void
+vm_page_write_valid(vm_page_t m, vm_page_bits_t v)
+{
+
+   m-valid = v;
+}
+#endif
+
 #definevm_page_queue_free_mtx  vm_page_queue_free_lock.data
 /*
  * These are the flags defined for vm_page

Re: vm_page_t related KBI [Was: Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3]

2011-11-05 Thread Kostik Belousov

On Sat, Nov 05, 2011 at 07:37:48AM -0700, m...@freebsd.org wrote:
 On Sat, Nov 5, 2011 at 7:13 AM, Kostik Belousov kostik...@gmail.com wrote:
  On Fri, Nov 04, 2011 at 06:03:39PM +0200, Kostik Belousov wrote:
 
  Below is the KBI patch after vm_page_bits_t merge is done.
  Again, I did not spent time converting all in-tree consumers
  from the (potentially) loadable modules to the new KPI until it
  is agreed upon.
 
 I like my bikeshed orange...
 
 I would think a more canonical name would be get/set rather than
 read/write, especially since these operations aren't reading and
 writing the page (neither are they getting the page, but at least set
 is pretty unambiguous).
Look at the vm_page.h:385. vm_page_set_valid() is already taken.


pgpw68GTgJGiJ.pgp
Description: PGP signature

Re: 9.0/i386 build failure

2011-11-05 Thread Kostik Belousov

On Sat, Nov 05, 2011 at 08:06:26PM +0100, Dimitry Andric wrote:
 On 2011-11-05 14:28, Kostik Belousov wrote:
  On Sat, Nov 05, 2011 at 02:20:03PM +0100, Dimitry Andric wrote:
  On 2011-11-04 20:09, Michael W. Lucas wrote:
 ...
  : undefined reference to `__sync_add_and_fetch_4'
 ...
  The system gcc was changed to assume march=486 some time ago.
  I suppose that the current-9 system is before the change, see r198344.
 
 Yes, that is most likely the cause of this problem.  It is reproducible
 if you set CC to 'gcc -march=i386' and CXX to 'g++ -march=i386'.
 
 At first, I thought it would be easily fixable, since when you use
 -march=i386, the macro __tune_i386__ is defined, so you can disable
 LLVM's use of atomic builtins.
 
 However, since r212286, libstdc++ is also configured to use atomic
 builtins, and any non-trivial C++ program will encounter this linking
 issue when compiling with -march=i386.
 
 A workaround could be this:
 
 Index: gnu/lib/libstdc++/config.h
 ===
 --- gnu/lib/libstdc++/config.h  (revision 227112)
 +++ gnu/lib/libstdc++/config.h  (working copy)
 @@ -671,7 +671,7 @@
  /* #undef VERSION */
 
  /* Define if builtin atomic operations are supported on this host. */
 -#if defined(__amd64__) || defined(__i386__)
 +#if defined(__amd64__) || (defined(__i386__)  !defined(__tune_i386__))
  #define _GLIBCXX_ATOMIC_BUILTINS 1
  #endif
 
 but unfortunately during the bootstrap stage, the system includes are
 used, not those in the source tree...
 
 At the moment I don't know a clean way out of this, except setting CC to
 'gcc -march=i486' and CXX to 'g++ -march=i486', and then building the
 bootstrap-tools stage should at least complete successfully.
I believe that sources were bootstrapable after the commit, with the old
world. The solution could be to checkout exactly r198344, build and install
new world, and then checkout latest HEAD.

Personally, I would install from any snapshot, 9.0RC1 is good enough for
later HEAD rebuild.


pgpjNP4OIDsRB.pgp
Description: PGP signature

Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3

2011-11-04 Thread Kostik Belousov

On Thu, Nov 03, 2011 at 12:51:10PM -0500, Alan Cox wrote:
 On 11/03/2011 08:24, Kostik Belousov wrote:
 On Thu, Nov 03, 2011 at 12:40:08AM -0500, Alan Cox wrote:
 On 11/02/2011 05:32, Andriy Gapon wrote:
 [restored cc: to the original poster]
 As Bruce Evans has pointed to me privately [I am not sure why privately],
 there
 is already an example in i386 and amd64 atomic.h, where operations are
 inlined
 for a kernel build, but presented as real (external) functions for a 
 module
 build.  You can search e.g. sys/amd64/include/atomic.h for KLD_MODULE.
 
 I think that the same treatment could/should be applied to vm_page_*lock*
 operations defined in sys/vm/vm_page.h.
 *snip*
 
 Yes, it should be.  There are without question legitimate reasons for a
 module to acquire a page lock.
 I agree. Also, I think that we should use the opportunity to also isolate
 the modules from the struct vm_page layout changes. As example, I converted
 nfsclient.ko.
 
 
 I would suggest introducing the vm_page_bits_t change first.  If, at the 
 same time, you change the return type from the function vm_page_bits() 
 to use vm_page_bits_t, then I believe it is straightforward to fix all 
 of the places in vm_page.c that don't properly handle a 32 KB page size.
Ok, I think this is orhtohonal to the ABI issue. The vm_page_bits_t
applied.

diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index f14da4a..f22c34a 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -137,7 +137,7 @@ SYSCTL_INT(_vm, OID_AUTO, tryrelock_restart, CTLFLAG_RD,
 
 static uma_zone_t fakepg_zone;
 
-static void vm_page_clear_dirty_mask(vm_page_t m, int pagebits);
+static void vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits);
 static void vm_page_queue_remove(int queue, vm_page_t m);
 static void vm_page_enqueue(int queue, vm_page_t m);
 static void vm_page_init_fakepg(void *dummy);
@@ -2350,7 +2350,7 @@ retrylookup:
  *
  * Inputs are required to range within a page.
  */
-int
+vm_page_bits_t
 vm_page_bits(int base, int size)
 {
int first_bit;
@@ -2367,7 +2367,8 @@ vm_page_bits(int base, int size)
first_bit = base  DEV_BSHIFT;
last_bit = (base + size - 1)  DEV_BSHIFT;
 
-   return ((2  last_bit) - (1  first_bit));
+   return (((vm_page_bits_t)2  last_bit) -
+   ((vm_page_bits_t)1  first_bit));
 }
 
 /*
@@ -2426,7 +2427,7 @@ vm_page_set_valid(vm_page_t m, int base, int size)
  * Clear the given bits from the specified page's dirty field.
  */
 static __inline void
-vm_page_clear_dirty_mask(vm_page_t m, int pagebits)
+vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits)
 {
uintptr_t addr;
 #if PAGE_SIZE  16384
@@ -2455,7 +2456,6 @@ vm_page_clear_dirty_mask(vm_page_t m, int pagebits)
 */
addr = (uintptr_t)m-dirty;
 #if PAGE_SIZE == 32768
-#error pagebits too short
atomic_clear_64((uint64_t *)addr, pagebits);
 #elif PAGE_SIZE == 16384
atomic_clear_32((uint32_t *)addr, pagebits);
@@ -2492,8 +2492,8 @@ vm_page_clear_dirty_mask(vm_page_t m, int pagebits)
 void
 vm_page_set_validclean(vm_page_t m, int base, int size)
 {
-   u_long oldvalid;
-   int endoff, frag, pagebits;
+   vm_page_bits_t oldvalid, pagebits;
+   int endoff, frag;
 
VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED);
if (size == 0)  /* handle degenerate case */
@@ -2505,7 +2505,7 @@ vm_page_set_validclean(vm_page_t m, int base, int size)
 * first block.
 */
if ((frag = base  ~(DEV_BSIZE - 1)) != base 
-   (m-valid  (1  (base  DEV_BSHIFT))) == 0)
+   (m-valid  ((vm_page_bits_t)1  (base  DEV_BSHIFT))) == 0)
pmap_zero_page_area(m, frag, base - frag);
 
/*
@@ -2515,7 +2515,7 @@ vm_page_set_validclean(vm_page_t m, int base, int size)
 */
endoff = base + size;
if ((frag = endoff  ~(DEV_BSIZE - 1)) != endoff 
-   (m-valid  (1  (endoff  DEV_BSHIFT))) == 0)
+   (m-valid  ((vm_page_bits_t)1  (endoff  DEV_BSHIFT))) == 0)
pmap_zero_page_area(m, endoff,
DEV_BSIZE - (endoff  (DEV_BSIZE - 1)));
 
@@ -2585,7 +2585,7 @@ vm_page_clear_dirty(vm_page_t m, int base, int size)
 void
 vm_page_set_invalid(vm_page_t m, int base, int size)
 {
-   int bits;
+   vm_page_bits_t bits;
 
VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED);
KASSERT((m-oflags  VPO_BUSY) == 0,
@@ -2656,9 +2656,10 @@ vm_page_zero_invalid(vm_page_t m, boolean_t setvalid)
 int
 vm_page_is_valid(vm_page_t m, int base, int size)
 {
-   int bits = vm_page_bits(base, size);
+   vm_page_bits_t bits;
 
VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED);
+   bits = vm_page_bits(base, size);
if (m-valid  ((m-valid  bits) == bits))
return 1;
else
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 23637bb..7d1a529 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -113,6 +113,21 @@
 
 TAILQ_HEAD(pglist

Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3

2011-11-04 Thread Kostik Belousov

On Fri, Nov 04, 2011 at 10:09:09AM -0500, Alan Cox wrote:
 On 11/04/2011 05:08, Kostik Belousov wrote:
 On Thu, Nov 03, 2011 at 12:51:10PM -0500, Alan Cox wrote:
 I would suggest introducing the vm_page_bits_t change first.  If, at the
 same time, you change the return type from the function vm_page_bits()
 to use vm_page_bits_t, then I believe it is straightforward to fix all
 of the places in vm_page.c that don't properly handle a 32 KB page size.
 Ok, I think this is orhtohonal to the ABI issue. The vm_page_bits_t
 applied.
 
 Agreed, which is why I wanted to separate the two things.
 
 I've made a few comments below.
...

 Looks good.

I will make universe the patch below. Any further notes ?

diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index f14da4a..f398453 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -137,7 +137,7 @@ SYSCTL_INT(_vm, OID_AUTO, tryrelock_restart, CTLFLAG_RD,
 
 static uma_zone_t fakepg_zone;
 
-static void vm_page_clear_dirty_mask(vm_page_t m, int pagebits);
+static void vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits);
 static void vm_page_queue_remove(int queue, vm_page_t m);
 static void vm_page_enqueue(int queue, vm_page_t m);
 static void vm_page_init_fakepg(void *dummy);
@@ -2350,7 +2350,7 @@ retrylookup:
  *
  * Inputs are required to range within a page.
  */
-int
+vm_page_bits_t
 vm_page_bits(int base, int size)
 {
int first_bit;
@@ -2367,7 +2367,8 @@ vm_page_bits(int base, int size)
first_bit = base  DEV_BSHIFT;
last_bit = (base + size - 1)  DEV_BSHIFT;
 
-   return ((2  last_bit) - (1  first_bit));
+   return (((vm_page_bits_t)2  last_bit) -
+   ((vm_page_bits_t)1  first_bit));
 }
 
 /*
@@ -2426,7 +2427,7 @@ vm_page_set_valid(vm_page_t m, int base, int size)
  * Clear the given bits from the specified page's dirty field.
  */
 static __inline void
-vm_page_clear_dirty_mask(vm_page_t m, int pagebits)
+vm_page_clear_dirty_mask(vm_page_t m, vm_page_bits_t pagebits)
 {
uintptr_t addr;
 #if PAGE_SIZE  16384
@@ -2455,7 +2456,6 @@ vm_page_clear_dirty_mask(vm_page_t m, int pagebits)
 */
addr = (uintptr_t)m-dirty;
 #if PAGE_SIZE == 32768
-#error pagebits too short
atomic_clear_64((uint64_t *)addr, pagebits);
 #elif PAGE_SIZE == 16384
atomic_clear_32((uint32_t *)addr, pagebits);
@@ -2492,8 +2492,8 @@ vm_page_clear_dirty_mask(vm_page_t m, int pagebits)
 void
 vm_page_set_validclean(vm_page_t m, int base, int size)
 {
-   u_long oldvalid;
-   int endoff, frag, pagebits;
+   vm_page_bits_t oldvalid, pagebits;
+   int endoff, frag;
 
VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED);
if (size == 0)  /* handle degenerate case */
@@ -2505,7 +2505,7 @@ vm_page_set_validclean(vm_page_t m, int base, int size)
 * first block.
 */
if ((frag = base  ~(DEV_BSIZE - 1)) != base 
-   (m-valid  (1  (base  DEV_BSHIFT))) == 0)
+   (m-valid  ((vm_page_bits_t)1  (base  DEV_BSHIFT))) == 0)
pmap_zero_page_area(m, frag, base - frag);
 
/*
@@ -2515,7 +2515,7 @@ vm_page_set_validclean(vm_page_t m, int base, int size)
 */
endoff = base + size;
if ((frag = endoff  ~(DEV_BSIZE - 1)) != endoff 
-   (m-valid  (1  (endoff  DEV_BSHIFT))) == 0)
+   (m-valid  ((vm_page_bits_t)1  (endoff  DEV_BSHIFT))) == 0)
pmap_zero_page_area(m, endoff,
DEV_BSIZE - (endoff  (DEV_BSIZE - 1)));
 
@@ -2585,7 +2585,7 @@ vm_page_clear_dirty(vm_page_t m, int base, int size)
 void
 vm_page_set_invalid(vm_page_t m, int base, int size)
 {
-   int bits;
+   vm_page_bits_t bits;
 
VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED);
KASSERT((m-oflags  VPO_BUSY) == 0,
@@ -2625,7 +2625,7 @@ vm_page_zero_invalid(vm_page_t m, boolean_t setvalid)
 */
for (b = i = 0; i = PAGE_SIZE / DEV_BSIZE; ++i) {
if (i == (PAGE_SIZE / DEV_BSIZE) || 
-   (m-valid  (1  i))
+   (m-valid  ((vm_page_bits_t)1  i))
) {
if (i  b) {
pmap_zero_page_area(m, 
@@ -2656,9 +2656,10 @@ vm_page_zero_invalid(vm_page_t m, boolean_t setvalid)
 int
 vm_page_is_valid(vm_page_t m, int base, int size)
 {
-   int bits = vm_page_bits(base, size);
+   vm_page_bits_t bits;
 
VM_OBJECT_LOCK_ASSERT(m-object, MA_OWNED);
+   bits = vm_page_bits(base, size);
if (m-valid  ((m-valid  bits) == bits))
return 1;
else
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 23637bb..e3eb08c 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -113,6 +113,20 @@
 
 TAILQ_HEAD(pglist, vm_page);
 
+#if PAGE_SIZE == 4096
+#define VM_PAGE_BITS_ALL 0xffu
+typedef uint8_t vm_page_bits_t;
+#elif PAGE_SIZE == 8192
+#define VM_PAGE_BITS_ALL 0xu
+typedef uint16_t vm_page_bits_t;
+#elif PAGE_SIZE == 16384

Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3

2011-11-04 Thread Kostik Belousov

On Fri, Nov 04, 2011 at 10:48:45AM -0500, Alan Cox wrote:
 On 11/04/2011 10:30, Kostik Belousov wrote:
  for (b = i = 0; i= PAGE_SIZE / DEV_BSIZE; ++i) {
  if (i == (PAGE_SIZE / DEV_BSIZE) ||
 -(m-valid  (1  i))
 +(m-valid  ((vm_page_bits_t)1  i))
  ) {
  if (i  b) {
  pmap_zero_page_area(m,
 
 While we're here, we might as well fix the old style bug above by moving 
 the ) { to the proper place.

Fixed, this does not warrant the universe restart.


pgpxPx0PIpYaC.pgp
Description: PGP signature

Re: panic at vm_page_wire with FreeBSD 9.0 Beta 3

2011-11-03 Thread Kostik Belousov

On Thu, Nov 03, 2011 at 12:40:08AM -0500, Alan Cox wrote:
 On 11/02/2011 05:32, Andriy Gapon wrote:
 [restored cc: to the original poster]
 As Bruce Evans has pointed to me privately [I am not sure why privately], 
 there
 is already an example in i386 and amd64 atomic.h, where operations are 
 inlined
 for a kernel build, but presented as real (external) functions for a module
 build.  You can search e.g. sys/amd64/include/atomic.h for KLD_MODULE.
 
 I think that the same treatment could/should be applied to vm_page_*lock*
 operations defined in sys/vm/vm_page.h.
 *snip*
 
 Yes, it should be.  There are without question legitimate reasons for a 
 module to acquire a page lock.

I agree. Also, I think that we should use the opportunity to also isolate
the modules from the struct vm_page layout changes. As example, I converted
nfsclient.ko.

Patch is not tested.

diff --git a/sys/nfsclient/nfs_bio.c b/sys/nfsclient/nfs_bio.c
index 305c189..7264cd1 100644
--- a/sys/nfsclient/nfs_bio.c
+++ b/sys/nfsclient/nfs_bio.c
@@ -128,7 +128,7 @@ nfs_getpages(struct vop_getpages_args *ap)
 * can only occur at the file EOF.
 */
VM_OBJECT_LOCK(object);
-   if (pages[ap-a_reqpage]-valid != 0) {
+   if (vm_page_read_valid(pages[ap-a_reqpage]) != 0) {
for (i = 0; i  npages; ++i) {
if (i != ap-a_reqpage) {
vm_page_lock(pages[i]);
@@ -198,16 +198,16 @@ nfs_getpages(struct vop_getpages_args *ap)
/*
 * Read operation filled an entire page
 */
-   m-valid = VM_PAGE_BITS_ALL;
-   KASSERT(m-dirty == 0,
+   vm_page_write_valid(m, VM_PAGE_BITS_ALL);
+   KASSERT(vm_page_read_dirty(m) == 0,
(nfs_getpages: page %p is dirty, m));
} else if (size  toff) {
/*
 * Read operation filled a partial page.
 */
-   m-valid = 0;
+   vm_page_write_valid(m, 0);
vm_page_set_valid(m, 0, size - toff);
-   KASSERT(m-dirty == 0,
+   KASSERT(vm_page_read_dirty(m) == 0,
(nfs_getpages: page %p is dirty, m));
} else {
/*
diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
index f14da4a..5b8b4e3 100644
--- a/sys/vm/vm_page.c
+++ b/sys/vm/vm_page.c
@@ -2677,6 +2677,66 @@ vm_page_test_dirty(vm_page_t m)
vm_page_dirty(m);
 }
 
+void
+vm_page_lock_func(vm_page_t m, const char *file, int line)
+{
+
+#if LOCK_DEBUG  0 || defined(MUTEX_NOINLINE)
+   _mtx_lock_flags(vm_page_lockptr(m), 0, file, line);
+#else
+   __mtx_lock(vm_page_lockptr(m), 0, file, line);
+#endif
+}
+
+void
+vm_page_unlock_func(vm_page_t m, const char *file, int line)
+{
+
+#if LOCK_DEBUG  0 || defined(MUTEX_NOINLINE)
+   _mtx_unlock_flags(vm_page_lockptr(m), 0, file, line);
+#else
+   __mtx_unlock(vm_page_lockptr(m), curthread, 0, file, line);
+#endif
+}
+
+int
+vm_page_trylock_func(vm_page_t m, const char *file, int line)
+{
+
+   return (_mtx_trylock(vm_page_lockptr(m), 0, file, line));
+}
+
+void
+vm_page_lock_assert_func(vm_page_t m, int a, const char *file, int line)
+{
+
+#ifdef INVARIANTS
+   _mtx_assert(vm_page_lockptr(m), a, file, line);
+#endif
+}
+
+vm_page_bits_t
+vm_page_read_dirty_func(vm_page_t m)
+{
+
+   return (m-dirty);
+}
+
+vm_page_bits_t
+vm_page_read_valid_func(vm_page_t m)
+{
+
+   return (m-valid);
+}
+
+void
+vm_page_write_valid_func(vm_page_t m, vm_page_bits_t v)
+{
+
+   m-valid = v;
+}
+
+
 int so_zerocp_fullpage = 0;
 
 /*
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 23637bb..618ba2b 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -113,6 +113,21 @@
 
 TAILQ_HEAD(pglist, vm_page);
 
+#if PAGE_SIZE == 4096
+#define VM_PAGE_BITS_ALL 0xffu
+typedef uint8_t vm_page_bits_t;
+#elif PAGE_SIZE == 8192
+#define VM_PAGE_BITS_ALL 0xu
+typedef uint16_t vm_page_bits_t;
+#elif PAGE_SIZE == 16384
+#define VM_PAGE_BITS_ALL 0xu
+typedef uint32_t vm_page_bits_t;
+#elif PAGE_SIZE == 32768
+#define VM_PAGE_BITS_ALL 0xlu
+typedef uint64_t vm_page_bits_t;
+#endif
+
+
 struct vm_page {
TAILQ_ENTRY(vm_page) pageq; /* queue info for FIFO queue or free 
list (Q) */
TAILQ_ENTRY(vm_page) listq; /* pages in same object (O) */
@@ -138,19 +153,8 @@ struct vm_page {
/* NOTE that these must support one bit per DEV_BSIZE in a page!!! */
/* so, on normal X86 kernels, they must be at least 8 bits wide */
/* In reality, support for 32KB pages is not fully implemented. */
-#if PAGE_SIZE == 4096
-   uint8_t valid;  /* map of valid DEV_BSIZE chunks (O) */
-   uint8_t dirty;  /* map of dirty

Re: In-kernel API for tasks, which could wait?

2011-10-30 Thread Kostik Belousov

On Sun, Oct 30, 2011 at 06:54:51PM +0400, Lev Serebryakov wrote:
   So, I have question: what should I do if I need to perofrm ONE
 action, which could block for some time (for example, open file or
 create ALQ)?
 
  I could create thread for this. But it looks strange and too heavy: create 
 thread
 to call one function once.
 
  Maybe, kernel has some API to postpone such task to one,
 always-running idle thread?
See taskqueue(9). On the other hand, waiting for the enqueued task to
finish is itself the sleepable action.


pgpei4uyxuJ1X.pgp
Description: PGP signature

x86: how to get maximum possible CPU frequency ?

2011-10-30 Thread Kostik Belousov

Assume we are running on the single-package X86 machine, how to answer
the question: what is the possible maximum tsc frequency ?

I read tsc_levels_changed(), is it the right way to query the max
frequency for the general purpose driver ? If yes, could the code be
made into the utility function ?

BTW, there is some usage of the barriers in sysctl_machdep_tsc_freq()
that I do not understand. When the read from tsc_freq is performed
with the acquire semantic ? Why there are two store barriers when
tsc_freq and tsc_timecounter.tc_frequency are written ? I would think
that a single barrier on the last write is enough, but is it needed
at all ?


pgpG9zwXR7qbc.pgp
Description: PGP signature

Re: getting the cpuid for a userspace process ?

2011-10-25 Thread Kostik Belousov

On Tue, Oct 25, 2011 at 01:42:45PM -0400, John Baldwin wrote:
 On Tuesday, October 25, 2011 11:06:22 am Luigi Rizzo wrote:
  as the subject says... is there any way to get the current
  CPU id for a userspace process (of course,
  valid only at the time the function is called as the
  process might be arbitrarily moved while it runs)
 
 Not from userland, no.  On x86 you can use cpuid to fetch the APIC ID, but 
 that does not map 1:1 to FreeBSD cpu IDs.
Not quite so. The kern.proc sysctls do provide oncpu and lastcpu
information, which, I believe, is used by top. But this is very slow way
to get cpu id.


pgpjMU1cp088f.pgp
Description: PGP signature

Re: 9.0 RC1/Clang / illegal instruction (Signal 4) in gengtype while building cc_tools on i586.

2011-10-23 Thread Kostik Belousov

On Sun, Oct 23, 2011 at 10:24:12AM +0200, Roman Divacky wrote:
  Program received signal SIGILL, Illegal instruction.
  0x08048b24 in do_typedef (s=0x80532bf CUMULATIVE_ARGS, pos=0x805e1a4)
  at 
  /usr/src/gnu/usr.bin/cc/cc_tools/../../../../contrib/gcc/gengtype.c:103
  103 {
  
  (gdb) disas 0x08048b24
  Dump of assembler code for function do_typedef:
  0x08048b10 do_typedef+0:  push   %ebp
  0x08048b11 do_typedef+1:  mov%esp,%ebp
  0x08048b13 do_typedef+3:  push   %ebx
  0x08048b14 do_typedef+4:  push   %edi
  0x08048b15 do_typedef+5:  push   %esi
  0x08048b16 do_typedef+6:  sub$0xc,%esp
  0x08048b19 do_typedef+9:  mov$0x805e1d4,%edi
  0x08048b1e do_typedef+14: mov0x10(%ebp),%esi
  0x08048b21 do_typedef+17: mov0x8(%ebp),%ebx
  0x08048b24 do_typedef+20: nopw   %cs:0x0(%eax,%eax,1)
 
 LLVM attempts to use an optimal nop sequence when writing N-byte nop,
 by using these nop instructions
 
   static const uint8_t Nops[10][10] = {
 // nop
 {0x90},
 // xchg %ax,%ax
 {0x66, 0x90},
 // nopl (%[re]ax)
 {0x0f, 0x1f, 0x00},
 // nopl 0(%[re]ax)
 {0x0f, 0x1f, 0x40, 0x00},
 // nopl 0(%[re]ax,%[re]ax,1)
 {0x0f, 0x1f, 0x44, 0x00, 0x00},
 // nopw 0(%[re]ax,%[re]ax,1)
 {0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00},
 // nopl 0L(%[re]ax)
 {0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00},
 // nopl 0L(%[re]ax,%[re]ax,1)
 {0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
 // nopw 0L(%[re]ax,%[re]ax,1)
 {0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
 // nopw %cs:0L(%[re]ax,%[re]ax,1)
 {0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
   };
 
 There's no checking for a supported CPU, is it so that AMD geode doesnt 
 support any of these?
 Any other cpu that doesnt support these? If this is CPU dependant, I suggest 
 to open a bug
 report upstream as it's a bug.

Long nops are supported only on specific CPUs. Unconditional use of them
is a plain bug, like unconditional use of cmovXX.


pgpXKuJPtriZD.pgp
Description: PGP signature

Re: 9.0-RC1 panic in tcp_input: negative winow.

2011-10-23 Thread Kostik Belousov

On Sun, Oct 23, 2011 at 08:10:38AM +0200, Pawel Jakub Dawidek wrote:
 On Sun, Oct 23, 2011 at 12:35:15PM +1100, Lawrence Stewart wrote:
  On 10/22/11 19:49, Pawel Jakub Dawidek wrote:
   The panic message says:
  
 panic: tcp_input negative window: tp 0xfe007763e000 rcv_nxt 
   3718269252 rcv_adv 3718268291
  
   I only have picture of the backtrace:
  
 http://people.freebsd.org/~pjd/misc/panic_negative_window.jpg
  
  
  ewww that is not good. Can you give us any more information about the 
  machine and what it's doing? Is it terminating TCP connections from the 
  internet at large or only local LAN (i.e. is there likely to be packet 
  loss happening)? Are you doing TSO or LRO? Do you have any non-default 
  tuning in place?
 
 It is my local file server. It is doing NFS and AFP over LAN and also
 downloads files from the internet. It is triggered after few hours.
 I changed the KASSERT() into printf() and added printing 'win' variable
 and this is what got logged during the night:
 
 05:16:24 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107827269 
 rcv_adv 1107826256 win=242
 05:16:29 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107833451 
 rcv_adv 1107832977 win=880
 05:16:41 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1107849563 
 rcv_adv 1107848860 win=639
 05:20:02 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108108230 
 rcv_adv 1108107331 win=567
 05:24:30 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108433302 
 rcv_adv 1108432272 win=974
 05:24:46 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108450385 
 rcv_adv 1108450060 win=751
 05:26:44 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108574818 
 rcv_adv 1108573851 win=71
 05:28:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108654103 
 rcv_adv 1108653166 win=0
 05:28:43 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108692396 
 rcv_adv 1108691451 win=0
 05:30:06 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1108781258 
 rcv_adv 1108780372 win=235
 05:35:05 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109067578 
 rcv_adv 1109067335 win=663
 05:37:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109180403 
 rcv_adv 1109179411 win=0
 05:41:03 tcp_input negative window: tp 0xfe0026772b70 rcv_nxt 1109428265 
 rcv_adv 1109427375 win=170
 
 And the systems seems to be fine.
 
 I'm happy to test patches, but one round would take 24h.
 
 My suggestion would be that if we won't be able to fix it before 9.0,
 we should turn this assertion off, as the system seems to be able to
 recover.

Shipped kernels have all assertions turned off.


pgpKtmVl4jzPD.pgp
Description: PGP signature

Re: [RFC] Enable nxstack by default

2011-10-18 Thread Kostik Belousov

On Mon, Oct 17, 2011 at 09:30:56PM +0200, Oliver Pinter wrote:
 Hi all!
 
 I think, it's the time to enable the nxstack feature. Any comments,
 pros, cons?

I dragged the change long enough for it to miss the 9.0.
After the 9.0 is released, I will flip the switch with the following
change.

diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c
index 8455f48..926fe64 100644
--- a/sys/kern/imgact_elf.c
+++ b/sys/kern/imgact_elf.c
@@ -118,7 +118,12 @@ static int elf_legacy_coredump = 0;
 SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW, 
 elf_legacy_coredump, 0, );
 
-static int __elfN(nxstack) = 0;
+int __elfN(nxstack) =
+#if defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit */
+   1;
+#else
+   0;
+#endif
 SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO,
 nxstack, CTLFLAG_RW, __elfN(nxstack), 0,
 __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) : enable non-executable stack);
diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
index 7500462..0e27351 100644
--- a/sys/powerpc/aim/mmu_oea64.c
+++ b/sys/powerpc/aim/mmu_oea64.c
@@ -1445,6 +1445,8 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, 
u_int8_t *flags, int wait)
return (void *)va;
 }
 
+extern int elf32_nxstack;
+
 void
 moea64_init(mmu_t mmu)
 {
@@ -1464,6 +1466,8 @@ moea64_init(mmu_t mmu)
uma_zone_set_allocf(moea64_mpvo_zone,moea64_uma_page_alloc);
}
 
+   elf32_nxstack = 1;
+
moea64_initialized = TRUE;
 }
 
diff --git a/sys/powerpc/booke/machdep.c b/sys/powerpc/booke/machdep.c
index c2b5e6f..82a37e1 100644
--- a/sys/powerpc/booke/machdep.c
+++ b/sys/powerpc/booke/machdep.c
@@ -192,6 +192,8 @@ void print_kernel_section_addr(void);
 void print_kenv(void);
 u_int booke_init(uint32_t, uint32_t);
 
+extern int elf32_nxstack;
+
 static void
 cpu_e500_startup(void *dummy)
 {
@@ -227,6 +229,9 @@ cpu_e500_startup(void *dummy)
/* Set up buffers, so they can be used to read disk labels. */
bufinit();
vm_pager_bufferinit();
+
+   /* Cpu supports execution permissions on the pages. */
+   elf32_nxstack = 1;
 }
 
 static char *



pgpYBjVPeBl7n.pgp
Description: PGP signature

Re: freebsd-9.0 smartmontools and ada devices

2011-10-18 Thread Kostik Belousov

On Tue, Oct 18, 2011 at 11:02:42AM +0200, John Hay wrote:
 On Tue, Oct 18, 2011 at 09:39:24AM +0200, John Hay wrote:
  Hi Guys,
  
  I have upgraded my desktop from 8.2-stable to 9.0-RC1 (from source), using
  a GENERIC kernel. I have installed the smartmontools-5.41_3 package from
  a mirror and found that smartmontools does not like the ada devices anymore.
  Previously (8.2) I had a GENERIC kernel, with ahci loaded in loader.conf.
  There an older smartmontools (5.40) worked without a problem on the ada
  devices.
  
  The output of smartctl looks like this:
  
  #
  dolphin# smartctl -a /dev/ada0
  smartctl 5.41 2011-06-09 r3365 [FreeBSD 9.0-RC1 amd64] (local build)
  Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
  
  error sending CAMIOCOMMAND ioctl: Inappropriate ioctl for device
  Unable to get CAM device list
  /dev/ada0: Unable to detect device type
  Smartctl: please specify device type with the -d option.
  
  Use smartctl -h to get a usage summary
  
  #
 
 Just to follow up on myself. :-( I have build smartmontools from ports and
 even though it is the same version, it works. So for me the package
 amd64/packages-9-current/All/smartmontools-5.41_3.tbz did not work, but the
 port does.

CAM ABI was changed right before RC1. The issue was mentioned in the Ken'
announcement. The packages were obviously built with the old headers.


pgpCP3eHnNxYY.pgp
Description: PGP signature

Re: [RFC] Enable nxstack by default

2011-10-18 Thread Kostik Belousov

On Tue, Oct 18, 2011 at 01:06:27PM -0400, Arnaud Lacombe wrote:
 Hi,
 
 On Tue, Oct 18, 2011 at 12:53 PM, Oliver Pinter oliver.p...@gmail.com wrote:
  On 10/18/11, Arnaud Lacombe lacom...@gmail.com wrote:
  Hi,
 
  On Tue, Oct 18, 2011 at 11:44 AM, Garrett Cooper yaneg...@gmail.com 
  wrote:
  On Tue, 18 Oct 2011, Arnaud Lacombe wrote:
 
  Hi,
 
  On Tue, Oct 18, 2011 at 5:07 AM, Kostik Belousov kostik...@gmail.com
  wrote:
 
  On Mon, Oct 17, 2011 at 09:30:56PM +0200, Oliver Pinter wrote:
 
  Hi all!
 
  I think, it's the time to enable the nxstack feature. Any comments,
  pros, cons?
 
  I dragged the change long enough for it to miss the 9.0.
  After the 9.0 is released, I will flip the switch with the following
  change.
 
  diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c
  index 8455f48..926fe64 100644
  --- a/sys/kern/imgact_elf.c
  +++ b/sys/kern/imgact_elf.c
  @@ -118,7 +118,12 @@ static int elf_legacy_coredump = 0;
   SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW,
      elf_legacy_coredump, 0, );
 
  -static int __elfN(nxstack) = 0;
  +int __elfN(nxstack) =
  +#if defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit
  */
 
  Why leaving 32bits x86 CPU supporting the NX feature behind ?
 
  Most likely because it was assumed that i386 doesn't fully support it.
  According to ye great Wikipedia, NX support didn't roll into i386 until
  Prescott, which was pretty late in the non-64-bit capable family of CPUs,
  as
  its successor -- Conroe -- was 64-bit. Intel detuned some of the early
  Dual
  Core Pentiums, e.g. the Yonahs to not talk 64-bit. Not sure about AMD.
 
  There are probably more details in binutils, gcc, etc, that I'm missing
  and
  Kostik can expound on.
 
  NX support is advertised in the cpuid flags, just add the logic to
  handle this interface. Kostik's patch is just incomplete, but he's got
  a commit bit so he can commit it as-is, as he will.
 
  If nonexec_stack becomes the default, it should be on every CPU
  supporting the feature, not just the low-hanging one.
 
   - Arnaud
 
 
  the NX detection code already implemented in i386, but this feature
  required PAE:
 
 yes, this is the conclusion I reached too. But this does not change
 the fact that the VM should know about that, and this is missing from
 Kostik's patch. I guess the first hunk should read:
 
 @@ -118,7 +118,12 @@ static int elf_legacy_coredump = 0;
  SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW,
 elf_legacy_coredump, 0, );
 
 -static int __elfN(nxstack) = 0;
 +int __elfN(nxstack) =
 +#if defined(PAE) || defined(__amd64__) || defined(__powerpc64__) /*
 both 64 and 32 bit */
 +   1;
 +#else
 +   0;
 +#endif
  SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO,
 nxstack, CTLFLAG_RW, __elfN(nxstack), 0,
 __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) : enable non-executable 
 stack);

Your patch is not right, it will cause even more user confusion.
The presence of the PAE kernel does not imply that CPU supports nx.

Below is the updated patch that turns on nxstack by default for the PAE
kernels on NX-capable CPUs. Note that i386 usermode fully supports the
PT_GNU_STACK annotations and cares about not enabling nx on stack pages
unneccessary, but my main target was compat32 on amd64.

The fact that nxstack is not enabled by default does not prevent
administrator from manually enabling the feature.

Since you worried so much about PAE case, I sincerely expect that you
will test the change. Thank you in advance.

diff --git a/sys/i386/i386/initcpu.c b/sys/i386/i386/initcpu.c
index c2daf54..ec77adb 100644
--- a/sys/i386/i386/initcpu.c
+++ b/sys/i386/i386/initcpu.c
@@ -650,6 +650,8 @@ enable_sse(void)
 #endif
 }
 
+extern int elf32_nxstack;
+
 void
 initializecpu(void)
 {
@@ -739,6 +741,7 @@ initializecpu(void)
msr = rdmsr(MSR_EFER) | EFER_NXE;
wrmsr(MSR_EFER, msr);
pg_nx = PG_NX;
+   elf32_nxstack = 1;
}
 #endif
break;
diff --git a/sys/kern/imgact_elf.c b/sys/kern/imgact_elf.c
index 8455f48..926fe64 100644
--- a/sys/kern/imgact_elf.c
+++ b/sys/kern/imgact_elf.c
@@ -118,7 +118,12 @@ static int elf_legacy_coredump = 0;
 SYSCTL_INT(_debug, OID_AUTO, __elfN(legacy_coredump), CTLFLAG_RW, 
 elf_legacy_coredump, 0, );
 
-static int __elfN(nxstack) = 0;
+int __elfN(nxstack) =
+#if defined(__amd64__) || defined(__powerpc64__) /* both 64 and 32 bit */
+   1;
+#else
+   0;
+#endif
 SYSCTL_INT(__CONCAT(_kern_elf, __ELF_WORD_SIZE), OID_AUTO,
 nxstack, CTLFLAG_RW, __elfN(nxstack), 0,
 __XSTRING(__CONCAT(ELF, __ELF_WORD_SIZE)) : enable non-executable stack);
diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
index 7500462..0e27351 100644
--- a/sys/powerpc/aim/mmu_oea64.c
+++ b/sys/powerpc/aim/mmu_oea64.c
@@ -1445,6 +1445,8 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, 
u_int8_t

Re: pmap_qenter() - the page must be wired - is violated

2011-10-06 Thread Kostik Belousov

On Thu, Oct 06, 2011 at 01:45:16PM +0200, Svatopluk Kraus wrote:
 On Wed, Oct 5, 2011 at 4:53 PM, Kostik Belousov kostik...@gmail.com wrote:
  On Wed, Oct 05, 2011 at 02:28:01PM +0200, Svatopluk Kraus wrote:
  Hi,
 
    I found out that on a few places pmap_qenter() is called on pages
  which are not wired. For example, in the following functions, when
  vm_pager_get_pages() is called, the pages are not wired:
 
    exec_map_first_page() in sys/kern/kern_exec.c
    vm_fault_hold() in sys/vm/vm_fault.c
    vm_imgact_hold_page() in sys/vm/vm_glue.c
    vm_object_populate() in sys/vm/vm_object.c
    mdstart_swap() in sys/dev/md/md.c
 
    Is the rule violated or the rule should be changed?
 
  Lets first discuss where did you found the calls to pmap_qenter().
  Can you point out exact line numbers of the calls to pmap_qenter()
  that you consider problematic ?
 
  In fact, the requirement probably shall be 'no swapout allowed'. E.g.,
  the busy page is fully qualified to be used together with pmap_qenter().
 
 Well, I just follow description above pmap_qenter() blindly and test for
 page wire_count inside the function. All function calls, I mentioned before,
 are OK as pages are VPO_BUSY'd. Thanks for your explanation.
Still, I am not aware of any (direct) calls to pmap_qenter in the mentioned
functions.


pgpY4rX1RvV03.pgp
Description: PGP signature

Re: cvsup broken on amd64?

2011-10-06 Thread Kostik Belousov

On Wed, Oct 05, 2011 at 03:21:45PM -0700, David O'Brien wrote:
 On Fri, Sep 09, 2011 at 06:00:02PM +0300, Kostik Belousov wrote:
  --- libs/m3core/src/thread/POSIX/ThreadPosix.m3.orig2011-09-09 
  17:58:12.867431639 +0300
  +++ libs/m3core/src/thread/POSIX/ThreadPosix.m3 2011-09-09 
  17:58:30.380428486 +0300
  @@ -180,7 +180,7 @@
 pausedThreads : T;
 selected_interval:= UTime{0, 100 * 1000};
   
  -  defaultStackSize := 3000;
  +  defaultStackSize := 1;
 
 This might not be a large enough value (depending on the unit of measure).
 
 I synced tzdata+tzcode at $WORK and we found the amount of stack used by
 tzload() alone is now quite large -- 41k on ARM.

Please see r225677.


pgpjpUkg0NdSp.pgp
Description: PGP signature

Re: pmap_qenter() - the page must be wired - is violated

2011-10-05 Thread Kostik Belousov

On Wed, Oct 05, 2011 at 02:28:01PM +0200, Svatopluk Kraus wrote:
 Hi,
 
   I found out that on a few places pmap_qenter() is called on pages
 which are not wired. For example, in the following functions, when
 vm_pager_get_pages() is called, the pages are not wired:
 
   exec_map_first_page() in sys/kern/kern_exec.c
   vm_fault_hold() in sys/vm/vm_fault.c
   vm_imgact_hold_page() in sys/vm/vm_glue.c
   vm_object_populate() in sys/vm/vm_object.c
   mdstart_swap() in sys/dev/md/md.c
 
   Is the rule violated or the rule should be changed?

Lets first discuss where did you found the calls to pmap_qenter().
Can you point out exact line numbers of the calls to pmap_qenter()
that you consider problematic ?

In fact, the requirement probably shall be 'no swapout allowed'. E.g.,
the busy page is fully qualified to be used together with pmap_qenter().


pgpbn7etyrLfS.pgp
Description: PGP signature

Re: st_dev and st_ino for pipes

2011-10-04 Thread Kostik Belousov

On Tue, Oct 04, 2011 at 05:37:27PM +1100, Peter Jeremy wrote:
 On 2011-Oct-03 01:04:05 +0300, Kostik Belousov kostik...@gmail.com wrote:
 Our implementation of pipes does not provide useful values for st_dev
 and st_ino when stat(2) is done on an anonymous pipe. It was noted by the
 ...
 Patch below implements the requirement, by the cost of the small overhead
 at the pipe creation time, and slightly bigger cost at the destruction.
 
 Does it need to be so complex?  This information isn't needed by the
 kernel and, to be meaningful, all that is required is that the
 (st_dev,st_ino) pair is unique within the system.  Given this,
 wouldn't it be sufficient to fake up a st_dev and then just make
 st_ino be a counter that starts from 0 and increments (atomically?) on
 every new pipe?  No need to retain state or free anything when the
 pipe is destroyed.  (If necessary, pick a new fake st_dev when st_ino
 wraps).
Yes, you described the problem. The (st_dev, st_ino) pair must
be globally unique in system, not only for the pipes, but for whole
domain of file descriptors. This is the reason that I allocate a value
for pipe st_dev using the same devfs mechanism as it is done for device
numbers. Using simple incrementing counter for pipe inodes, together with
allocating yet another st_dev gives essentially the same complications
wrt KPI, and feels unclean.

 
 --- a/sys/kern/sys_pipe.c
 +++ b/sys/kern/sys_pipe.c
 ...
 +static ino_t pipedev_ino;
 ..
 +ub-st_dev = pipedev_ino;
 
 st_dev is a dev_t and hence pipedev_ino (which seems misnamed to me)
 should probably be dev_t rather than ino_t
Fixed.

Thank you.


pgpvSZLW4YilD.pgp
Description: PGP signature

Re: x220 notes

2011-10-04 Thread Kostik Belousov

On Tue, Oct 04, 2011 at 12:26:27PM -0700, matt wrote:
 On Tue, 4 Oct 2011 02:21:59 -0400 (EDT)
 Benjamin Kaduk ka...@mit.edu wrote:
 
  On Mon, 3 Oct 2011, matt wrote:
  
   Ultimately, I think if we can set backlight, we can fix the screen
   after resume...I think it's just the backlight is low/off on
   resume...
  
  Can you use a very strong fronglight to see if this is actually the
  case? (Illumination angle may be important as well.)  The state of
  the liquid crystal polarizing switches should be visible under
  sufficient illumination, and whether they change with input would be
  pretty definitive.
  I actually used to operate my iBook G4 in frontlight mode in bright 
  sunlight for a while (until the wifi card flaked out, tying it to my 
  desk).
  
  -Ben Kaduk
 
 It's the IPS display, so there's a bit of glare...not much visible. I
 can't tell for sure...
 
 Friend had idea of trying to resume under X with external
 VGA display...both remained off, however I am using the Intel GPU
 experimental drivers (To make a note, resume is broken without them as
 well).
Suspend/resume is not implemented in the KMS driver (yet).

 
 I can't get vbetool from ports to do anything with the display other
 than cause it to turn off.
 
 
 I can type commands just fine under X or console with the display
 off...so I have tried xset, vbetool. hw.acpi.reset_video causes a
 majorly bad reboot loop on resume with a flashing thinklight.
 
 Matt
 ___
 freebsd-current@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-current
 To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


pgpFzmFrenynw.pgp
Description: PGP signature

Re: x220 notes

2011-10-04 Thread Kostik Belousov

On Tue, Oct 04, 2011 at 02:38:06PM -0700, matt wrote:
 
   Friend had idea of trying to resume under X with external
   VGA display...both remained off, however I am using the Intel GPU
   experimental drivers (To make a note, resume is broken without them
   as well).
  Suspend/resume is not implemented in the KMS driver (yet).
  
 
 On x220 problem is deeper than KMS, since it exists without
 KMS patches applied. Thank you for graphics support! Even compiz works,
 although second output stuff is overlapped when extending on this
No idea what do you mean.

 one...also lots of update_fbc, but this is probably known.
Write 0 into sysctl hw.dri.debug to get rid of KMS debugging output.

 
 There are so far about 4 patches required agains HEAD for this machine
 to be usable...KMS, acpi-ibm patches, iwn patches, and currently
 newvers.sh.
 
 I have done testing with all these rolled back, problem is still
 there...I think it could be UEFI/ACPI related? vbetool, dpms.ko,
 acpi_video, none can talk to the monitor. Does anyone know if
 system/6636 was fixed for OpenBSD? 
 
  Matt


pgpmwf2l1fljv.pgp
Description: PGP signature

st_dev and st_ino for pipes

2011-10-02 Thread Kostik Belousov

Our implementation of pipes does not provide useful values for st_dev
and st_ino when stat(2) is done on an anonymous pipe. It was noted by the
people outside the project, e.g. Perl contains a workaround in one
of its modules, submitted by Debian/kFreeBSD developers, see
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=537555
and the commit 16f708c9bc0dc48713b200 in the Perl git.

I think this is a non-conformance, since SUSv4 explicitely states
in the description of stat(2)
For all other file types defined in this volume of POSIX.1-2008, the
structure members st_mode, st_ino, st_dev, st_uid, st_gid, st_atim,
st_ctim, and st_mtim shall have meaningful values 

Patch below implements the requirement, by the cost of the small overhead
at the pipe creation time, and slightly bigger cost at the destruction.

Any comments ?

diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
index a2d3222..a111527 100644
--- a/sys/fs/devfs/devfs_devs.c
+++ b/sys/fs/devfs/devfs_devs.c
@@ -171,8 +171,7 @@ devfs_free(struct cdev *cdev)
cdp = cdev2priv(cdev);
if (cdev-si_cred != NULL)
crfree(cdev-si_cred);
-   if (cdp-cdp_inode  0)
-   free_unr(devfs_inos, cdp-cdp_inode);
+   devfs_free_cdp_inode(cdp-cdp_inode);
if (cdp-cdp_maxdirent  0) 
free(cdp-cdp_dirents, M_DEVFS2);
free(cdp, M_CDEVP);
@@ -394,7 +393,7 @@ devfs_delete(struct devfs_mount *dm, struct devfs_dirent 
*de, int flags)
mac_devfs_destroy(de);
 #endif
if (de-de_inode  DEVFS_ROOTINO) {
-   free_unr(devfs_inos, de-de_inode);
+   devfs_free_cdp_inode(de-de_inode);
de-de_inode = 0;
}
if (DEVFS_DE_DROP(de))
@@ -685,6 +684,21 @@ devfs_destroy(struct cdev *dev)
devfs_generation++;
 }
 
+ino_t
+devfs_alloc_cdp_inode(void)
+{
+
+   return (alloc_unr(devfs_inos));
+}
+
+void
+devfs_free_cdp_inode(ino_t ino)
+{
+
+   if (ino  0)
+   free_unr(devfs_inos, ino);
+}
+
 static void
 devfs_devs_init(void *junk __unused)
 {
diff --git a/sys/kern/sys_pipe.c b/sys/kern/sys_pipe.c
index b7ae521..2edecf2 100644
--- a/sys/kern/sys_pipe.c
+++ b/sys/kern/sys_pipe.c
@@ -93,6 +93,7 @@ __FBSDID($FreeBSD$);
 
 #include sys/param.h
 #include sys/systm.h
+#include sys/conf.h
 #include sys/fcntl.h
 #include sys/file.h
 #include sys/filedesc.h
@@ -224,6 +225,8 @@ static int  pipe_zone_init(void *mem, int size, int flags);
 static voidpipe_zone_fini(void *mem, int size);
 
 static uma_zone_t pipe_zone;
+static struct unrhdr *pipeino_unr;
+static ino_t pipedev_ino;
 
 SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_ANY, pipeinit, NULL);
 
@@ -235,6 +238,10 @@ pipeinit(void *dummy __unused)
pipe_zone_ctor, NULL, pipe_zone_init, pipe_zone_fini,
UMA_ALIGN_PTR, 0);
KASSERT(pipe_zone != NULL, (pipe_zone not initialized));
+   pipeino_unr = new_unrhdr(1, INT32_MAX, NULL);
+   KASSERT(pipeino_unr != NULL, (pipe fake inodes not initialized));
+   pipedev_ino = devfs_alloc_cdp_inode();
+   KASSERT(pipedev_ino  0, (pipe dev inode not initialized));
 }
 
 static int
@@ -562,6 +569,12 @@ pipe_create(pipe, backing)
/* If we're not backing this pipe, no need to do anything. */
error = 0;
}
+   if (error == 0) {
+   pipe-pipe_ino = alloc_unr(pipeino_unr);
+   if (pipe-pipe_ino == -1)
+   /* pipeclose will clear allocated kva */
+   error = ENOMEM;
+   }
return (error);
 }
 
@@ -1408,9 +1421,10 @@ pipe_stat(fp, ub, active_cred, td)
ub-st_ctim = pipe-pipe_ctime;
ub-st_uid = fp-f_cred-cr_uid;
ub-st_gid = fp-f_cred-cr_gid;
+   ub-st_dev = pipedev_ino;
+   ub-st_ino = pipe-pipe_ino;
/*
-* Left as 0: st_dev, st_ino, st_nlink, st_rdev, st_flags, st_gen.
-* XXX (st_dev, st_ino) should be unique.
+* Left as 0: st_nlink, st_rdev, st_flags, st_gen.
 */
return (0);
 }
@@ -1463,6 +1477,7 @@ pipeclose(cpipe)
 {
struct pipepair *pp;
struct pipe *ppipe;
+   ino_t ino;
 
KASSERT(cpipe != NULL, (pipeclose: cpipe == NULL));
 
@@ -1521,6 +1536,12 @@ pipeclose(cpipe)
knlist_destroy(cpipe-pipe_sel.si_note);
 
/*
+* Postpone the destroy of the fake inode number allocated for
+* our end, until pipe mtx is unlocked.
+*/
+   ino = cpipe-pipe_ino;
+
+   /*
 * If both endpoints are now closed, release the memory for the
 * pipe pair.  If not, unlock.
 */
@@ -1532,6 +1553,9 @@ pipeclose(cpipe)
uma_zfree(pipe_zone, cpipe-pipe_pair);
} else
PIPE_UNLOCK(cpipe);
+
+   if (ino  0)
+   free_unr(pipeino_unr, cpipe-pipe_ino);
 }
 
 /*ARGSUSED*/
diff --git a/sys/sys/conf.h b/sys/sys/conf.h
index 08e1582..7acd2e0 100644
--- a/sys/sys/conf.h
+++ b/sys/sys/conf.h
@@

Re: stable/9 r225827 i386 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero

2011-09-29 Thread Kostik Belousov

On Thu, Sep 29, 2011 at 02:52:31PM +0300, Alexandr Kovalenko wrote:
 Hello!
 
 I'm running 9.0-BETA3 (r225827) and now rebuilding all my 1215 ports
 (I've upgraded from 8.2). I'm getting panic. Is it known
 problem/already fixed somewhere?
 
 FreeBSD mile.xxx.ua 9.0-BETA3 FreeBSD 9.0-BETA3 #0 r225827: Wed Sep 28
 17:11:17 EEST 2011 r...@mile.xxx.ua:/usr/obj/usr/src/sys/mile-9
 i386
 
 Unread portion of the kernel message buffer:
 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero
 cpuid = 1
 Uptime: 16h6m53s
 Physical memory: 1904 MB
 Dumping 367 MB: 352 336 320 304 288 272 256 240 224 208 192 176 160
 144 128 112 96 80 64 48 32 16
 
 #0  doadump (textdump=1) at pcpu.h:244
 #1  0xc071e5cb in kern_reboot (howto=260)
 at /usr/src/sys/kern/kern_shutdown.c:442
 #2  0xc071e82b in panic (fmt=Variable fmt is not available.
 ) at /usr/src/sys/kern/kern_shutdown.c:607
 #3  0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0)
 at /usr/src/sys/vm/vm_page.c:1905
Please do frame 2, then p/x *m and show the result.

 #4  0xc0796b80 in vfs_vmio_release (bp=0xde8bcbf4)
 at /usr/src/sys/kern/vfs_bio.c:1638
 #5  0xc0798813 in getnewbuf (vp=0xc6ea3550, slpflag=0, slptimeo=0,
 size=16384, maxsize=16384, gbflags=0) at /usr/src/sys/kern/vfs_bio.c:1949
 #6  0xc0799f2a in getblk (vp=0xc6ea3550, blkno=2520, size=16384, slpflag=0,
 slptimeo=0, flags=Variable flags is not available.
 ) at /usr/src/sys/kern/vfs_bio.c:2788
 #7  0xc079d49c in cluster_rbuild (vp=0xc6ea3550, filesize=44505088, lbn=2520,
 blkno=1209440, size=16384, run=Variable run is not available.
 ) at /usr/src/sys/kern/vfs_cluster.c:332
 #8  0xc079e145 in cluster_read (vp=0xc6ea3550, filesize=44505088,
 lblkno=2520, size=16384, cred=0x0, totread=1024, seqcount=7,
 bpp=0xf5824b60) at /usr/src/sys/kern/vfs_cluster.c:254
 #9  0xc0934cf5 in ffs_read (ap=0xf5824bac)
 at /usr/src/sys/ufs/ffs/ffs_vnops.c:514
 #10 0xc09ccb92 in VOP_READ_APV (vop=0xc0aa6a80, a=0xf5824bac)
 at vnode_if.c:887
 #11 0xc07c1120 in vn_read (fp=0xc5474508, uio=0xf5824c48,
 active_cred=0xc56a4d80, flags=1, td=0xc5b76b80) at vnode_if.h:384
 #12 0xc076380e in dofileread (td=0xc5b76b80, fd=3, fp=0xc5474508,
 auio=0xf5824c48, offset=41189376, flags=1) at file.h:254
 #13 0xc07639f5 in kern_preadv (td=0xc5b76b80, fd=3, auio=0xf5824c48,
 offset=41189376) at /usr/src/sys/kern/sys_generic.c:288
 #14 0xc0763b0d in sys_pread (td=0xc5b76b80, uap=0xf5824cec)
 at /usr/src/sys/kern/sys_generic.c:189
 #15 0xc09accf5 in syscall (frame=0xf5824d28) at subr_syscall.c:131
 #16 0xc0996db1 in Xint0x80_syscall ()
 at /usr/src/sys/i386/i386/exception.s:266
 #17 0x0033 in ?? ()
 Previous frame inner to this frame (corrupt stack?)
 
 -- 
 Alexandr Kovalenko
 http://uafug.org.ua/
 ___
 freebsd-sta...@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-stable
 To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


pgpcZFnkVkBAg.pgp
Description: PGP signature

Re: stable/9 r225827 i386 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero

2011-09-29 Thread Kostik Belousov

On Thu, Sep 29, 2011 at 03:47:19PM +0300, Alexandr Kovalenko wrote:
 On Thu, Sep 29, 2011 at 3:30 PM, Kostik Belousov kostik...@gmail.com wrote:
  On Thu, Sep 29, 2011 at 02:52:31PM +0300, Alexandr Kovalenko wrote:
  Hello!
 
  I'm running 9.0-BETA3 (r225827) and now rebuilding all my 1215 ports
  (I've upgraded from 8.2). I'm getting panic. Is it known
  problem/already fixed somewhere?
 
  FreeBSD mile.xxx.ua 9.0-BETA3 FreeBSD 9.0-BETA3 #0 r225827: Wed Sep 28
  17:11:17 EEST 2011     r...@mile.xxx.ua:/usr/obj/usr/src/sys/mile-9
  i386
 
  Unread portion of the kernel message buffer:
  panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero
  cpuid = 1
  Uptime: 16h6m53s
  Physical memory: 1904 MB
  Dumping 367 MB: 352 336 320 304 288 272 256 240 224 208 192 176 160
  144 128 112 96 80 64 48 32 16
 
  #0  doadump (textdump=1) at pcpu.h:244
  #1  0xc071e5cb in kern_reboot (howto=260)
      at /usr/src/sys/kern/kern_shutdown.c:442
  #2  0xc071e82b in panic (fmt=Variable fmt is not available.
  ) at /usr/src/sys/kern/kern_shutdown.c:607
  #3  0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0)
      at /usr/src/sys/vm/vm_page.c:1905
  Please do frame 2, then p/x *m and show the result.
 
 (kgdb) frame 2
frame 3, sorry. p/x *(struct vm_page *)0xc2a38dc8 will do it as well.

 #2  0xc071e82b in panic (fmt=Variable fmt is not available.) at
 /usr/src/sys/kern/kern_shutdown.c:607
 607 kern_reboot(bootopt);
 (kgdb) p/x *m
 No symbol m in current context.
 
 
  #4  0xc0796b80 in vfs_vmio_release (bp=0xde8bcbf4)
      at /usr/src/sys/kern/vfs_bio.c:1638
  #5  0xc0798813 in getnewbuf (vp=0xc6ea3550, slpflag=0, slptimeo=0,
      size=16384, maxsize=16384, gbflags=0) at 
  /usr/src/sys/kern/vfs_bio.c:1949
  #6  0xc0799f2a in getblk (vp=0xc6ea3550, blkno=2520, size=16384, slpflag=0,
      slptimeo=0, flags=Variable flags is not available.
  ) at /usr/src/sys/kern/vfs_bio.c:2788
  #7  0xc079d49c in cluster_rbuild (vp=0xc6ea3550, filesize=44505088, 
  lbn=2520,
      blkno=1209440, size=16384, run=Variable run is not available.
  ) at /usr/src/sys/kern/vfs_cluster.c:332
  #8  0xc079e145 in cluster_read (vp=0xc6ea3550, filesize=44505088,
      lblkno=2520, size=16384, cred=0x0, totread=1024, seqcount=7,
      bpp=0xf5824b60) at /usr/src/sys/kern/vfs_cluster.c:254
  #9  0xc0934cf5 in ffs_read (ap=0xf5824bac)
      at /usr/src/sys/ufs/ffs/ffs_vnops.c:514
  #10 0xc09ccb92 in VOP_READ_APV (vop=0xc0aa6a80, a=0xf5824bac)
      at vnode_if.c:887
  #11 0xc07c1120 in vn_read (fp=0xc5474508, uio=0xf5824c48,
      active_cred=0xc56a4d80, flags=1, td=0xc5b76b80) at vnode_if.h:384
  #12 0xc076380e in dofileread (td=0xc5b76b80, fd=3, fp=0xc5474508,
      auio=0xf5824c48, offset=41189376, flags=1) at file.h:254
  #13 0xc07639f5 in kern_preadv (td=0xc5b76b80, fd=3, auio=0xf5824c48,
      offset=41189376) at /usr/src/sys/kern/sys_generic.c:288
  #14 0xc0763b0d in sys_pread (td=0xc5b76b80, uap=0xf5824cec)
      at /usr/src/sys/kern/sys_generic.c:189
  #15 0xc09accf5 in syscall (frame=0xf5824d28) at subr_syscall.c:131
  #16 0xc0996db1 in Xint0x80_syscall ()
      at /usr/src/sys/i386/i386/exception.s:266
  #17 0x0033 in ?? ()
  Previous frame inner to this frame (corrupt stack?)
 
  --
  Alexandr Kovalenko
  http://uafug.org.ua/
  ___
  freebsd-sta...@freebsd.org mailing list
  http://lists.freebsd.org/mailman/listinfo/freebsd-stable
  To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
 
 
 
 
 -- 
 Alexandr Kovalenko
 http://uafug.org.ua/


pgplpD93nm6tL.pgp
Description: PGP signature

Re: stable/9 r225827 i386 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero

2011-09-29 Thread Kostik Belousov

On Thu, Sep 29, 2011 at 03:51:53PM +0300, Alexandr Kovalenko wrote:
 2011/9/29 Kostik Belousov kostik...@gmail.com:
  On Thu, Sep 29, 2011 at 03:47:19PM +0300, Alexandr Kovalenko wrote:
  On Thu, Sep 29, 2011 at 3:30 PM, Kostik Belousov kostik...@gmail.com 
  wrote:
   On Thu, Sep 29, 2011 at 02:52:31PM +0300, Alexandr Kovalenko wrote:
   Hello!
  
   I'm running 9.0-BETA3 (r225827) and now rebuilding all my 1215 ports
   (I've upgraded from 8.2). I'm getting panic. Is it known
   problem/already fixed somewhere?
Do you use custom kernel config ? Is there a chance you have ZERO_COPY_SOCKETS
option enabled ?

  
   FreeBSD mile.xxx.ua 9.0-BETA3 FreeBSD 9.0-BETA3 #0 r225827: Wed Sep 28
   17:11:17 EEST 2011     r...@mile.xxx.ua:/usr/obj/usr/src/sys/mile-9
   i386
  
   Unread portion of the kernel message buffer:
   panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero
   cpuid = 1
   Uptime: 16h6m53s
   Physical memory: 1904 MB
   Dumping 367 MB: 352 336 320 304 288 272 256 240 224 208 192 176 160
   144 128 112 96 80 64 48 32 16
  
   #0  doadump (textdump=1) at pcpu.h:244
   #1  0xc071e5cb in kern_reboot (howto=260)
       at /usr/src/sys/kern/kern_shutdown.c:442
   #2  0xc071e82b in panic (fmt=Variable fmt is not available.
   ) at /usr/src/sys/kern/kern_shutdown.c:607
   #3  0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0)
       at /usr/src/sys/vm/vm_page.c:1905
   Please do frame 2, then p/x *m and show the result.
 
  (kgdb) frame 2
  frame 3, sorry. p/x *(struct vm_page *)0xc2a38dc8 will do it as well.
 
 (kgdb) frame 3
 #3  0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0) at
 /usr/src/sys/vm/vm_page.c:1905
 1905panic(vm_page_unwire: page %p's wire count is
 zero, m);
 (kgdb) p/x *(struct vm_page *)0xc2a38dc8
 $1 = {pageq = {tqe_next = 0xc2a38e10, tqe_prev = 0xc282a2b0}, listq =
 {tqe_next = 0xc2a38e10, tqe_prev = 0xc282a2b8}, left = 0x0, right =
 0x0, object = 0xc5725770, pindex = 0xbd3, phys_addr = 0x56a32000, md =
 {pv_list = {tqh_first = 0xc3cc6418, tqh_last = 0xc3cc641c},
 pat_mode = 0x6}, queue = 0x1, segind = 0x2, hold_count = 0x0,
 order = 0xb, pool = 0x0, cow = 0x0, wire_count = 0x0, aflags = 0x3,
 flags = 0x0, oflags = 0x0, act_count = 0x5, busy = 0x0, valid = 0xff,
 dirty = 0xff}

Please show the output of p *(struct vm_object *)0xc5725770 from kgdb.
 
 
  #2  0xc071e82b in panic (fmt=Variable fmt is not available.) at
  /usr/src/sys/kern/kern_shutdown.c:607
  607             kern_reboot(bootopt);
  (kgdb) p/x *m
  No symbol m in current context.
 
 
   #4  0xc0796b80 in vfs_vmio_release (bp=0xde8bcbf4)
       at /usr/src/sys/kern/vfs_bio.c:1638
   #5  0xc0798813 in getnewbuf (vp=0xc6ea3550, slpflag=0, slptimeo=0,
       size=16384, maxsize=16384, gbflags=0) at 
   /usr/src/sys/kern/vfs_bio.c:1949
   #6  0xc0799f2a in getblk (vp=0xc6ea3550, blkno=2520, size=16384, 
   slpflag=0,
       slptimeo=0, flags=Variable flags is not available.
   ) at /usr/src/sys/kern/vfs_bio.c:2788
   #7  0xc079d49c in cluster_rbuild (vp=0xc6ea3550, filesize=44505088, 
   lbn=2520,
       blkno=1209440, size=16384, run=Variable run is not available.
   ) at /usr/src/sys/kern/vfs_cluster.c:332
   #8  0xc079e145 in cluster_read (vp=0xc6ea3550, filesize=44505088,
       lblkno=2520, size=16384, cred=0x0, totread=1024, seqcount=7,
       bpp=0xf5824b60) at /usr/src/sys/kern/vfs_cluster.c:254
   #9  0xc0934cf5 in ffs_read (ap=0xf5824bac)
       at /usr/src/sys/ufs/ffs/ffs_vnops.c:514
   #10 0xc09ccb92 in VOP_READ_APV (vop=0xc0aa6a80, a=0xf5824bac)
       at vnode_if.c:887
   #11 0xc07c1120 in vn_read (fp=0xc5474508, uio=0xf5824c48,
       active_cred=0xc56a4d80, flags=1, td=0xc5b76b80) at vnode_if.h:384
   #12 0xc076380e in dofileread (td=0xc5b76b80, fd=3, fp=0xc5474508,
       auio=0xf5824c48, offset=41189376, flags=1) at file.h:254
   #13 0xc07639f5 in kern_preadv (td=0xc5b76b80, fd=3, auio=0xf5824c48,
       offset=41189376) at /usr/src/sys/kern/sys_generic.c:288
   #14 0xc0763b0d in sys_pread (td=0xc5b76b80, uap=0xf5824cec)
       at /usr/src/sys/kern/sys_generic.c:189
   #15 0xc09accf5 in syscall (frame=0xf5824d28) at subr_syscall.c:131
   #16 0xc0996db1 in Xint0x80_syscall ()
       at /usr/src/sys/i386/i386/exception.s:266
   #17 0x0033 in ?? ()
   Previous frame inner to this frame (corrupt stack?)
  
   --
   Alexandr Kovalenko
   http://uafug.org.ua/
   ___
   freebsd-sta...@freebsd.org mailing list
   http://lists.freebsd.org/mailman/listinfo/freebsd-stable
   To unsubscribe, send any mail to 
   freebsd-stable-unsubscr...@freebsd.org
  
 
 
 
  --
  Alexandr Kovalenko
  http://uafug.org.ua/
 
 
 
 
 -- 
 Alexandr Kovalenko
 http://uafug.org.ua/


pgpPkSqzkhr3F.pgp
Description: PGP signature

Re: stable/9 r225827 i386 panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero

2011-09-29 Thread Kostik Belousov

On Thu, Sep 29, 2011 at 04:12:19PM +0300, Alexandr Kovalenko wrote:
 2011/9/29 Kostik Belousov kostik...@gmail.com:
  On Thu, Sep 29, 2011 at 03:51:53PM +0300, Alexandr Kovalenko wrote:
  2011/9/29 Kostik Belousov kostik...@gmail.com:
   On Thu, Sep 29, 2011 at 03:47:19PM +0300, Alexandr Kovalenko wrote:
   On Thu, Sep 29, 2011 at 3:30 PM, Kostik Belousov kostik...@gmail.com 
   wrote:
On Thu, Sep 29, 2011 at 02:52:31PM +0300, Alexandr Kovalenko wrote:
Hello!
   
I'm running 9.0-BETA3 (r225827) and now rebuilding all my 1215 ports
(I've upgraded from 8.2). I'm getting panic. Is it known
problem/already fixed somewhere?
  Do you use custom kernel config ? Is there a chance you have 
  ZERO_COPY_SOCKETS
  option enabled ?
 
 Yes, ZERO_COPY_SOCKETS is there.
Ok, this is the cause. Remove it.

I asked for some additional data below, which you ignored, but I believe
that I will not see anything new there, after we found the ZERO_COPY_SOCKETS
in kernel config.

 
 
 
 
   
FreeBSD mile.xxx.ua 9.0-BETA3 FreeBSD 9.0-BETA3 #0 r225827: Wed Sep 
28
17:11:17 EEST 2011     r...@mile.xxx.ua:/usr/obj/usr/src/sys/mile-9
i386
   
Unread portion of the kernel message buffer:
panic: vm_page_unwire: page 0xc2a38dc8's wire count is zero
cpuid = 1
Uptime: 16h6m53s
Physical memory: 1904 MB
Dumping 367 MB: 352 336 320 304 288 272 256 240 224 208 192 176 160
144 128 112 96 80 64 48 32 16
   
#0  doadump (textdump=1) at pcpu.h:244
#1  0xc071e5cb in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:442
#2  0xc071e82b in panic (fmt=Variable fmt is not available.
) at /usr/src/sys/kern/kern_shutdown.c:607
#3  0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0)
    at /usr/src/sys/vm/vm_page.c:1905
Please do frame 2, then p/x *m and show the result.
  
   (kgdb) frame 2
   frame 3, sorry. p/x *(struct vm_page *)0xc2a38dc8 will do it as well.
 
  (kgdb) frame 3
  #3  0xc0966903 in vm_page_unwire (m=0xc2a38dc8, activate=0) at
  /usr/src/sys/vm/vm_page.c:1905
  1905                    panic(vm_page_unwire: page %p's wire count is
  zero, m);
  (kgdb) p/x *(struct vm_page *)0xc2a38dc8
  $1 = {pageq = {tqe_next = 0xc2a38e10, tqe_prev = 0xc282a2b0}, listq =
  {tqe_next = 0xc2a38e10, tqe_prev = 0xc282a2b8}, left = 0x0, right =
  0x0, object = 0xc5725770, pindex = 0xbd3, phys_addr = 0x56a32000, md =
  {pv_list = {tqh_first = 0xc3cc6418, tqh_last = 0xc3cc641c},
      pat_mode = 0x6}, queue = 0x1, segind = 0x2, hold_count = 0x0,
  order = 0xb, pool = 0x0, cow = 0x0, wire_count = 0x0, aflags = 0x3,
  flags = 0x0, oflags = 0x0, act_count = 0x5, busy = 0x0, valid = 0xff,
  dirty = 0xff}
 
  Please show the output of p *(struct vm_object *)0xc5725770 from kgdb.
 
 
   #2  0xc071e82b in panic (fmt=Variable fmt is not available.) at
   /usr/src/sys/kern/kern_shutdown.c:607
   607             kern_reboot(bootopt);
   (kgdb) p/x *m
   No symbol m in current context.
  
  
#4  0xc0796b80 in vfs_vmio_release (bp=0xde8bcbf4)
    at /usr/src/sys/kern/vfs_bio.c:1638
#5  0xc0798813 in getnewbuf (vp=0xc6ea3550, slpflag=0, slptimeo=0,
    size=16384, maxsize=16384, gbflags=0) at 
/usr/src/sys/kern/vfs_bio.c:1949
#6  0xc0799f2a in getblk (vp=0xc6ea3550, blkno=2520, size=16384, 
slpflag=0,
    slptimeo=0, flags=Variable flags is not available.
) at /usr/src/sys/kern/vfs_bio.c:2788
#7  0xc079d49c in cluster_rbuild (vp=0xc6ea3550, filesize=44505088, 
lbn=2520,
    blkno=1209440, size=16384, run=Variable run is not available.
) at /usr/src/sys/kern/vfs_cluster.c:332
#8  0xc079e145 in cluster_read (vp=0xc6ea3550, filesize=44505088,
    lblkno=2520, size=16384, cred=0x0, totread=1024, seqcount=7,
    bpp=0xf5824b60) at /usr/src/sys/kern/vfs_cluster.c:254
#9  0xc0934cf5 in ffs_read (ap=0xf5824bac)
    at /usr/src/sys/ufs/ffs/ffs_vnops.c:514
#10 0xc09ccb92 in VOP_READ_APV (vop=0xc0aa6a80, a=0xf5824bac)
    at vnode_if.c:887
#11 0xc07c1120 in vn_read (fp=0xc5474508, uio=0xf5824c48,
    active_cred=0xc56a4d80, flags=1, td=0xc5b76b80) at vnode_if.h:384
#12 0xc076380e in dofileread (td=0xc5b76b80, fd=3, fp=0xc5474508,
    auio=0xf5824c48, offset=41189376, flags=1) at file.h:254
#13 0xc07639f5 in kern_preadv (td=0xc5b76b80, fd=3, auio=0xf5824c48,
    offset=41189376) at /usr/src/sys/kern/sys_generic.c:288
#14 0xc0763b0d in sys_pread (td=0xc5b76b80, uap=0xf5824cec)
    at /usr/src/sys/kern/sys_generic.c:189
#15 0xc09accf5 in syscall (frame=0xf5824d28) at subr_syscall.c:131
#16 0xc0996db1 in Xint0x80_syscall ()
    at /usr/src/sys/i386/i386/exception.s:266
#17 0x0033 in ?? ()
Previous frame inner to this frame (corrupt stack?)
   
--
Alexandr Kovalenko
http://uafug.org.ua/
___
freebsd-sta...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo

Re: ia64 r225789 panic during make installworld: Bad buffer logic, remain = 0

2011-09-28 Thread Kostik Belousov

On Wed, Sep 28, 2011 at 04:27:39PM +0300, Jaakko Heinonen wrote:
 On 2011-09-28, Anton Shterenlikht wrote:
  KDB: stack backtrace:
  getenv with the following non-sleepable locks held:
  exclusive sleep mutex vnode interlock (vnode interlock) r = 0 
  (0xe00011950488) locked @ /usr/src/sys/fs/devfs/devfs_vnops.c:406
  
  etc. until a hang, requiring cold reset via MP.
 
 Someone is calling getenv with a vnode interlock held. You need to
 figure out the caller. Unfortunately the backtrace is missing above.
 
 As a temporary workaround you could comment the WITNESS_WARN() line in
 getenv() (sys/kern/kern_environment.c) but it is not a real fix.

I do not think that this is the real cause of the panic. Line 406
in devfs_vnops.c belongs to devfs_allocv(), and vnode interlock taken
there must be consumed by LK_INTERLOCK call to vget(). The getenv()
cannot be called from the vget() or two unlock calls between lines
406 and 409.

It seems there is something broken elsewere.


pgpP9hKdJwqiy.pgp
Description: PGP signature

Re: Choosing between DELAY(useconds) and pause()

2011-09-27 Thread Kostik Belousov

On Tue, Sep 27, 2011 at 10:39:44AM -0600, Julian Elischer wrote:
 On 9/27/11 4:12 AM, Gavin Atkinson wrote:
 On Mon, 2011-09-26 at 09:30 -0400, John Baldwin wrote:
 On Friday, September 23, 2011 11:21:06 am Gavin Atkinson wrote:
 On Thu, 2011-09-22 at 20:07 +0200, Hans Petter Selasky wrote:
 On Thursday 22 September 2011 19:55:23 David Somayajulu wrote:
 It appears that the pause() function cannot be used in driver functions
 which are invoked early in the boot process. Is there is a kernel api
 which a device driver can use to determine whether to use pause() or
 DELAY(), for delays which are say greater than 10hz - may be even 1 hz 
 ?
 Maybe you want to use something like this:
 
 if (cold)
   DELAY()
 else
   pause()
 
 In your code.
 Note that this still shouldn't be done in your suspend/resume paths, as
 cold isn't set there, however there also appears to be no guarantee
 that pause() will ever return (as you could be running after the timer
 has been suspended, or before it resumes).
 
 I'm not sure what the correct answer is for suspend/resume code.
 Hmmm, on x86 the timers are explicitly shutdown after the DEVICE_SUSPEND()
 pass over the tree and re-enabled before DEVICE_RESUME().  Perhaps this 
 has
 changed in HEAD though with the eventtimers stuff.  I do think it is best
 however, to use DELAY() in the suspend/resume path always regardless.
 I don't think head is any different from stable/8 in this respect - the
 same hack patch that fixes suspend/resume for me on head also fixes it
 on stable/8 (the patch basically fakes cold during USB
 suspend/resume).  See my email to -usb a few months ago:
 http://docs.FreeBSD.org/cgi/mid.cgi?alpine.LNX.2.00.1106041548370.26975
 
 I'd really like some guidance as to the correct solution to this, I have
 four separate laptops which fail out of the box on 8 and 9, but
 suspend/resume perfectly with this hack.
 
 code for timers should have a generally readable state that says if 
 they are useable or not, and we should test that instead of 'cold'

I think that this should be encapsulated into the usable function,
instead of being directly exposed to the driver authors.
With the my lack of imagination, I propose driver_delay() that
would do if (cold) ... inside.


pgp30errILlUx.pgp
Description: PGP signature

Re: truss

2011-09-19 Thread Kostik Belousov

On Mon, Sep 19, 2011 at 04:03:42PM +, Anton Yuzhaninov wrote:
 On Mon, 19 Sep 2011 15:00:31 + (UTC), Anton Yuzhaninov wrote:
 AY On Mon, 19 Sep 2011 15:58:02 +0300, Mikolaj Golub wrote:
  AY ktrace -i for truss sleep 5
  AY http://dl.dropbox.com/u/8798217/tmp/truss_ktrace2.txt
 MG 
 MG Although ptrace(PT_TRACE_ME,0,0,0) returned 0 the process did not stop 
 after
 MG execve() and wait4() in parent (which was actually waiting for this stop)
 MG returned only after the child exit. No I idea why so far :-).
 MG 
 AY 
 AY As I understand SIGTRAP used to stop child process after execve(), but
 AY this signal ignored:
 AY 
 AY citrin:~ sleep 300 
 AY citrin:~ procstat -i 1991 | fgrep TRAP
 AY  1991 sleepTRAP -I-
 AY 
 AY Under FreeBSD 8, where ptrace works for me, this signal is not ignored:
 AY x:~ sleep 300 
 AY x:~ procstat -i 78716 | fgrep TRAP
 AY 78716 sleepTRAP ---
 
 SIGTRAP is ignored by X window manager used by me, and this is inherited 
 across
 forks/execs up to the truss.
 
 IMHO truss should restore default signal handler for SIGTRAP.
 
 With this patch truss works for me:
 
 --- usr.bin/truss/main.c(revision 225504)
 +++ usr.bin/truss/main.c(working copy)
 @@ -255,6 +255,11 @@ main(int ac, char **av)
 
 if (trussinfo-pid == 0) {  /* Start a command ourselves */
 command = av;
 +   /*
 +* SIGTRUP used to stop traced process after execve
 +* un-ignore this signal (it can be ignored by parents)
 +*/
 +   signal(SIGTRAP, SIG_DFL);
 trussinfo-pid = setup_and_wait(command);
 signal(SIGINT, SIG_IGN);
 signal(SIGTERM, SIG_IGN);
This is quite a hack. The proper fix should go in kernel, otherwise
we cannot debug programs that decided to ignore SIGTRAP. The reason
there is that tdsendsignal() does nothing for attempt to deliver ignored
signal, and kern_execve() simply tries to send a signal to self.

BTW, it seems we might also have similar issues for threads that masks
SIGTRAP, now because issignal() does nothing in that case too. But lets
handle it by small steps.

Could you, please, test the change below ? For me, I still can truss(1)
or debug with gdb after the change applied. Does truss work for you
with only this change, without resetting SIGTRAP handler in truss process ?

commit 2ae586c039a55399edc3b34cd40410e0d690a16c
Author: Konstantin Belousov kos...@pooma.home
Date:   Tue Sep 20 00:25:07 2011 +0300

Do not deliver SIGTRAP on exec as the normal signal, use ptracestop()
on syscall exit path. Otherwise, if SIGTRAP is ignored, that tdsendsignal()
do not want to deliver, and debugger never get a notification of exec.

Found by:   Anton Yuzhaninov cit...@citrin.ru

diff --git a/sys/kern/kern_exec.c b/sys/kern/kern_exec.c
index fe01142..4545848 100644
--- a/sys/kern/kern_exec.c
+++ b/sys/kern/kern_exec.c
@@ -777,16 +777,6 @@ interpret:
KNOTE_LOCKED(p-p_klist, NOTE_EXEC);
p-p_flag = ~P_INEXEC;
 
-   /*
-* If tracing the process, trap to the debugger so that
-* breakpoints can be set before the program executes.  We
-* have to use tdsignal() to deliver the signal to the current
-* thread since any other threads in this process will exit if
-* execve() succeeds.
-*/
-   if (p-p_flag  P_TRACED)
-   tdsignal(td, SIGTRAP);
-
/* clear fork but no exec flag, as we _are_ execing */
p-p_acflag = ~AFORK;
 
diff --git a/sys/kern/subr_syscall.c b/sys/kern/subr_syscall.c
index cb0d929..bba4479 100644
--- a/sys/kern/subr_syscall.c
+++ b/sys/kern/subr_syscall.c
@@ -204,9 +204,17 @@ syscallret(struct thread *td, int error, struct 
syscall_args *sa __unused)
 * is not the case, this code will need to be revisited.
 */
STOPEVENT(p, S_SCX, sa-code);
-   PTRACESTOP_SC(p, td, S_PT_SCX);
if (traced || (td-td_dbgflags  (TDB_EXEC | TDB_FORK)) != 0) {
PROC_LOCK(p);
+   /*
+* If tracing the execed process, trap to the debugger
+* so that breakpoints can be set before the program
+* executes.  If debugger requested tracing of syscall
+* returns, do it now too.
+*/
+   if (traced  ((td-td_dbgflags  TDB_EXEC) != 0 ||
+   (p-p_stops  S_PT_SCX) != 0))
+   ptracestop(td, SIGTRAP);
td-td_dbgflags = ~(TDB_SCX | TDB_EXEC | TDB_FORK);
PROC_UNLOCK(p);
}


pgpF2yup3qKFJ.pgp
Description: PGP signature

Re: cvsup broken on amd64?

2011-09-18 Thread Kostik Belousov

On Sun, Sep 18, 2011 at 12:22:53PM +0200, Oliver Lehmann wrote:
 
 Adrian Chadd adr...@freebsd.org wrote:
 
 So I've taken a look at the csup source.
 
 [...]
 
 What about this patch:
 
 [...]
 
 Oliver, would you please try that?
 
 I have a problem with cvsup, not csup - Alexander mentioned a csup problem.

Did you saw the message with the patch for tzcode I mailed to you ?


pgpxG7jeO8J6E.pgp
Description: PGP signature

Re: Segfault in libthr.so on 9.0-BETA2 (with stunnel FWIW)

2011-09-18 Thread Kostik Belousov

On Sun, Sep 18, 2011 at 01:56:50PM +0200, Jilles Tjoelker wrote:
 On Wed, Sep 14, 2011 at 11:04:56PM +0300, Kostik Belousov wrote:
  tzload() allocates ~80KB for the local variables. The backtrace you provided
  shows the nested call to tzload(), so there is total 160KB of the stack
  space consumed.
 
  By default, stack for the amd64 thread is 4MB, that should be plenty. This
  is not the case for ezm3. Possibly, stunnel also reduces the size of the
  thread stack.
 
  Please, try the patch below. I did not tested it, only compiled. I see
  that now tzload allocates only ~300 bytes on the stack.
 
 80KB seems quite a lot indeed, good to bring it down.
 
  diff --git a/contrib/tzcode/stdtime/localtime.c 
  b/contrib/tzcode/stdtime/localtime.c
  index 80b70ac..55d55e0 100644
  --- a/contrib/tzcode/stdtime/localtime.c
  +++ b/contrib/tzcode/stdtime/localtime.c
 [snip]
  @@ -406,16 +409,24 @@ register const intdoextend;
  ** to hold the longest file name string that the implementation
  ** guarantees can be opened.
  */
  -   charfullname[FILENAME_MAX + 1];
  +   char*fullname;
  +
  +   fullname = malloc(FILENAME_MAX + 1);
  +   if (fullname == NULL)
  +   goto out;
   
  if (name[0] == ':')
  ++name;
  doaccess = name[0] == '/';
  if (!doaccess) {
  -   if ((p = TZDIR) == NULL)
  +   if ((p = TZDIR) == NULL) {
  +   free(fullname);
  return -1;
  -   if ((strlen(p) + 1 + strlen(name) + 1) = sizeof 
  fullname)
  +   }
  +   if ((strlen(p) + 1 + strlen(name) + 1) = sizeof 
  fullname) {
 
 This sizeof is now the sizeof of a pointer. The comparison should be
 against FILENAME_MAX + 1 instead.
 
 Alternatively, the name could be created using asprintf().

You are right. I fixed the defect.
Updated patch below.

diff --git a/contrib/tzcode/stdtime/localtime.c 
b/contrib/tzcode/stdtime/localtime.c
index 80b70ac..b1981b6 100644
--- a/contrib/tzcode/stdtime/localtime.c
+++ b/contrib/tzcode/stdtime/localtime.c
@@ -380,13 +380,16 @@ register const intdoextend;
int fid;
int stored;
int nread;
+   int res;
union {
struct tzhead   tzhead;
charbuf[2 * sizeof(struct tzhead) +
2 * sizeof *sp +
4 * TZ_MAX_TIMES];
-   } u;
+   } *u;
 
+   u = NULL;
+   res = -1;
sp-goback = sp-goahead = FALSE;
 
/* XXX The following is from OpenBSD, and I'm not sure it is correct */
@@ -406,16 +409,24 @@ register const intdoextend;
** to hold the longest file name string that the implementation
** guarantees can be opened.
*/
-   charfullname[FILENAME_MAX + 1];
+   char*fullname;
+
+   fullname = malloc(FILENAME_MAX + 1);
+   if (fullname == NULL)
+   goto out;
 
if (name[0] == ':')
++name;
doaccess = name[0] == '/';
if (!doaccess) {
-   if ((p = TZDIR) == NULL)
+   if ((p = TZDIR) == NULL) {
+   free(fullname);
return -1;
-   if ((strlen(p) + 1 + strlen(name) + 1) = sizeof 
fullname)
+   }
+   if (strlen(p) + 1 + strlen(name) = FILENAME_MAX) {
+   free(fullname);
return -1;
+   }
(void) strcpy(fullname, p);
(void) strcat(fullname, /);
(void) strcat(fullname, name);
@@ -426,37 +437,45 @@ register const intdoextend;
doaccess = TRUE;
name = fullname;
}
-   if (doaccess  access(name, R_OK) != 0)
+   if (doaccess  access(name, R_OK) != 0) {
+   free(fullname);
return -1;
-   if ((fid = _open(name, OPEN_MODE)) == -1)
+   }
+   if ((fid = _open(name, OPEN_MODE)) == -1) {
+   free(fullname);
return -1;
+   }
if ((_fstat(fid, stab)  0) || !S_ISREG(stab.st_mode)) {
+   free(fullname);
_close(fid);
return -1;
}
}
-   nread = _read(fid, u.buf, sizeof u.buf);
+   u = malloc(sizeof(*u));
+   if (u == NULL)
+   goto out

Re: cvsup broken on amd64?

2011-09-18 Thread Kostik Belousov

On Sun, Sep 18, 2011 at 02:46:24PM +0200, Oliver Lehmann wrote:
 
 Kostik Belousov kostik...@gmail.com wrote:
 
 Did you saw the message with the patch for tzcode I mailed to you ?
 
 Mmmh... no didn't reached  my mailbox - can you resend it please?

See the Segfault in libthr.so on 9.0-BETA2 (with stunnel FWIW) thread
on the current@, where you are explicitely Cc:ed. I posted updated patch
a minute ago.


pgpbmqafpbDxl.pgp
Description: PGP signature

Re: Crashes in world built w/ clang: FP registers?

2011-09-17 Thread Kostik Belousov

On Fri, Sep 16, 2011 at 10:34:40PM -0500, Jason Harmening wrote:
 Hi everyone,
 
 Using clang as the default compiler, the kernel and drivers will work
 fine, but a lot of programs in the base system and ports will crash w/
 SIGBUS.  In fact, so much of the stuff in the chroot'ed world will
 crash (everything from csh to gcc) that it's basically unusable.  I
 finally got around to building w/ debug symbols, and ran gdb on a
 coredump generated while I was trying to use tab completion in csh:
 
 (gdb) bt
 #0  tw_collect (command=dwarf2_read_address: Corrupted DWARF expression.)
 at /usr/src/bin/csh/../../contrib/tcsh/tw.parse.c:1308
 #1  0x0042777b in t_search (word=Unhandled dwarf expression opcode 
 0x0)
 at /usr/src/bin/csh/../../contrib/tcsh/tw.parse.c:1725
 #2  0x00426829 in tenematch (inputline=Variable inputline is
 not avail   able.)
 at /usr/src/bin/csh/../../contrib/tcsh/tw.parse.c:301
 #3  0x0043545d in Inputl ()
 at /usr/src/bin/csh/../../contrib/tcsh/ed.inputl.c:415
 #4  0x00417a90 in readc (wanteof=Variable wanteof is not available.)
 at /usr/src/bin/csh/../../contrib/tcsh/sh.lex.c:1653
 #5  0x00416f37 in lex (hp=Variable hp is not available.)
 at /usr/src/bin/csh/../../contrib/tcsh/sh.lex.c:162
 #6  0x00405afb in process (catch=Unhandled dwarf expression opcode 
 0x0)
 at /usr/src/bin/csh/../../contrib/tcsh/sh.c:1922
 #7  0x00404b51 in main (argc=Variable argc is not available.)
 at /usr/src/bin/csh/../../contrib/tcsh/sh.c:1289
 
 gdb) disas
 Dump of assembler code for function tw_collect:
 0x004288b0 tw_collect+0:  push   %rbp
 0x004288b1 tw_collect+1:  mov%rsp,%rbp
 0x004288b4 tw_collect+4:  push   %r15
 0x004288b6 tw_collect+6:  push   %r14
 0x004288b8 tw_collect+8:  push   %r13
 0x004288ba tw_collect+10: push   %r12
 0x004288bc tw_collect+12: push   %rbx
 0x004288bd tw_collect+13: sub$0x2e8,%rsp
 0x004288c4 tw_collect+20: mov%r9,-0x308(%rbp)
 0x004288cb tw_collect+27: mov%r8,-0x300(%rbp)
 0x004288d2 tw_collect+34: mov%rcx,-0x2f8(%rbp)
 0x004288d9 tw_collect+41: mov%rdx,-0x2f0(%rbp)
 0x004288e0 tw_collect+48: mov%esi,-0x2e8(%rbp)
 0x004288e6 tw_collect+54: mov%edi,-0x2e4(%rbp)
 0x004288ec tw_collect+60: movl   $0x0,-0x1d4(%rbp)
 0x004288f6 tw_collect+70: movaps 0x23115b(%rip),%xmm0
 # 0x6   59a58
 reslab+48
This is actually 0x659a58 reslab+48
movaps tried to load %xmm0 from the unaligned address, which is forbidden
and causes #GP.

I have no idea why clang generates unaligned loads.
 0x004288fd tw_collect+77: lea-0x2(%rdi),%eax
 0x00428900 tw_collect+80: mov%eax,-0x2e0(%rbp)
 0x00428906 tw_collect+86: test   %edi,%edi
 0x00428908 tw_collect+88: movaps %xmm0,-0x210(%rbp)
 0x0042890f tw_collect+95: sete   %al
 ---Type return to continue, or q return to quit---q
 Quit
 (gdb) info line tw.parse.c:1308
 Line 1308 of /usr/src/bin/csh/../../contrib/tcsh/tw.parse.c
starts at address 0x4288f6 tw_collect+70
and ends at 0x4288fd tw_collect+77.
 
 
 Looks like it's crashing as soon as it tries to use the XMM registers.
  I'm not sure if all of the crashes I'm getting are like this one, but
 I was surprised to see FP registers in code like this.
 
 I'm using march=corei7 and -O2 for both world and kernel, but using
 march=nocona or just leaving out CPUTYPE has no effect (actual CPU is
 Nehalem Xeon 5520)
 Here's the relevant part of make.conf for completeness:
 
 .if !defined(CC) || ${CC} == cc
 CC=clang
 .endif
 .if !defined(CXX) || ${CXX} == c++
 CXX=clang++
 .endif
 .if !defined(CPP) || ${CPP} == cpp
 CPP=clang -E
 .endif
 NO_WERROR=
 WERROR=
 NO_FSCHG=
 CPUTYPE?=corei7
 CFLAGS= -O2 -pipe
 COPTFLAGS= -O2 -pipe
 
 Any thoughts? Is there some simple fix for this I'm missing?
 
 Thanks,
 Jason
 ___
 freebsd-current@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-current
 To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


pgpLoLcvYvQtt.pgp
Description: PGP signature

Re: Segfault in libthr.so on 9.0-BETA2 (with stunnel FWIW)

2011-09-14 Thread Kostik Belousov

On Wed, Sep 14, 2011 at 02:36:07PM +0200, Jeremie Le Hen wrote:
 Hi list,
 
 I've recently migrated my services from a box running 8.1-STABLE to
 another one running 9.0-BETA2.
 
 I run stunnel 4.28 on 8.1-STABLE, and it has run flawlessly so far.  I
 compiled manually this very version on 9.0-BETA2.  But I get the
 following segfault:
 
 Program received signal SIGSEGV, Segmentation fault.
 [Switching to Thread 803008c00 (LWP 100496/stunnel)]
 0x00080110d359 in gmtime_r () from /lib/libc.so.7
 (gdb) thread
 [Current thread is 3 (Thread 803008c00 (LWP 100496/stunnel))]
 (gdb) bt
 #0  0x00080110d359 in gmtime_r () from /lib/libc.so.7
 #1  0x00080110cdde in gmtime_r () from /lib/libc.so.7
 #2  0x00080110dab4 in gmtime_r () from /lib/libc.so.7
 #3  0x00080110dcc8 in gmtime_r () from /lib/libc.so.7
 #4  0x000800e1d9e8 in pthread_once () from /lib/libthr.so.3
 #5  0x00080110ca9f in timegm () from /lib/libc.so.7
 #6  0x000805dff8d9 in OPENSSL_gmtime () from 
 /usr/local/lib/libcrypto.so.7
 #7  0x000805e74631 in ASN1_UTCTIME_adj () from 
 /usr/local/lib/libcrypto.so.7
 #8  0x000805e9462d in X509_time_adj_ex () from 
 /usr/local/lib/libcrypto.so.7
 #9  0x000805e9478c in X509_cmp_time () from 
 /usr/local/lib/libcrypto.so.7
 #10 0x000805e9496d in internal_verify () from 
 /usr/local/lib/libcrypto.so.7
 #11 0x000805e95f46 in X509_verify_cert () from 
 /usr/local/lib/libcrypto.so.7
 #12 0x000805b7f4c8 in ssl_verify_cert_chain () from 
 /usr/local/lib/libssl.so.7
 #13 0x000805b5d6e3 in ssl3_get_client_certificate () from 
 /usr/local/lib/libssl.so.7
 #14 0x000805b612bc in ssl3_accept () from /usr/local/lib/libssl.so.7
 #15 0x00406f6e in init_ssl (c=0x803093000) at client.c:329
 #16 0x004069a6 in do_client (c=0x803093000) at client.c:202
 #17 0x0040676b in run_client (c=0x803093000) at client.c:150
 #18 0x004066cf in client (arg=0x803093000) at client.c:123
 #19 0x000800e18224 in pthread_getprio () from /lib/libthr.so.3
 #20 0x in ?? ()
 
 
 Note that I tried with the newest version of stunnel, it crashes at the
 same place.  I also tried libssl.so both from the base system and from
 the ports, same thing.

You need to compile both libc and libthr with debugging symbols and
do a backtrace with such libraries.


pgpvXUeoZClco.pgp
Description: PGP signature

1 2 3 >

1 - 100 of 284 matches

Mail list logo