On Sun, 08 May 2011 21:43:12 CST, Bdale Garbee writes:
>> just did that, too: no change, same lockup. do you want me to add a tad
>> of printf debugging to the code in question?
>
>Sure. The situation is basically that it would be a lot of work that I
>don't have time for in the next few weeks to try and get set up to work
>on this myself on sparc. So anything you can do to help figure out what
>the problem is would be greatly appreciated!

sorry for the delay - but i've got a more precise diagnosis now and a
somewhat inelegant fix, too.

for some reason, SYS_clone with just the CLONE_IO option doesn't return 0
to the newly created process/thread - at least on sparc. (the clone manpage
doesn't say anything about the return value for the new process.)
instead, the child receives the pid of the original process as syscall result.

but dump expects clone to behave exactly like fork, hence gets badly
confused by these two apparent parent processes.

here's a patch that just adds the relevant printfs:

--- tape.c.orig 2011-05-17 21:52:02.000000000 +1000
+++ tape.c.plusdebug 2011-05-17 21:54:07.000000000 +1000
@@ -795,7 +795,24 @@
pid_t
fork_clone_io(void)
{
- return syscall(SYS_clone, CLONE_ARGS);
+ /* az */
+ fprintf(stderr,"pid %d before clone, ppid %d, cloneargs 0x%0x\n",
+ getpid(),getppid(),CLONE_ARGS);
+ int res;
+
+ res=syscall(SYS_clone, CLONE_ARGS);
+ fprintf(stderr,"pid %d post clone, clone res %d, ppid %d\n",
+ getpid(),res,getppid());
+
+ /* as per clone call manpage: caching! */
+ fprintf(stderr,"pid %d, getpid syscall says: %d\n",
+ getpid(),(int)syscall(SYS_getpid));
+
+ fprintf(stderr,"last pid %d\n",
+ getpid());
+
+ return res;
+
}
#endif
#endif

the output:

./dump-0.4b43//dump/dump 0f /extra/az/stuff /dev/md0
DUMP: Date of this level 0 dump: Tue May 17 21:35:21 2011
DUMP: Dumping /dev/md0 (/) to /extra/az/stuff
DUMP: Label: rootfs
DUMP: Writing 10 Kilobyte records
DUMP: mapping (Pass I) [regular files]
DUMP: mapping (Pass II) [directories]
DUMP: estimated 50391 blocks.
pid 29785 before clone, ppid 22540, cloneargs 0x80000014
pid 29785 post clone, clone res 29787, ppid 22540
pid 29785, getpid syscall says: 29785
last pid 29785
pid 29785 post clone, clone res 29785, ppid 29785
pid 29785, getpid syscall says: 29787
last pid 29785

...then the 29787 child goes into a wait() loop, 100% cpu etc.

furthermore, getpid in the child lies: as per the clone manpage there
is a possibility of getpid cache corruption, and indeed the recommended
fallback (doing a SYS_getpid syscall) returns the right information.
(interestingly the getpid cache is not updated, and so even later
processes created in the next run of clone, in enslave, all report
the original process' pid...)

but at least clone updates ppid properly: my slightly hacky fix changes
the fork_clone_io() function to collect the ppid before syscalling clone,
and if there's no ppid change post-clone we're in the parent - return
the child's pid, otherwise return 0 to emulate fork.

here's the patch that changes fork_clone_io()'s behaviour, still with the
debug printfs present. with that applied, dump works fine on sparc.

--- tape.c.orig 2011-05-17 21:52:02.000000000 +1000
+++ tape.c.debugandhack 2011-05-17 21:56:18.000000000 +1000
@@ -795,7 +795,28 @@
pid_t
fork_clone_io(void)
{
- return syscall(SYS_clone, CLONE_ARGS);
+ /* az */
+ fprintf(stderr,"pid %d before clone, ppid %d, cloneargs 0x%0x\n",
+ getpid(),getppid(),CLONE_ARGS);
+ pid_t res,parent;
+ parent=getppid(); /* az hackety hack... */
+
+ res=syscall(SYS_clone, CLONE_ARGS);
+ fprintf(stderr,"pid %d post clone, clone res %d, ppid %d\n",
+ getpid(),res,getppid());
+
+ /* as per clone call manpage: caching! */
+ fprintf(stderr,"pid %d, getpid syscall says: %d\n",
+ getpid(),(int)syscall(SYS_getpid));
+
+ fprintf(stderr,"last pid %d\n",
+ getpid());
+
+ /* az: clone manpage doesn't say jack about what the
+ child receives, but it's NOT ZERO on sparc. however, it seems the
+ ppid is updated and trustworthy - so let's use that... */
+ return parent==getppid()?res:0;
+
}
#endif
#endif

i'd suggest maybe involving dave miller or one of the other kernel/sparc
gurus, to determine why sparc differs in its handling of SYS_clone with
only the CLONE_IO option and whether it's SYS_clone at fault or dump just
happens to work on other platforms because SYS_clone might behave more
closely like fork there...

regards
az

-- 
+ Alexander Zangerl + DSA 42BD645D + (RSA 5B586291)
If debugging is the process of removing bugs, then programming must
be the process of putting them in. -- Dykstra

