On Sun, 08 May 2011 21:43:12 CST, Bdale Garbee writes:
>> just did that, too: no change, same lockup. do you want me to add a tad
>> of printf debugging to the code in question?
>
>Sure.  The situation is basically that it would be a lot of work that I 
>don't have time for in the next few weeks to try and get set up to work
>on this myself on sparc.  So anything you can do to help figure out what
>the problem is would be greatly appreciated!

sorry for the delay - but i've got a more precise diagnosis now and a somewhat
inelegant fix, too.

for some reason, SYS_clone with just the CLONE_IO option doesn't return 0 to
the newly created process/thread - at least on sparc. (the clone manpage doesn't
say anything about the return value for the new process.)

instead, the child receives the pid of the original process as the syscall
result. but dump expects clone to behave exactly like fork, and hence gets
badly confused by these two apparent parent processes.

here's a patch that just adds the relevant printfs:
--- tape.c.orig	2011-05-17 21:52:02.000000000 +1000
+++ tape.c.plusdebug	2011-05-17 21:54:07.000000000 +1000
@@ -795,7 +795,24 @@
 pid_t
 fork_clone_io(void)
 {
-	return syscall(SYS_clone, CLONE_ARGS);
+   /* az */
+   fprintf(stderr,"pid %d before clone, ppid %d, cloneargs 0x%0x\n",
+	   getpid(),getppid(),CLONE_ARGS);
+   int res;
+
+   res=syscall(SYS_clone, CLONE_ARGS);
+   fprintf(stderr,"pid %d post clone, clone res %d, ppid %d\n",
+	   getpid(),res,getppid());
+
+   /* as per clone call manpage: caching! */
+   fprintf(stderr,"pid %d, getpid syscall says: %ld\n",
+	   getpid(),syscall(SYS_getpid));
+   
+   fprintf(stderr,"last pid %d\n",
+	   getpid());
+
+   return res;
+
 }
 #endif
 #endif

the output:

./dump-0.4b43//dump/dump 0f /extra/az/stuff /dev/md0
  DUMP: Date of this level 0 dump: Tue May 17 21:35:21 2011
  DUMP: Dumping /dev/md0 (/) to /extra/az/stuff
  DUMP: Label: rootfs
  DUMP: Writing 10 Kilobyte records
  DUMP: mapping (Pass I) [regular files]
  DUMP: mapping (Pass II) [directories]
  DUMP: estimated 50391 blocks.
pid 29785 before clone, ppid 22540, cloneargs 0x80000014
pid 29785 post clone, clone res 29787, ppid 22540
pid 29785, getpid syscall says: 29785
last pid 29785
pid 29785 post clone, clone res 29785, ppid 29785
pid 29785, getpid syscall says: 29787
last pid 29785

...then the 29787 child goes into a wait() loop, 100% cpu etc.

furthermore, getpid in the child lies: as per the clone manpage there is 
a possibility of getpid cache corruption, and indeed the recommended fallback
(doing a SYS_getpid syscall) returns the right information. (interestingly 
the getpid cache is not updated, and so even later processes created in the next
run of clone, in enslave, all report the original process' pid...)
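
to make the getpid discrepancy easy to check, here's a minimal sketch (not part
of dump; real_getpid and getpid_cache_is_stale are made-up names) that compares
libc's possibly cached answer with what the kernel says via SYS_getpid:

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* hypothetical helper, not part of dump: ask the kernel directly,
   bypassing any pid that libc may have cached */
static pid_t real_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* returns 1 if libc's getpid() disagrees with the kernel, 0 otherwise */
static int getpid_cache_is_stale(void)
{
    return getpid() != real_getpid();
}
```

in an ordinary process the two agree and getpid_cache_is_stale() returns 0; in a
child created via a raw SYS_clone on a libc that caches the pid, it should
return 1, matching the output above.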

but at least clone updates ppid properly: my slightly hacky fix changes the 
fork_clone_io() function to collect the ppid before syscalling clone, and if
there's no ppid change post-clone we're in the parent - return the child's pid, 
otherwise return 0 to emulate fork.

here's the patch that changes fork_clone_io()'s behaviour, still with the 
debug printfs present. with that applied, dump works fine on sparc.

--- tape.c.orig	2011-05-17 21:52:02.000000000 +1000
+++ tape.c.debugandhack	2011-05-17 21:56:18.000000000 +1000
@@ -795,7 +795,28 @@
 pid_t
 fork_clone_io(void)
 {
-	return syscall(SYS_clone, CLONE_ARGS);
+   /* az */
+   fprintf(stderr,"pid %d before clone, ppid %d, cloneargs 0x%0x\n",
+	   getpid(),getppid(),CLONE_ARGS);
+   pid_t res,parent;
+   parent=getppid();		/* az hackety hack... */
+
+   res=syscall(SYS_clone, CLONE_ARGS);
+   fprintf(stderr,"pid %d post clone, clone res %d, ppid %d\n",
+	   getpid(),res,getppid());
+
+   /* as per clone call manpage: caching! */
+   fprintf(stderr,"pid %d, getpid syscall says: %ld\n",
+	   getpid(),syscall(SYS_getpid));
+   
+   fprintf(stderr,"last pid %d\n",
+	   getpid());
+
+   /* az: clone manpage doesn't say jack about what the 
+      child receives, but it's NOT ZERO on sparc. however, it seems the
+      ppid is updated and trustworthy - so let's use that... */
+   return parent==getppid()?res:0;
+
 }
 #endif
 #endif
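
for comparison, a sketch of the same fork emulation without the debug output,
this time telling parent and child apart via SYS_getpid rather than ppid. this
is an assumption based on the output above, where the raw getpid syscall gave
the right answer in the child, and it is untested on sparc.
fork_clone_io_sketch is a made-up name, and the explicit argument list stands in
for dump's CLONE_ARGS macro, whose exact definition isn't shown here:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_IO
#define CLONE_IO 0x80000000 /* value from the linux kernel headers */
#endif

/* hypothetical variant of fork_clone_io(): returns 0 in the child,
   the child's pid in the parent, -1 on failure - like fork */
static pid_t fork_clone_io_sketch(void)
{
    /* remember who we are, asking the kernel rather than libc's cache */
    pid_t before = (pid_t)syscall(SYS_getpid);

    /* flags first, then a NULL stack so the child runs on a copy of ours.
       assumption: this is the common argument order; some architectures
       use a different one */
    long res = syscall(SYS_clone, (unsigned long)(CLONE_IO | SIGCHLD),
                       0L, 0L, 0L, 0L);
    if (res < 0)
        return -1;

    /* a changed kernel pid means we are the new process, whatever
       value the raw syscall happened to return */
    if ((pid_t)syscall(SYS_getpid) != before)
        return 0;

    return (pid_t)res;
}
```

a caller would treat the result exactly like fork's: 0 means child, a positive
value is the child's pid, -1 means the clone failed.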

i'd suggest maybe involving dave miller or one of the other kernel/sparc gurus,
to determine why sparc differs in its handling of SYS_clone with only the
CLONE_IO option, and whether it's SYS_clone at fault or dump just happens to
work on other platforms because SYS_clone might behave more like fork there...

regards
az


-- 
+ Alexander Zangerl + DSA 42BD645D + (RSA 5B586291)
If debugging is the process of removing bugs, then programming must be the 
process of putting them in. -- Dykstra
