Re: kernel panic in gsync_wait

2016-11-03 Thread Brent W. Baccala
On Thu, Nov 3, 2016 at 2:18 PM, Samuel Thibault 
wrote:

> Brent W. Baccala, on Thu 03 Nov 2016 14:12:41 -1000, wrote:
> > I suspect that this ultimately affects just about every program on a
> > Hurd system.
>
> Sure, it's used internally by glibc. But see the commit I made to
> gnumach: that makes gnumach return an error. glibc thus doesn't actually
> wait, just spinlock, i.e. behave correctly, just not so efficiently.
>

I see... so there must be fallback code (option 2 on my list); I just
haven't found it.

Where is KERN_FAILURE handled in user space?

agape
brent


Re: kernel panic in gsync_wait

2016-11-03 Thread Samuel Thibault
Brent W. Baccala, on Thu 03 Nov 2016 14:12:41 -1000, wrote:
> I suspect that this ultimately affects just about every program on a
> Hurd system.

Sure, it's used internally by glibc. But see the commit I made to
gnumach: that makes gnumach return an error. glibc thus doesn't actually
wait, just spinlock, i.e. behave correctly, just not so efficiently.

Samuel



Re: kernel panic in gsync_wait

2016-11-03 Thread Brent W. Baccala
Hi -

I've been trying to figure what to do with the gsync code.  It causes
undefined behavior and occasional kernel panics when rpctrace is used on
gsync_wait and gsync_wake calls.  I suspect that this ultimately affects
just about every program on a Hurd system.  Even if a program doesn't call
locking primitives directly, they're used by the C library, right?

On Sun, Oct 30, 2016 at 10:16 PM, Kalle Olavi Niemitalo  wrote:

> "Brent W. Baccala"  writes:
>
> > Even if I'm right about the nature of this bug, I don't understand
> gnumach
> > well enough to know how a task should access another task's memory.
>
> vm_copy apparently supports such access; code from there could be
> reused.  But if rpctrace uses gsync_wait on the address space of
> another task and the page has been paged out, then the call could
> end up blocking for the pager, and I don't think you want that.
>

I studied vm_copy and came up with some code, but gsync_wake needs to
actually modify the memory location, so setting up a copy doesn't work.
Plus, vm_copy makes a mapping that is visible to user space and leaves the
user space code responsible for deallocating it.

As for blocking for the pager, I think the code does that already.  As
innocuous as this line of code looks:

 *(unsigned int *)addr = val;

it can trigger a page fault and block the RPC, right?

Here's what I think gsync_wait and gsync_wake need to do if their task
argument isn't the current task:

1. lookup the address in the other task's pagemap, flag the map entry
'shared', and insert a copy of it into the kernel pagemap,

2. access the memory location, triggering a page fault which actually
creates the physical mapping, since Mach doesn't create physical maps until
required, then

3a. either remove the pagemap entry from the kernel map when we're done, or
3b. leave it for future use.

If we choose option (3a), then every wrapped call to gsync_{wait,wake} will
trigger a page fault.  If we choose option (3b), then step 1 needs to be
modified to search the kernel pagemap to see if it already has a copy of
the map entry in question.  I have to think some more to figure how that
mapping ultimately gets removed, and of course there's a memory exhaustion
issue if we add lots of new entries to the kernel map on a 32-bit system.

Another possibility would be to reject the RPC back to user space, and put
the old locking code back into glibc as a fallback.

All this brings us back to...

On Sun, Oct 30, 2016 at 10:16 PM, Kalle Olavi Niemitalo  wrote:

>
> Could gsync_wait be removed from gnumach.defs and replaced with
> only a trap that does not take a task_t parameter and cannot be
> intercepted by rpctrace?
>

I don't think we have anything else quite like gsync - kernel code that
directly accesses user space memory in a different task from the one that
trapped.  Our choices seem to be:

1. add a lot of new complexity to the gsync kernel routines
2. add fallback code to glibc
3. add a new system call as Kalle proposed

Comments?

agape
brent


Re: two more rpctrace patches

2016-11-03 Thread Brent W. Baccala
On Wed, Nov 2, 2016 at 2:39 PM, Brent W. Baccala 
wrote:

> On Wed, Nov 2, 2016 at 2:26 PM, Kalle Olavi Niemitalo  wrote:
>
>>
>> Look at how the commit messages in hurd.git at Savannah (not at
>> Debian) are formatted.  "make dist" runs gitlog-to-changelog,
>> which generates a ChangeLog file from those.
>>
>
> OK, I think I see what you want.
>

Was that last patch in the desired format?

Would you like me to reformat the commit messages on the last ten rpctrace
patches and resubmit them?

agape
brent


Re: spin gsync patch

2016-11-03 Thread Svante Signell
On Wed, 2016-11-02 at 09:52 +0100, Svante Signell wrote:
> On Tue, 2016-11-01 at 16:20 +0100, Samuel Thibault wrote:
> > 
> > Svante Signell, on Tue 01 Nov 2016 16:19:19 +0100, wrote:
> > > 
> > > 
> > > I have now rebuilt gcc-2.24-6 without hurd-i386/tg-libpthread-gsync-
> > > spin.diff
> > > and do not experience any filesystem corruptions when sync is issued for
> > > the
> > > patch target in gcc-6.
> > Ok, so that confirms that it's the culprit.
> > 
> > > 
> > > 
> > > I'll rebuild with your updated patch ASAP. BBL.
> > Thanks!
> > Samuel
> With the revised patch the filesystem corruption is no longer, yay!
> Back to porting and debugging gccgo for GNU/Hurd...

Sorry to bring bad news. The filesystem corruption is back, but not as bad as
before :( And when disks are checked the date is again set to: Tue Apr 13
23:07:40 2032. Going back to the version without the tg-libpthread-gsync-
spin.diff patch.