Re: OS support for fault tolerance

2012-02-14 Thread Julian Elischer

On 2/14/12 3:51 PM, Jan Mikkelsen wrote:


Coming back to the multicore issue:

The problem when a core fails is that it has affected more than its own state. 
It will be holding locks on shared resources and may have corrupted shared 
memory or asked a device to do the wrong thing. By the time you detect a fault 
in a core, it is too late. Checkpointing to main memory means that you need to 
be able to roll back to a checkpoint, and replay operations you know about. 
That involves more than CPU core state: it includes process, file and device 
state.

I think that's more or less what I was saying, but with more concrete 
examples.
And yes, I remember the Tandem boxes from computer shows in Perth and 
Sydney, but never saw one in the field.


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: OS support for fault tolerance

2012-02-14 Thread Jim Bryant
Mirrored SMP?  Even NonStops require a supervisory CPU subsystem to 
manage what is working or not.


SMP itself would have to be totally rethought.

My suggestion is to study the examples of NonStop and Guardian-90.

Julian Elischer wrote:

On 2/14/12 6:23 AM, Maninya M wrote:

For multicore desktop computers, suppose one of the cores fails, the
FreeBSD OS crashes. My question is about how I can make the OS tolerate
this hardware fault.
The strategy is to checkpoint the state of each core at specific intervals
of time in main memory. Once a core fails, its previous state is retrieved
from the main memory, and the processes that were running on it are
rescheduled on the remaining cores.

I read that the OS tolerates faults in large servers. I need to make it do
this for a Desktop OS. I assume I would have to change the scheduler
program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
How do I go about doing this? What exactly do I need to save for the
"state" of the core? What else do I need to know?
I have absolutely no experience with kernel programming or with FreeBSD.
Any pointers to good sources about modifying the source-code of FreeBSD
would be greatly appreciated.

This question has always intrigued me, because I'm always amazed
that people actually try.
From my viewpoint, there's really not much you can do if the core
that is currently holding the scheduler lock fails.
And what do you mean by "fails"?  do you run constant diagnostics?
how do you tell when it has failed? It'd be hard to detect that 'multiply'
has suddenly started giving bad results now and then.

if it just "stops" then you might be able to have a watchdog that
notices,  but what do you do when it was half way through rearranging
a list of items? First, you have to find out that it held
the lock for the module and then you have to find out what it had
done and clean up the mess.

This requires rewriting many many parts of the kernel to remove
'transient inconsistent states'. and even then, what do you do if it
was half way through manipulating some hardware..

and when you've figured that all out, how do you cope with the
mess it made because it was dying?
Say for example it had started calculating bad memory offsets
before writing out some stuff and written data out over random memory?

but I'm interested in any answers people may have




Re: OS support for fault tolerance

2012-02-14 Thread Jim Bryant



Brandon Falk wrote:

On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
  

On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:


On 2/14/12 6:23 AM, Maninya M wrote:
  

For multicore desktop computers, suppose one of the cores fails, the
FreeBSD OS crashes. My question is about how I can make the OS tolerate
this hardware fault.
The strategy is to checkpoint the state of each core at specific intervals
of time in main memory. Once a core fails, its previous state is retrieved
from the main memory, and the processes that were running on it are
rescheduled on the remaining cores.

I read that the OS tolerates faults in large servers. I need to make it do
this for a Desktop OS. I assume I would have to change the scheduler
program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
How do I go about doing this? What exactly do I need to save for the
"state" of the core? What else do I need to know?
I have absolutely no experience with kernel programming or with FreeBSD.
Any pointers to good sources about modifying the source-code of FreeBSD
would be greatly appreciated.


This question has always intrigued me, because I'm always amazed
that people actually try.
 From my viewpoint, There's really not much you can do if the core
that is currently holding the scheduler lock fails.
And what do you mean by 'fails"?  do you run constant diagnostics?
how do you tell when it is failed? It'd be hard to detect that 'multiply'
has suddenly started giving bad results now and then.

if it just "stops" then you might be able to have a watchdog that
notices,  but what do you do when it was half way through rearranging
a list of items? First, you have to find out that it held
the lock for the module and then you have to find out what it had
done and clean up the mess.

This requires rewriting many many parts of the kernel to remove
'transient inconsistent states". and even then, what do you do if it
was half way through manipulating some hardware..

and when you've figured that all out, how do you cope with the
mess it made because it was dying?
Say for example it had started calculating bad memory offsets
before writing out some stuff and written data out over random memory?

but I'm interested in any answers people may have

  

How about core redundancy? Effectively this would cut the number of
available cores in half if you spread a process to run on two cores at
the same time, but with an option to adjust this per process etc... I
don't see it as unfeasible.




The overhead for all of the error checking and redundancy makes this idea pretty
impractical. You'd have to have 2 cores to do the exact same thing, then some
'master' core that makes sure they're doing the right stuff, and if you really
want to think about it... what if the core monitoring the cores fails... there's
a threshold of when redundancy gets pointless.

Perhaps I'm missing out on something, but you can't check the checker (without
infinite redundancy).

Honestly, if you're worried about a core failing, please take your server
cluster out of the 1000 deg C forge.

-Brandon

  
Don't forget that cache would have to be redundant too.  The redundant 
cores must not share an on-die cache.


Oh, and the real biggie.  What about the chipset and busses???  
Those would NOT be redundant.




Re: OS support for fault tolerance

2012-02-14 Thread Rayson Ho
On Tue, Feb 14, 2012 at 6:01 PM, Julian Elischer  wrote:
> True, but you can't guarantee that a cpu is going to fail in a way that you
> can detect like that. what if the clock just stops..

The question is, are we planning to handle >95% of the errors for >99%
of the hardware we run on, or are we really planning to spend years
trying to design something that would require special hardware
support?

On the zSeries mainframe, the instructions are executed in lock
step on the redundant instruction pipeline, and if the results don't
match, the instruction is re-executed. This happens on every
load and store.
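
As a rough software analogy only (this is not how the zSeries hardware works internally), the compare-and-retry idea can be sketched at the granularity of a whole command rather than a single instruction. The `checked` helper below is hypothetical: it assumes the command is deterministic and free of side effects, runs it twice, accepts the output only if both runs agree, and retries a bounded number of times otherwise.

```sh
#!/bin/sh
# checked CMD [ARGS...]: hypothetical helper, not a real utility.
# Runs a deterministic command twice and prints its output only if
# both runs succeed and agree; otherwise retries, up to 3 attempts.
checked() {
    tries=0
    while [ "$tries" -lt 3 ]; do
        if a=$("$@") && b=$("$@") && [ "$a" = "$b" ]; then
            printf '%s\n' "$a"
            return 0
        fi
        tries=$((tries + 1))
    done
    return 1    # gave up: the runs failed or kept disagreeing
}

checked expr 6 '*' 7
```

Note that both runs here share the same CPU and memory, so this only illustrates the control flow; it would not catch a fault that corrupts both runs in the same way.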

Now, if you want software to do the same thing, you will need to
somehow checkpoint the state of not only the processor, but the memory
as well, or else if the bad processor stores something to memory you
will still get corrupted data. Not only would the kernel become very
complicated, it would also make the system very slow. And what if the
checkpointing code is executed by faulty processors??

IIRC, processors & disks don't usually just fail. That's the whole
idea behind SMART, and Fault Management in Solaris & other kernels.

http://hub.opensolaris.org/bin/view/Community+Group+fm/

Rayson

=
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/



> I believe that even those systems that
> support cpu deactivation on
> error only catch some percentage of the problems, and that sometimes it was
> more of
> "bring up the system without cpu X after it all crashed in flames".
>
> tandem and other systems in the old days used to be able to cope with dying
> cpus pretty well
> but they had support from top to bottom and the software was written with
> 'clustering' in mind.


RE: OS support for fault tolerance

2012-02-14 Thread Devin Teske


> -Original Message-
> From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-
> hack...@freebsd.org] On Behalf Of Julian Elischer
> Sent: Tuesday, February 14, 2012 3:02 PM
> To: Rayson Ho
> Cc: Maninya M; freebsd-hackers@freebsd.org
> Subject: Re: OS support for fault tolerance
> 
> On 2/14/12 9:27 AM, Rayson Ho wrote:
> > On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer wrote:
> >> but I'm interested in any answers people may have
> > The way other OSes handle this is by detecting any abnormal amounts of
> > faults (sometimes it's not the fault of the hardware - eg. when a
> > particle from outer space hits a core and flips a bit), then they
> > disable the core(s).
> >
> > Solaris & mainframe (z/OS) handle it this way, but you should google
> > and find more info since I don't remember all the details.
> >
> > Also, see this presentation: "Getting to know the Solaris Fault
> > Management Architecture (FMA)":
> >
> > http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf
> True, but you can't guarantee that a cpu is going to fail in a way
> that you can detect like that.
> what if the clock just stops..  I believe that even those systems that
> support cpu deactivation on
> error only catch some percentage of the problems, and that sometimes
> it was more of
> "bring up the system without cpu X after it all crashed in flames".
> 
> tandem and other systems in the old days used to be able to cope with
> dying cpus pretty well
> but they had support from top to bottom and the software was written
> with 'clustering' in mind.
> 

Nowadays NEC has their sixth-generation "Fault Tolerant (FT) Series" servers
which are pretty much like the Tandem servers.

We got a live demo of [simulated] CPU failure and the system kept chugging
along.

But as Julian says, it's not guaranteed that the CPU will always fail in a
predictable way (however, NEC has produced a VERY nice redundant package with
256-bit backplane to keep everything nice and lock-step).
-- 
Devin

_
The information contained in this message is proprietary and/or confidential. 
If you are not the intended recipient, please: (i) delete the message and all 
copies; (ii) do not disclose, distribute or use the message in any manner; and 
(iii) notify the sender immediately. In addition, please be aware that any 
message addressed to our domain is subject to archiving and review by persons 
other than the intended recipient. Thank you.


Re: OS support for fault tolerance

2012-02-14 Thread Jan Mikkelsen

On 15/02/2012, at 3:57 AM, Julian Elischer wrote:

> On 2/14/12 6:23 AM, Maninya M wrote:
>> For multicore desktop computers, suppose one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from the main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>> 
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a Desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source-code of FreeBSD
>> would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, There's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by 'fails"?  do you run constant diagnostics?
> how do you tell when it is failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
> 
> if it just "stops" then you might be able to have a watchdog that
> notices,  but what do you do when it was half way through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module and then you have to find out what it had
> done and clean up the mess.
> 
> This requires rewriting many many parts of the kernel to remove
> 'transient inconsistent states". and even then, what do you do if it
> was half way through manipulating some hardware..
> 
> and when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say for example it had started calculating bad memory offsets
> before writing out some stuff and written data out over random memory?
> 
> but I'm interested in any answers people may have

Back in the '90s I spent a bunch of time looking at and using systems that 
dealt with this kind of failure.

There are two basic approaches: With software support and without. The basic 
distinction is what the hardware can do when something breaks. Is it able to 
continue, or must it stop immediately?

Tandem had systems with both approaches:

The NonStop proprietary operating system had nodes with lock-step processors 
and lots of error checking that would stop immediately when something broke. A 
CPU failure turned into a node halt. There was a bunch of work to have nodes 
move their state around so that terminal sessions would not be interrupted, 
transactions would be rolled back, and everything would be in a consistent 
state.

The Integrity Unix range was based on MIPS RISC/os, with a lot of work at 
Tandem. We had the R2000 and later the R3000 based systems. They had three CPUs 
all in lock step with voting ("triple modular redundancy"), and entirely 
duplicated memory, all with ECC. Redundant busses, separate cabinets for 
controllers and separate cabinets for each side of the disk mirror. You could 
pull out a CPU board and memory board, show a manager, and then plug them back 
in.
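
The voting step in triple modular redundancy can be illustrated with a small sketch. This is only a toy in shell, not anything Tandem shipped: the hypothetical `vote` function takes three results that are supposed to be identical and prints whichever value at least two of them agree on.

```sh
#!/bin/sh
# vote A B C: hypothetical majority voter for three redundant results.
# Prints the value that at least two inputs agree on, or fails if all
# three differ.
vote() {
    if [ "$1" = "$2" ] || [ "$1" = "$3" ]; then
        printf '%s\n' "$1"
    elif [ "$2" = "$3" ]; then
        printf '%s\n' "$2"
    else
        return 1    # no majority: the fault cannot be masked
    fi
}

# One of the three "CPUs" produced a bad result; the majority wins.
vote 42 42 41
```

With three voters a single faulty unit is outvoted and can be identified, which is what makes it possible to pull a board on a running system.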

Tandem claimed to have removed 80% of panics from the kernel, and changed the 
device driver architecture so that they could recover from some driver faults 
by reinitialising driver state on a running system.

We still had some outages on this system, all caused by software. It was also 
expensive: AUD$1,000,000 for a system with the same underlying CPU/memory as a 
$30k MIPS workstation at the time. It was also slower because of the error 
checking overhead. However, it did crash much less than the MIPS boxes.

Coming back to the multicore issue:

The problem when a core fails is that it has affected more than its own state. 
It will be holding locks on shared resources and may have corrupted shared 
memory or asked a device to do the wrong thing. By the time you detect a fault 
in a core, it is too late. Checkpointing to main memory means that you need to 
be able to roll back to a checkpoint, and replay operations you know about. 
That involves more than CPU core state: it includes process, file and device 
state.

The Tandem lesson is that it is much easier when you involve the higher level 
software in dealing with these issues. Building a system where you can make the 
application programmer ignorant of the need to deal with failure is much harder 
than when you expose units of work to the application programmer and can just 
fail a node and replay the work somewhere else. Transactions are your friend.
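
A minimal sketch of the "fail a node and replay the work somewhere else" idea, under heavy assumptions: the node functions and `run_unit` are hypothetical names, and each unit of work is assumed to be safe to replay because a failed attempt leaves no partial effects behind (which is what transactions buy you).

```sh
#!/bin/sh
# Hypothetical "nodes": each takes a unit of work and either completes
# it or fails as a whole. node_a simulates a node that has halted.
node_a() { return 1; }
node_b() { echo "unit $1 done on node_b"; }

# run_unit UNIT NODE...: replay the same unit on each node in turn
# until one of them completes it.
run_unit() {
    unit=$1
    shift
    for node in "$@"; do
        if "$node" "$unit"; then
            return 0
        fi
    done
    return 1    # every node failed
}

run_unit 17 node_a node_b
```

The caller never needs to know what state node_a was in when it died; it only needs to know that the unit did not commit there.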

Lots of literature on this stuff. My favourite is "Transaction Processing: 

Re: OS support for fault tolerance

2012-02-14 Thread Julian Elischer

On 2/14/12 9:27 AM, Rayson Ho wrote:

On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer  wrote:

but I'm interested in any answers people may have

The way other OSes handle this is by detecting any abnormal amounts of
faults (sometimes it's not the fault of the hardware - eg. when a
particle from outer space hits a core and flips a bit), then they
disable the core(s).

Solaris & mainframe (z/OS) handle it this way, but you should google
and find more info since I don't remember all the details.

Also, see this presentation: "Getting to know the Solaris Fault
Management Architecture (FMA)":
http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf
True, but you can't guarantee that a cpu is going to fail in a way
that you can detect like that.
what if the clock just stops..  I believe that even those systems that
support cpu deactivation on error only catch some percentage of the
problems, and that sometimes it was more of
"bring up the system without cpu X after it all crashed in flames".

tandem and other systems in the old days used to be able to cope with
dying cpus pretty well, but they had support from top to bottom and the
software was written with 'clustering' in mind.









Re: xargs short-circuit

2012-02-14 Thread Jos Backus
If you're able to install a port, there is a tool called shmux which you
can invoke with `-r sh'; it may do what you want.

Jos
-- 
Jos Backus
jos at catnook.com


Re: xargs short-circuit

2012-02-14 Thread Matthew Story
On Tue, Feb 14, 2012 at 2:37 PM, Matthew Story wrote:

> On Tue, Feb 14, 2012 at 2:35 PM, Jilles Tjoelker  wrote:
>
>> On Tue, Feb 14, 2012 at 01:34:49PM -0500, Matthew Story wrote:
>> > After reading the man-page, and browsing around the internet for a
>> minute,
>> > I was just wondering if there is an option in (any) xargs to
>> short-circuit
>> > on first failure of [utility [arguments]].
>>
>> > e.g.
>>
>> > $ jot - 1 10 | xargs -e -n1 sh -c 'echo "$*"; echo exit 1' worker ||
>> echo $?
>> > 1
>> > 1
>>
>> > such that any non-0 exit code in a child process would cause xargs to
>> stop
>> > processing.  seems like this would be a nice feature to have.
>>
>> As per xargs(1), you can do this by having the command exit on a signal
>> or with a value of 255.
>>
>
exit 255 with -P, and SIGTERM (with or without -P), cause FreeBSD xargs to
orphan its children. Is this desirable behavior?  findutils xargs orphans on
255 and SIGTERM (with -P), but does not orphan without -P when SIGTERM is
sent.  I would expect xargs to propagate the signal, or wait. Although the
man page does say "immediately", the POSIX specification is less clear ...
this makes it more-or-less unsuitable for my needs, but I guess I could do
something like:

... | xargs sh -c '... exit 255;'
if [ $? -ne 0 ]; then
    wait
    # cleanup
    exit 1
fi



>
> Yes indeed it does ... should have scoured further, thanks!
>
>
>>
>> --
>> Jilles Tjoelker
>>
>
>
>
> --
> regards,
> matt
>



-- 
regards,
matt


Re: xargs short-circuit

2012-02-14 Thread Matthew Story
On Tue, Feb 14, 2012 at 2:35 PM, Jilles Tjoelker  wrote:

> On Tue, Feb 14, 2012 at 01:34:49PM -0500, Matthew Story wrote:
> > After reading the man-page, and browsing around the internet for a
> minute,
> > I was just wondering if there is an option in (any) xargs to
> short-circuit
> > on first failure of [utility [arguments]].
>
> > e.g.
>
> > $ jot - 1 10 | xargs -e -n1 sh -c 'echo "$*"; echo exit 1' worker ||
> echo $?
> > 1
> > 1
>
> > such that any non-0 exit code in a child process would cause xargs to
> stop
> > processing.  seems like this would be a nice feature to have.
>
> As per xargs(1), you can do this by having the command exit on a signal
> or with a value of 255.
>

Yes indeed it does ... should have scoured further, thanks!


>
> --
> Jilles Tjoelker
>



-- 
regards,
matt


Re: xargs short-circuit

2012-02-14 Thread Jilles Tjoelker
On Tue, Feb 14, 2012 at 01:34:49PM -0500, Matthew Story wrote:
> After reading the man-page, and browsing around the internet for a minute,
> I was just wondering if there is an option in (any) xargs to short-circuit
> on first failure of [utility [arguments]].

> e.g.

> $ jot - 1 10 | xargs -e -n1 sh -c 'echo "$*"; echo exit 1' worker || echo $?
> 1
> 1

> such that any non-0 exit code in a child process would cause xargs to stop
> processing.  seems like this would be a nice feature to have.

As per xargs(1), you can do this by having the command exit on a signal
or with a value of 255.
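
A small sketch of the exit-255 behaviour, assuming a serial run (no -P) and using printf in place of jot so that it also runs outside FreeBSD. The worker echoes its argument and exits 255 on item 3, after which xargs should start no further invocations. The exit status xargs itself reports is non-zero, but the exact value differs between implementations, so the sketch only prints it.

```sh
#!/bin/sh
# Feed five items to xargs, one per invocation. The worker echoes its
# argument and exits 255 on item 3, which should make xargs stop.
out=$(printf '%s\n' 1 2 3 4 5 |
    xargs -n1 sh -c 'echo "$1"; [ "$1" -ne 3 ] || exit 255' worker 2>/dev/null)
status=$?

printf '%s\n' "$out"                # items 4 and 5 are never processed
echo "xargs exit status: $status"   # non-zero; value is implementation-specific
```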

-- 
Jilles Tjoelker


RE: xargs short-circuit

2012-02-14 Thread Devin Teske


> -Original Message-
> From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-
> hack...@freebsd.org] On Behalf Of Matthew Story
> Sent: Tuesday, February 14, 2012 11:18 AM
> To: freebsd-hackers@freebsd.org
> Subject: Re: xargs short-circuit
> 
> On Tue, Feb 14, 2012 at 2:05 PM, Devin Teske wrote:
> 
> >
> >
> > > -Original Message-
> > > From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-
> > > hack...@freebsd.org] On Behalf Of Matthew Story
> > > Sent: Tuesday, February 14, 2012 10:35 AM
> > > To: freebsd-hackers@freebsd.org
> > > Subject: xargs short-circuit
> > >
> > > After reading the man-page, and browsing around the internet for a
> > minute,
> > > I was just wondering if there is an option in (any) xargs to
> > short-circuit
> > > on first failure of [utility [arguments]].
> > >
> > > e.g.
> > >
> > > $ jot - 1 10 | xargs -e -n1 sh -c 'echo "$*"; echo exit 1' worker ||
> > echo $?
> > > 1
> > > 1
> > >
> > > such that any non-0 exit code in a child process would cause xargs to
> > stop
> > > processing.  seems like this would be a nice feature to have.
> > >
> >
> > You can achieve this quite easily with a sub-shell:
> >
> > As a bourne-shell script:
> >
> > #!/bin/sh
> > jot - 1 10 | ( while read ARG1 REST; do
> >sh -c 'echo "$*"; exit 1' worker $ARG1 || exit $?
> >shift 1
> > done )
> >
> 
> read is often not sufficient for a variety of reasons, the most notable of
> them is that new-lines are valid in file names on most file systems.

Your original example/post neither requested nor implied that such functionality
was required.

If you need such functionality, then you should be using awk, perl, or some
other heavier-lifting code (can even be sh(1), but you'll sacrifice speed).
--  
Devin



Re: xargs short-circuit

2012-02-14 Thread Matthew Story
On Tue, Feb 14, 2012 at 2:05 PM, Devin Teske wrote:

>
>
> > -Original Message-
> > From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-
> > hack...@freebsd.org] On Behalf Of Matthew Story
> > Sent: Tuesday, February 14, 2012 10:35 AM
> > To: freebsd-hackers@freebsd.org
> > Subject: xargs short-circuit
> >
> > After reading the man-page, and browsing around the internet for a
> minute,
> > I was just wondering if there is an option in (any) xargs to
> short-circuit
> > on first failure of [utility [arguments]].
> >
> > e.g.
> >
> > $ jot - 1 10 | xargs -e -n1 sh -c 'echo "$*"; echo exit 1' worker ||
> echo $?
> > 1
> > 1
> >
> > such that any non-0 exit code in a child process would cause xargs to
> stop
> > processing.  seems like this would be a nice feature to have.
> >
>
> You can achieve this quite easily with a sub-shell:
>
> As a bourne-shell script:
>
> #!/bin/sh
> jot - 1 10 | ( while read ARG1 REST; do
>sh -c 'echo "$*"; exit 1' worker $ARG1 || exit $?
>shift 1
> done )
>

read is often not sufficient for a variety of reasons, the most notable of
them is that new-lines are valid in file names on most file systems.  While
some shells do support a variety of options, POSIX only supports -r (raw,
treat backslashes as literal, not escape).

find . -print0 | xargs -0 -e ...

Is vastly nicer in most cases.  Additionally, xargs provides the
possibility of concurrency, via -P ... while you can spoof this with
trailing & and wait(1) in sh, this is both vastly more complicated than
xargs -P, and not as efficient in spawning jobs. It would be nice to be
able to have xargs stop spawning new jobs on first failure in this case,
and exit at last reap of existing child processes, if the short-circuit flag
is set:

find . -print0 | xargs -n1 -0 -e -P4 ...

My use-case is a CPU-bound operation running with concurrency and many more
jobs than concurrent slots. On failure, xargs continues to work until finished
before reporting failure, which is a large number of wasted cycles and box load.
It would be nice to bail as early as possible in situations where any failure
is fatal to the larger operation.
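
Until such a flag exists, one possible approximation is to wrap the real command so that any non-zero exit is turned into 255, which xargs(1) treats as a request to stop. The `failfast` function below is a hypothetical helper, and it relies on -0 and -P, which are extensions rather than POSIX. It does not address the question of what happens to jobs that are already running when xargs gives up.

```sh
#!/bin/sh
# failfast CMD [ARGS...]: hypothetical wrapper, not an xargs option.
# Reads NUL-delimited items on stdin and runs "CMD ARGS... item" for
# each, up to 4 at a time. The inner sh maps any non-zero exit to 255
# so that xargs stops starting new jobs after the first failure.
failfast() {
    xargs -0 -n1 -P4 sh -c '"$@" || exit 255' worker "$@"
}
```

It would be used along the lines of `find . -type f -print0 | failfast some_command`, where `some_command` stands for whatever CPU-bound job is being run.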


>
> Or interactively in sh/bash:
>
> $ jot - 1 10 | ( while read ARG1 REST; do sh -c 'echo "$*"; exit 1' worker $ARG1 || exit $?; shift 1;done )
>
> Or interactively in csh/tcsh:
>
> % jot - 1 10 | /bin/sh -c 'while read ARG1 REST; do sh -c '\''echo "$*"; exit 1'\'' worker $ARG1 || exit $?; shift 1; done'
>
> --
> Devin
>
>



-- 
regards,
matt


RE: xargs short-circuit

2012-02-14 Thread Devin Teske


> -Original Message-
> From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-
> hack...@freebsd.org] On Behalf Of Matthew Story
> Sent: Tuesday, February 14, 2012 10:35 AM
> To: freebsd-hackers@freebsd.org
> Subject: xargs short-circuit
> 
> After reading the man-page, and browsing around the internet for a minute,
> I was just wondering if there is an option in (any) xargs to short-circuit
> on first failure of [utility [arguments]].
> 
> e.g.
> 
> $ jot - 1 10 | xargs -e -n1 sh -c 'echo "$*"; echo exit 1' worker || echo $?
> 1
> 1
> 
> such that any non-0 exit code in a child process would cause xargs to stop
> processing.  seems like this would be a nice feature to have.
> 

You can achieve this quite easily with a sub-shell:

As a bourne-shell script:

#!/bin/sh
jot - 1 10 | ( while read ARG1 REST; do
    sh -c 'echo "$*"; exit 1' worker $ARG1 || exit $?
    shift 1
done )

Or interactively in sh/bash:

$ jot - 1 10 | ( while read ARG1 REST; do sh -c 'echo "$*"; exit 1' worker $ARG1 || exit $?; shift 1;done )

Or interactively in csh/tcsh:

% jot - 1 10 | /bin/sh -c 'while read ARG1 REST; do sh -c '\''echo "$*"; exit 1'\'' worker $ARG1 || exit $?; shift 1; done'
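
For comparison, here is a variant of the same loop as a sketch, with a few assumptions of its own: the `worker` function is hypothetical and fails only from item 3 onward (so the short-circuit is visible), `read -r` and quoting keep each line intact, and the `shift` is left out because the loop has no positional parameters to shift.

```sh
#!/bin/sh
# Hypothetical worker: succeeds for items below 3, fails otherwise.
worker() { echo "$1"; [ "$1" -lt 3 ]; }

# Read one item per line; stop at the first failing worker and
# propagate its exit status out of the subshell.
out=$(printf '%s\n' 1 2 3 4 5 | (
    while IFS= read -r item; do
        worker "$item" || exit $?
    done
))
status=$?

printf '%s\n' "$out"            # items 4 and 5 are never processed
echo "loop exit status: $status"
```

This still reads one item per line, so it shares the limitation that items containing newlines cannot be passed through it.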

-- 
Devin



Re: xargs short-circuit

2012-02-14 Thread Matthew Story
On Tue, Feb 14, 2012 at 1:34 PM, Matthew Story wrote:

> After reading the man-page, and browsing around the internet for a minute,
> I was just wondering if there is an option in (any) xargs to short-circuit
> on first failure of [utility [arguments]].
>
> e.g.
>
> $ jot - 1 10 | xargs -e -n1 sh -c 'echo "$*"; exit 1' worker || echo $?
# cp error on my part, should not read echo exit 1, just exit 1

> 1
> 1
>
> such that any non-0 exit code in a child process would cause xargs to stop
> processing.  seems like this would be a nice feature to have.
>

apologies for the copy-paste error.

-- 
regards,
matt


Re: OS support for fault tolerance

2012-02-14 Thread Uffe Jakobsen



On 2012-02-14 18:13, Joshua Isom wrote:

On 2/14/2012 10:57 AM, Julian Elischer wrote:

On 2/14/12 6:23 AM, Maninya M wrote:

For multicore desktop computers, suppose one of the cores fails, the
FreeBSD OS crashes. My question is about how I can make the OS tolerate
this hardware fault.
The strategy is to checkpoint the state of each core at specific
intervals
of time in main memory. Once a core fails, its previous state is
retrieved
from the main memory, and the processes that were running on it are
rescheduled on the remaining cores.

I read that the OS tolerates faults in large servers. I need to make
it do
this for a Desktop OS. I assume I would have to change the scheduler
program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
How do I go about doing this? What exactly do I need to save for the
"state" of the core? What else do I need to know?
I have absolutely no experience with kernel programming or with FreeBSD.
Any pointers to good sources about modifying the source-code of FreeBSD
would be greatly appreciated.

This question has always intrigued me, because I'm always amazed
that people actually try.
From my viewpoint, There's really not much you can do if the core
that is currently holding the scheduler lock fails.
And what do you mean by 'fails"? do you run constant diagnostics?
how do you tell when it is failed? It'd be hard to detect that 'multiply'
has suddenly started giving bad results now and then.

if it just "stops" then you might be able to have a watchdog that
notices, but what do you do when it was half way through rearranging
a list of items? First, you have to find out that it held
the lock for the module and then you have to find out what it had
done and clean up the mess.

This requires rewriting many many parts of the kernel to remove
'transient inconsistent states". and even then, what do you do if it
was half way through manipulating some hardware..

and when you've figured that all out, how do you cope with the
mess it made because it was dying?
Say for example it had started calculating bad memory offsets
before writing out some stuff and written data out over random memory?

but I'm interested in any answers people may have



The only way I could see that it could be done, without direct hardware
support, would be to virtualize it similar to how valgrind works. You'll
take a speed hit bad enough to want to turn it off, but it could be
possible. Testing that it works well could just mean overclocking your
cpu until it starts crashing, and then seeing if it doesn't crash.




Sun/Fujitsu SPARC64 CPUs have had "mainframe class" memory mirroring,
end-to-end ECC protection, register ECC and hardware instruction retry
for many years now, for the exact reasons we are discussing here: fault
tolerance, (high) availability, etc. These features are typically called
RAS (reliability, availability and serviceability).



You can read more here:

http://www.fujitsu.com/global/services/computing/server/sparcenterprise/technology/availability/processor.html

/Uffe






xargs short-circuit

2012-02-14 Thread Matthew Story
After reading the man-page, and browsing around the internet for a minute,
I was just wondering if there is an option in (any) xargs to short-circuit
on first failure of [utility [arguments]].

e.g.

$ jot - 1 10 | xargs -e -n1 sh -c 'echo "$*"; echo exit 1' worker || echo $?
1
1

such that any non-0 exit code in a child process would cause xargs to stop
processing.  seems like this would be a nice feature to have.

-- 
regards,
matt


Re: OS support for fault tolerance

2012-02-14 Thread Eitan Adler
On Tue, Feb 14, 2012 at 12:05 PM, Jason Hellenthal  wrote:
> How about core redundancy ? effectively this would reduce the amount of
> available cores in half in you spread a process to run on two cores at
> the same time but with an option to adjust this per process etc... I
> don't see it as unfeasable.

There are a number of papers discussing core redundancy.  They pretty
much all work the same way: process the work on two different cores
(or verify some subset of the work on the second core), and wait for
both cores to return prior to the commit phase.

One example: www.eecs.umich.edu/~taustin/papers/MICRO32-diva.pdf
Another example: www.ee.duke.edu/~sorin/papers/ieeemicro08_argus.pdf

These don't use existing cores on a multi-core chip; instead they use a
"functional correctness" chip. I've seen designs that use the
former as well.
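
As a rough user-level analogy of the scheme described above (not the hardware mechanisms from the papers), the sketch below runs a deterministic command twice and only "commits" its output when both runs agree. The function name redundant_run is made up for illustration; on FreeBSD each run could in principle be pinned to a different core with cpuset(1), which is left out here to keep the sketch portable.

```shell
# Run the given command twice; print ("commit") its output only if both
# runs produced the same result. Assumes the command is deterministic.
redundant_run() {
    first=$("$@") || return 1
    second=$("$@") || return 1
    if [ "$first" = "$second" ]; then
        printf '%s\n' "$first"
    else
        echo "mismatch between redundant runs" >&2
        return 2
    fi
}

result=$(redundant_run expr 6 '*' 7)
echo "$result"
```

Note that this only catches faults that make the two runs disagree; it does nothing about the shared-state problems raised elsewhere in the thread.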


-- 
Eitan Adler


Re: OS support for fault tolerance

2012-02-14 Thread Rayson Ho
On Tue, Feb 14, 2012 at 11:57 AM, Julian Elischer  wrote:
> but I'm interested in any answers people may have

The way other OSes handle this is by detecting any abnormal number of
faults (sometimes it's not the fault of the hardware - e.g. when a
particle from outer space hits a core and flips a bit), then
disabling the core(s).

Solaris & mainframe (z/OS) handle it this way, but you should google
and find more info since I don't remember all the details.

Also, see this presentation: "Getting to know the Solaris Fault
Management Architecture (FMA)":
http://www.prefetch.net/presentations/SolarisFaultManagement_Presentation.pdf

Rayson

=
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/




-- 
Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/


Re: OS support for fault tolerance

2012-02-14 Thread Brandon Falk
On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>
> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>> On 2/14/12 6:23 AM, Maninya M wrote:
>>> For multicore desktop computers, suppose one of the cores fails, the
>>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>>> this hardware fault.
>>> The strategy is to checkpoint the state of each core at specific intervals
>>> of time in main memory. Once a core fails, its previous state is retrieved
>>> from the main memory, and the processes that were running on it are
>>> rescheduled on the remaining cores.
>>>
>>> I read that the OS tolerates faults in large servers. I need to make it do
>>> this for a Desktop OS. I assume I would have to change the scheduler
>>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>>> How do I go about doing this? What exactly do I need to save for the
>>> "state" of the core? What else do I need to know?
>>> I have absolutely no experience with kernel programming or with FreeBSD.
>>> Any pointers to good sources about modifying the source-code of FreeBSD
>>> would be greatly appreciated.
>> This question has always intrigued me, because I'm always amazed
>> that people actually try.
>>  From my viewpoint, There's really not much you can do if the core
>> that is currently holding the scheduler lock fails.
>> And what do you mean by 'fails"?  do you run constant diagnostics?
>> how do you tell when it is failed? It'd be hard to detect that 'multiply'
>> has suddenly started giving bad results now and then.
>>
>> if it just "stops" then you might be able to have a watchdog that
>> notices,  but what do you do when it was half way through rearranging
>> a list of items? First, you have to find out that it held
>> the lock for the module and then you have to find out what it had
>> done and clean up the mess.
>>
>> This requires rewriting many many parts of the kernel to remove
>> 'transient inconsistent states". and even then, what do you do if it
>> was half way through manipulating some hardware..
>>
>> and when you've figured that all out, how do you cope with the
>> mess it made because it was dying?
>> Say for example it had started calculating bad memory offsets
>> before writing out some stuff and written data out over random memory?
>>
>> but I'm interested in any answers people may have
>>
> How about core redundancy ? effectively this would reduce the amount of
> available cores in half in you spread a process to run on two cores at
> the same time but with an option to adjust this per process etc... I
> don't see it as unfeasable.
>

The overhead for all of the error checking and redundancy makes this idea pretty
impractical. You'd have to have 2 cores to do the exact same thing, then some
'master' core that makes sure they're doing the right stuff, and if you really
want to think about it... what if the core monitoring the cores fails... there's
a threshold of when redundancy gets pointless.

Perhaps I'm missing out on something, but you can't check the checker (without
infinite redundancy).

Honestly, if you're worried about a core failing, please take your server
cluster out of the 1000 deg C forge.

-Brandon


Re: OS support for fault tolerance

2012-02-14 Thread Joshua Isom

On 2/14/2012 10:57 AM, Julian Elischer wrote:

On 2/14/12 6:23 AM, Maninya M wrote:

For multicore desktop computers, suppose one of the cores fails, the
FreeBSD OS crashes. My question is about how I can make the OS tolerate
this hardware fault.
The strategy is to checkpoint the state of each core at specific
intervals
of time in main memory. Once a core fails, its previous state is
retrieved
from the main memory, and the processes that were running on it are
rescheduled on the remaining cores.

I read that the OS tolerates faults in large servers. I need to make
it do
this for a Desktop OS. I assume I would have to change the scheduler
program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
How do I go about doing this? What exactly do I need to save for the
"state" of the core? What else do I need to know?
I have absolutely no experience with kernel programming or with FreeBSD.
Any pointers to good sources about modifying the source-code of FreeBSD
would be greatly appreciated.

This question has always intrigued me, because I'm always amazed
that people actually try.
 From my viewpoint, There's really not much you can do if the core
that is currently holding the scheduler lock fails.
And what do you mean by 'fails"? do you run constant diagnostics?
how do you tell when it is failed? It'd be hard to detect that 'multiply'
has suddenly started giving bad results now and then.

if it just "stops" then you might be able to have a watchdog that
notices, but what do you do when it was half way through rearranging
a list of items? First, you have to find out that it held
the lock for the module and then you have to find out what it had
done and clean up the mess.

This requires rewriting many many parts of the kernel to remove
'transient inconsistent states". and even then, what do you do if it
was half way through manipulating some hardware..

and when you've figured that all out, how do you cope with the
mess it made because it was dying?
Say for example it had started calculating bad memory offsets
before writing out some stuff and written data out over random memory?

but I'm interested in any answers people may have



The only way I could see that it could be done, without direct hardware 
support, would be to virtualize it similar to how valgrind works. 
You'll take a speed hit bad enough to want to turn it off, but it could 
be possible.  Testing that it works well could just mean overclocking 
your cpu until it starts crashing, and then seeing if it doesn't crash.



Re: OS support for fault tolerance

2012-02-14 Thread mdf
On Tue, Feb 14, 2012 at 8:57 AM, Julian Elischer  wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
>>
>> For multicore desktop computers, suppose one of the cores fails, the
>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>> this hardware fault.
>> The strategy is to checkpoint the state of each core at specific intervals
>> of time in main memory. Once a core fails, its previous state is retrieved
>> from the main memory, and the processes that were running on it are
>> rescheduled on the remaining cores.
>>
>> I read that the OS tolerates faults in large servers. I need to make it do
>> this for a Desktop OS. I assume I would have to change the scheduler
>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>> How do I go about doing this? What exactly do I need to save for the
>> "state" of the core? What else do I need to know?
>> I have absolutely no experience with kernel programming or with FreeBSD.
>> Any pointers to good sources about modifying the source-code of FreeBSD
>> would be greatly appreciated.
>
> This question has always intrigued me, because I'm always amazed
> that people actually try.
> From my viewpoint, There's really not much you can do if the core
> that is currently holding the scheduler lock fails.

We did this at IBM after we'd done the dynamic logical partitioning.
Basically, there was a way to probe the CPU for the number of
correctable errors it was encountering.  At too high a threshold, it
was considered "faulty" and we offlined the CPU before it encountered
an uncorrectable error.

We did the same thing for memory, too (that one I was directly involved in).

The basic trouble, though, is that at least for memory, there didn't
seem to be a correlation between the rate of correctable ECC and an
uncorrectable error occurring.
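
The threshold policy described above can be sketched in a few lines of shell. This is only an illustration: the input format (one "cpu_id correctable_error_count" pair per line), the function name faulty_cpus, and the sample numbers are all assumptions, and no real FreeBSD or IBM interface for reading such counters is implied.

```shell
# Read hypothetical "cpu_id correctable_error_count" pairs from stdin and
# print the IDs of CPUs whose count exceeds the given threshold.
faulty_cpus() {
    threshold=$1
    while read -r cpu errors; do
        if [ "$errors" -gt "$threshold" ]; then
            echo "$cpu"
        fi
    done
}

# Made-up sample counters for a four-CPU machine.
sample='0 2
1 57
2 0
3 11'

flagged=$(printf '%s\n' "$sample" | faulty_cpus 10)
echo "$flagged"    # candidates to take offline before an uncorrectable error
```

Picking the threshold is the hard part; as noted above, the correctable-error rate did not seem to predict uncorrectable errors for memory.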

> And what do you mean by 'fails"?  do you run constant diagnostics?
> how do you tell when it is failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.

I'd assume this is predicated on the ability of the hardware to have
some redundancy and some way to query the error rate.  I've done a
little work with memory ECC on the device driver end, and at least
there the hardware definitely reports correctable and uncorrectable ECC
errors via some registers.  But I don't know if there's any way to query
this for a CPU (and of course each CPU would be different).

However, all that said, it's a moderately large project to get an OS
ready to handle things like holes appearing in its logical CPU ID
space (how do you serialize this when you want the common case to not
take a lock?), and to do all the wizardry of unscheduling (what do you
do with a bound thread?) and then actually shutting the CPU down via
firmware so it doesn't continue running.  I started working on this
for Linux when I worked at IBM, somewhere around 2004, and then IBM
got sued by SCO so they pulled me off the project.  It was finished up
by a colleague and friend.

You can probably come to a first approximation by forcing e.g. the
idle thread to not get switched out, when the CPU appears unstable.
Then at least it's running fewer instructions, and less likely to
generate a machine check.

Cheers,
matthew


9.0 observations

2012-02-14 Thread rank1seeker

OpenSSH:

After taking advantage of new 'KexAlgorithms'
# sshd -T | grep KexAlgorithms
will never show it ...


-
WiFi:
-
'media OFDM/54Mbps' breaks setup (supplied to 'ifconfig wlan0').
'ucastrate' and 'mcastrate' will set it instead.


-
gpart
-

On an MD vnode-based image of size:

1g or 2g:
# gpart create -s MBR md0
Will create a starting offset at sector 63
=> 63  2097089  md0  MBR  (1.0G)
=> 63  4194241  md0  MBR  (2.0G)

1432m:
# gpart create -s MBR md0
Will create a starting offset at sector 33
=> 33  2932703  md0  MBR  (1.4G)


NOW, looking at the new and interesting alignment flag (-a 4k) ...
I started adding slices with it. In BOTH of the above cases, all it really
does under MBR is take the INITIAL offset and simply STAMP it between
slices as free space, so NOTHING ends up aligned (neither offset nor size => mess!):
1g or 2g:
   63   63 - free -  (31k)
1432m:
   33   33 - free -  (16k)


However, with GPT, all is stable:
# gpart create ... always sets the offset to 34, regardless of image size
And (-a 4k) properly modifies BOTH the slice's 'offset' and 'size' to be
divisible by 8, with no remainder.

In case:
--
# gpart show -p md0
=>     34  2932669    md0  GPT  (1.4G)
       34     1024  md0p1  freebsd-boot  (512k)
     1058        6         - free -  (3.0k)
     1064   501760  md0p2  freebsd-ufs  (245M)
--
6 sectors were added to align md0p2's offset.
Here I have a question: is it true that the FIRST slice should always start
at a 1M offset (-b 1M), and if so, why?
Should I use (-b 1M) for the first and (-a 4k) for all other added slices?
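
The arithmetic behind the 6-sector gap can be checked with a small sketch. It assumes 512-byte sectors (consistent with 6 sectors being shown as 3.0k above); the helper name align_up is made up for illustration.

```shell
# Round a sector offset up to the next multiple of an alignment,
# with both values expressed in sectors.
align_up() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

sectors_per_4k=$(( 4096 / 512 ))             # 8 sectors, assuming 512-byte sectors
aligned=$(align_up 1058 "$sectors_per_4k")   # 1058 = 34 + 1024, end of md0p1
gap=$(( aligned - 1058 ))
one_meg=$(( 1024 * 1024 / 512 ))             # a 1M start expressed in sectors

echo "aligned offset: $aligned, free gap: $gap sectors"
echo "1M offset in sectors: $one_meg"
```

Under these assumptions the aligned offset comes out as 1064 with a 6-sector gap, matching md0p2 above, and a 1M start corresponds to sector 2048, which is also a multiple of 8.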


Finally, considering the MBR alignment issues described first:
how should one proceed to put an MBR on a 4k-sector disk?



Domagoj Smolčić



Re: OS support for fault tolerance

2012-02-14 Thread Jason Hellenthal


On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
> On 2/14/12 6:23 AM, Maninya M wrote:
> > For multicore desktop computers, suppose one of the cores fails, the
> > FreeBSD OS crashes. My question is about how I can make the OS tolerate
> > this hardware fault.
> > The strategy is to checkpoint the state of each core at specific intervals
> > of time in main memory. Once a core fails, its previous state is retrieved
> > from the main memory, and the processes that were running on it are
> > rescheduled on the remaining cores.
> >
> > I read that the OS tolerates faults in large servers. I need to make it do
> > this for a Desktop OS. I assume I would have to change the scheduler
> > program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
> > How do I go about doing this? What exactly do I need to save for the
> > "state" of the core? What else do I need to know?
> > I have absolutely no experience with kernel programming or with FreeBSD.
> > Any pointers to good sources about modifying the source-code of FreeBSD
> > would be greatly appreciated.
> This question has always intrigued me, because I'm always amazed
> that people actually try.
>  From my viewpoint, There's really not much you can do if the core
> that is currently holding the scheduler lock fails.
> And what do you mean by 'fails"?  do you run constant diagnostics?
> how do you tell when it is failed? It'd be hard to detect that 'multiply'
> has suddenly started giving bad results now and then.
> 
> if it just "stops" then you might be able to have a watchdog that
> notices,  but what do you do when it was half way through rearranging
> a list of items? First, you have to find out that it held
> the lock for the module and then you have to find out what it had
> done and clean up the mess.
> 
> This requires rewriting many many parts of the kernel to remove
> 'transient inconsistent states". and even then, what do you do if it
> was half way through manipulating some hardware..
> 
> and when you've figured that all out, how do you cope with the
> mess it made because it was dying?
> Say for example it had started calculating bad memory offsets
> before writing out some stuff and written data out over random memory?
> 
> but I'm interested in any answers people may have
> 

How about core redundancy? Effectively this would cut the number of
available cores in half if you spread a process to run on two cores at
the same time, but with an option to adjust this per process, etc. I
don't see it as infeasible.



Re: OS support for fault tolerance

2012-02-14 Thread Julian Elischer

On 2/14/12 6:23 AM, Maninya M wrote:

For multicore desktop computers, suppose one of the cores fails, the
FreeBSD OS crashes. My question is about how I can make the OS tolerate
this hardware fault.
The strategy is to checkpoint the state of each core at specific intervals
of time in main memory. Once a core fails, its previous state is retrieved
from the main memory, and the processes that were running on it are
rescheduled on the remaining cores.

I read that the OS tolerates faults in large servers. I need to make it do
this for a Desktop OS. I assume I would have to change the scheduler
program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
How do I go about doing this? What exactly do I need to save for the
"state" of the core? What else do I need to know?
I have absolutely no experience with kernel programming or with FreeBSD.
Any pointers to good sources about modifying the source-code of FreeBSD
would be greatly appreciated.

This question has always intrigued me, because I'm always amazed
that people actually try.
From my viewpoint, there's really not much you can do if the core
that is currently holding the scheduler lock fails.
And what do you mean by "fails"?  Do you run constant diagnostics?
How do you tell when it has failed? It'd be hard to detect that 'multiply'
has suddenly started giving bad results now and then.

if it just "stops" then you might be able to have a watchdog that
notices,  but what do you do when it was half way through rearranging
a list of items? First, you have to find out that it held
the lock for the module and then you have to find out what it had
done and clean up the mess.

This requires rewriting many, many parts of the kernel to remove
"transient inconsistent states". And even then, what do you do if it
was half way through manipulating some hardware?

and when you've figured that all out, how do you cope with the
mess it made because it was dying?
Say for example it had started calculating bad memory offsets
before writing out some stuff and written data out over random memory?

but I'm interested in any answers people may have




OS support for fault tolerance

2012-02-14 Thread Maninya M
For multicore desktop computers, suppose one of the cores fails, the
FreeBSD OS crashes. My question is about how I can make the OS tolerate
this hardware fault.
The strategy is to checkpoint the state of each core at specific intervals
of time in main memory. Once a core fails, its previous state is retrieved
from the main memory, and the processes that were running on it are
rescheduled on the remaining cores.

I read that the OS tolerates faults in large servers. I need to make it do
this for a Desktop OS. I assume I would have to change the scheduler
program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
How do I go about doing this? What exactly do I need to save for the
"state" of the core? What else do I need to know?
I have absolutely no experience with kernel programming or with FreeBSD.
Any pointers to good sources about modifying the source-code of FreeBSD
would be greatly appreciated.


quick question regarding libarchive

2012-02-14 Thread _
Hi,

Have any changes been made to libarchive between FreeBSD 7.0 and 8.2, and
is it possible that these changes cause a tar.gz file to be reported as
corrupted when running gzip --test archive.tar.gz?

When making my move from 7.0 to 8.2 I made backups, which tested fine
on 7.0. However, now these archives are destroyed, leaving me wondering
why.

Thanks

pancakeking79