Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]

2017-03-14 Thread Mark Millard
A single Byte access to a 4K Byte aligned region between
the fork and wait/sleep/swap-out prevents that specific
4K Byte region from having the (bad) zeros.

Sounds like a page sized unit of behavior to me.
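
A minimal sketch of that kind of per-page access, assuming the 4K Byte unit observed above. touch_each_page is a hypothetical helper and is not part of swap_testing.c; whether touching every page this way avoids the failure for larger regions is untested here.

```c
#include <stddef.h>  // for size_t

// touch_each_page is a hypothetical helper, not part of swap_testing.c.
// It reads one byte at every page_size stride through the region, plus
// the last byte in case the region is not page aligned, and returns the
// number of strided reads.
static size_t touch_each_page(const volatile unsigned char *base,
                              size_t len, size_t page_size)
{
    size_t reads = 0u;
    if (0u == page_size) return reads;  // avoid an endless loop
    for (size_t off = 0u; off < len; off += page_size) {
        unsigned char b = base[off];    // volatile read, value unused
        (void)b;
        reads++;
    }
    if (0u < len) { unsigned char b = base[len - 1u]; (void)b; }
    return reads;
}
```

In the test program this could be called in the child right after fork(), for example as touch_each_page((*dyn_region).array, region_size, 4096u). The page size could instead be taken from sysconf(_SC_PAGESIZE).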

Details follow.

On 2017-Mar-14, at 3:28 PM, Mark Millard  wrote:

> [test_check() between the fork and the wait/sleep prevents the
> failure from occurring. Even a small access to the memory at
> that stage prevents the failure. Details follow.]
> 
> On 2017-Mar-14, at 11:07 AM, Mark Millard  wrote:
> 
>> [This is just a correction to the subject-line text to say arm64
>> instead of amd64.]
>> 
>> On 2017-Mar-14, at 12:58 AM, Mark Millard  wrote:
>> 
>> [Another correction I'm afraid --about alternative program variations
>> this time.]
>> 
>> On 2017-Mar-13, at 11:52 PM, Mark Millard  wrote:
>> 
>>> I'm still at a loss about how to figure out what stages are messed
>>> up. (Memory coherency? Some memory not swapped out? Bad data swapped
>>> out? Wrong data swapped in?)
>>> 
>>> But at least I've found a much smaller/simpler example that demonstrates
>>> a problem in my Pine64+ 2GB context.
>>> 
>>> The Pine64+ 2GB is the only amd64 context that I have access to.
>> 
>> Someday I'll learn to type arm64 the first time instead of amd64.
>> 
>>> The following program fails its check for data
>>> having its expected byte pattern in dynamically
>>> allocated memory after a fork/swap-out/swap-in
>>> sequence.
>>> 
>>> I'll note that the program sleeps for 60s after
>>> forking to give time to do something else to
>>> cause the parent and child processes to swap
>>> out (RES=0 as seen in top).
>> 
>> The following about the extra test_check() was
>> wrong.
>> 
>>> Note the source code line:
>>> 
>>> // test_check(); // Adding this line prevents failure.
>>> 
>>> It seems that accessing the region contents before forking
>>> and swapping avoids the problem. But there is a problem
>>> if the region was only written-to before the fork/swap.
> 
> There is one place where adding a test_check call keeps the
> problem from happening at any stage: between the fork and the
> later wait/sleep code:

I changed the byte sequence patterns to avoid
zero values since the bad values are zeros:

static value_type value(size_t v) { return (value_type)((v&0xFEu)|0x1u); }
  // value now avoids the zero value since the failures
  // are zeros.

With that I can then test accurately what bytes have
bad values vs. do not. I also changed to:

void partial_test_check(void) {
if (value(0u)!=gbl_region.array[0]) raise(SIGABRT);
if (value(0u)!=(*dyn_region).array[0]) raise(SIGABRT);
}

since previously [0] had a zero value and so I'd used [1].
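
With a pattern that never produces zero, any zero byte found later is a bad value, so the extent of the damage can be located directly. The following is only a sketch under that assumption: nonzero_value mirrors the adjusted value() above (returning a plain unsigned char), and first_zero_run is a hypothetical helper that is not in swap_testing.c.

```c
#include <stddef.h>  // for size_t

typedef volatile unsigned char value_type;

// Mirrors the adjusted value(): the low bit is forced on, so the
// result is never zero.
static unsigned char nonzero_value(size_t v)
{
    return (unsigned char)((v & 0xFEu) | 0x1u);
}

// Hypothetical helper: find the first run of zero bytes in a[0..n).
// Stores the start index in *start (n if there is no zero byte) and
// returns the length of the run (0 if there is none).
static size_t first_zero_run(const value_type *a, size_t n, size_t *start)
{
    size_t i = 0u;
    while (i < n && a[i] != 0u) i++;
    *start = i;
    size_t len = 0u;
    while (i + len < n && a[i + len] == 0u) len++;
    return len;
}
```

A start index and length that are both multiples of 4096 would fit the page sized unit of behavior described above.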

On this basis I'm now using the below. See the comments tied
to partial_test_check() calls:

extern void test_setup(void); // Sets up the memory byte patterns.
extern void test_check(void); // Tests the memory byte patterns.
extern void partial_test_check(void); // Tests just [0] of each region
  // (gbl_region and dyn_region).

int main(void) {
test_setup();
test_check(); // Before fork() [passes]

pid_t pid = fork();
int wait_status = 0;

// After fork; before wait/sleep/swap-out.

if (0==pid) partial_test_check();
 // Even the above is sufficient by
 // itself to prevent failure for
 // region_size 1u through
 // 4u*1024u!
 // But 4u*1024u+1u and above fail
 // with this access to memory.
 // The failing test is of
 // (*dyn_region).array[4096u].
 // This test never fails here.

if (0

> This suggests to me that the small access is forcing one or more things to
> be initialized for memory access that fork is not establishing of itself.
> It appears that if established correctly then the swap-out/swap-in
> sequence would work okay without needing the manual access to the memory.
> 
> 
> So far via this test I've not seen any evidence 

IPFW kernel build failing on 11.0-STABLE

2017-03-14 Thread Steven Borrelli via freebsd-stable
Tried building IPFW into my kernel and it failed midway with this:

cc  -c -O2 -pipe -fno-strict-aliasing  -g -nostdinc  -I.
-I/usr/src/sys -I/usr/src/sys/contrib/libfdt -D_KERNEL
-DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h  -fno-omit-frame-pointer
-mno-omit-leaf-frame-pointer -MD  -MF.depend.ppb_1284.o
-MTppb_1284.o -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse
-msoft-float  -fno-asynchronous-unwind-tables -ffreestanding
-fwrapv -fstack-protector -gdwarf-2 -Wall -Wredundant-decls
-Wnested-externs -Wstrict-prototypes  -Wmissing-prototypes
-Wpointer-arith -Winline -Wcast-qual  -Wundef
-Wno-pointer-sign -D__printf__=__freebsd_kprintf__  -Wmissing-include-dirs
-fdiagnostics-show-option  -Wno-unknown-pragmas
-Wno-error-tautological-compare -Wno-error-empty-body
-Wno-error-parentheses-equality -Wno-error-unused-function
-Wno-error-pointer-sign -Wno-error-shift-negative-value  -mno-aes
-mno-avx  -std=iso9899:1999 -Werror  /usr/src/sys/dev/ppbus/ppb_1284.c
/usr/src/sys/dev/ppbus/ppb_1284.c:296:46: error: implicit conversion from 'int' to 'char' changes value from 144 to -112 [-Werror,-Wconstant-conversion]
if ((error = do_peripheral_wait(bus, SELECT | nBUSY, 0))) {
 ~~  ~~~^~~
/usr/src/sys/dev/ppbus/ppb_1284.c:785:48: error: implicit conversion from 'int' to 'char' changes value from 240 to -16 [-Werror,-Wconstant-conversion]
if (do_1284_wait(bus, nACK | SELECT | PERROR | nBUSY,
  ~~~^~~
/usr/src/sys/dev/ppbus/ppb_1284.c:786:29: error: implicit conversion from 'int' to 'char' changes value from 240 to -16 [-Werror,-Wconstant-conversion]
nACK | SELECT | PERROR | nBUSY)) {
~~~^~~
/usr/src/sys/dev/ppbus/ppb_1284.c:841:37: error: implicit conversion from 'int' to 'char' changes value from 200 to -56 [-Werror,-Wconstant-conversion]
if (do_1284_wait(bus, nACK | nBUSY | nFAULT, nFAULT)) {
  ~^~~~
4 errors generated.
*** Error code 1
Stop.
make[2]: stopped in /usr/obj/usr/src/sys/IPFW
*** Error code 1
Stop.
make[1]: stopped in /usr/src
*** Error code 1
Stop.
make: stopped in /usr/src
q
[1]Exit 1( make buildkernel KERNCONF=IPFW
&& make installkernel KERNCONF=IPFW )
root@:/usr/src # uname -a
FreeBSD  11.0-STABLE FreeBSD 11.0-STABLE #0 r314941: Thu Mar  9
19:39:31 UTC 2017
r...@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
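
For reference, all four errors above are the same kind of narrowing: an int constant above 127 is passed where the compiler expects a char, and plain char is signed on amd64. A minimal sketch of the arithmetic behind the reported values, assuming an 8-bit two's-complement char; as_signed_char is a hypothetical helper for illustration and is not taken from ppb_1284.c:

```c
#include <limits.h>  // for SCHAR_MAX

// Hypothetical helper: the value that an 8-bit two's-complement signed
// char holds after a constant in the range 0..255 is stored in it.
static int as_signed_char(int v)
{
    return (v > SCHAR_MAX) ? v - 256 : v;
}
```

With this, 144 maps to -112, 240 to -16 and 200 to -56, matching the diagnostics. The failing file is under /usr/src/sys/dev/ppbus, so the errors come from the ppbus driver being compiled into this kernel configuration rather than from the IPFW sources.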
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]

2017-03-14 Thread Mark Millard
On 2017-Mar-14, at 4:44 PM, Bernd Walter  wrote:

> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>> [test_check() between the fork and the wait/sleep prevents the
>> failure from occurring. Even a small access to the memory at
>> that stage prevents the failure. Details follow.]
> 
> Maybe a stupid question, since you might have written it somewhere.
> What medium do you swap to?
> I've seen broken firmware on microSD cards doing silent data
> corruption for some access patterns.

The root filesystem is on a USB SSD on a powered hub.

Only the kernel is from the microSD card.

I have several examples of the USB SSD model and have
never observed such problems in any other context.


The original issue that started this investigation
has been reported by several people on the lists:

Failed assertion: "tsd_booted"

on arm64 specifically, no other contexts so far as
I know. Earlier I had discovered that:

A) I could use a swap-in to cause the messages from
   instances of sh or su that had swapped out earlier.

B) The core dumps showed that a large memory region
   containing the global tsd_booted had turned entirely
   to zero bytes. The assert is just exposing one
   of those zeros. (tsd_booted is from jemalloc, which
   is in a .so that is loaded.)

This prompted me to look for simpler contexts involving
swapping that also show memory corruption.

So far I've only managed to produce corrupted memory when
fork and later swapping are both involved. Being a shared
library global is not a requirement for the problem,
although such contexts can have an issue. I've not made a
simpler example of that yet, although I tried.

I have not explored vfork, rfork, or any other alternatives.

So far all failure examples end up with zeroed memory when
the memory does not match the original pattern from before
the fork. At least that is what the core dumps show for all
examples that I've looked at.

See bugzilla 217138 and 217239. In some respects this example
is more analogous to the 217239 context as I remember.

My tests on amd64, armv6 (really -mcpu=cortex-a7 so armv7),
and powerpc64 have never produced any problems, including
never getting the failed assertion. Only arm64. (But I've
access to only one arm64 system, a Pine64+ 2GB.)

Prior to this I tracked down a different arm64 problem to
the fork_trampoline code (for the child process) modifying
a system register but in a place allowing interrupts that
could also change the value. Andrew Turner fixed that
one at the time.

For this fork/swapping kind of issue I'm not sure that
I'll be able to do more than provide the simpler
example and the steps that I used. My isolating the
internal stage(s) and specific problem(s) at the code
level of detail does not seem likely.

But whatever is found needs to be able to explain the
contrast with an access after the fork but before the
swap preventing the failing behavior. So what I've got
so far hopefully does provide some hints to someone.

===
Mark Millard
markmi at dsl-only.net



Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]

2017-03-14 Thread Bernd Walter
On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
> [test_check() between the fork and the wait/sleep prevents the
> failure from occurring. Even a small access to the memory at
> that stage prevents the failure. Details follow.]

Maybe a stupid question, since you might have written it somewhere.
What medium do you swap to?
I've seen broken firmware on microSD cards doing silent data
corruption for some access patterns.

-- 
B.Walter  http://www.bwct.de
Modbus/TCP Ethernet I/O Baugruppen, ARM basierte FreeBSD Rechner uvm.


Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]

2017-03-14 Thread Mark Millard
[test_check() between the fork and the wait/sleep prevents the
failure from occurring. Even a small access to the memory at
that stage prevents the failure. Details follow.]

On 2017-Mar-14, at 11:07 AM, Mark Millard  wrote:

> [This is just a correction to the subject-line text to say arm64
> instead of amd64.]
> 
> On 2017-Mar-14, at 12:58 AM, Mark Millard  wrote:
> 
> [Another correction I'm afraid --about alternative program variations
> this time.]
> 
> On 2017-Mar-13, at 11:52 PM, Mark Millard  wrote:
> 
>> I'm still at a loss about how to figure out what stages are messed
>> up. (Memory coherency? Some memory not swapped out? Bad data swapped
>> out? Wrong data swapped in?)
>> 
>> But at least I've found a much smaller/simpler example that demonstrates
>> a problem in my Pine64+ 2GB context.
>> 
>> The Pine64+ 2GB is the only amd64 context that I have access to.
> 
> Someday I'll learn to type arm64 the first time instead of amd64.
> 
>> The following program fails its check for data
>> having its expected byte pattern in dynamically
>> allocated memory after a fork/swap-out/swap-in
>> sequence.
>> 
>> I'll note that the program sleeps for 60s after
>> forking to give time to do something else to
>> cause the parent and child processes to swap
>> out (RES=0 as seen in top).
> 
> The following about the extra test_check() was
> wrong.
> 
>> Note the source code line:
>> 
>> // test_check(); // Adding this line prevents failure.
>> 
>> It seems that accessing the region contents before forking
>> and swapping avoids the problem. But there is a problem
>> if the region was only written-to before the fork/swap.

There is one place where adding a test_check call keeps the
problem from happening at any stage: between the fork and the
later wait/sleep code:

int main(void) {
test_setup();
test_check(); // Before fork() [passes]

pid_t pid = fork();
int wait_status = 0;

// test_check(); // After fork(); before wait/sleep. 
 // If used it prevents failure later!

if (0

arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]

2017-03-14 Thread Mark Millard
[This is just a correction to the subject-line text to say arm64
instead of amd64.]

On 2017-Mar-14, at 12:58 AM, Mark Millard  wrote:

[Another correction I'm afraid --about alternative program variations
this time.]

On 2017-Mar-13, at 11:52 PM, Mark Millard  wrote:

> I'm still at a loss about how to figure out what stages are messed
> up. (Memory coherency? Some memory not swapped out? Bad data swapped
> out? Wrong data swapped in?)
> 
> But at least I've found a much smaller/simpler example that demonstrates
> a problem in my Pine64+ 2GB context.
> 
> The Pine64+ 2GB is the only amd64 context that I have access to.

Someday I'll learn to type arm64 the first time instead of amd64.

> The following program fails its check for data
> having its expected byte pattern in dynamically
> allocated memory after a fork/swap-out/swap-in
> sequence.
> 
> I'll note that the program sleeps for 60s after
> forking to give time to do something else to
> cause the parent and child processes to swap
> out (RES=0 as seen in top).

The following about the extra test_check() was
wrong.

> Note the source code line:
> 
>  // test_check(); // Adding this line prevents failure.
> 
> It seems that accessing the region contents before forking
> and swapping avoids the problem. But there is a problem
> if the region was only written-to before the fork/swap.

This was because I'd carelessly moved some loop variables to
globals in a way that depended on the initialization of the
globals and the extra call changed those values.
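
A reduced sketch of that mistake, using hypothetical names (demo_data, demo_pos, demo_size) rather than the ones in swap_testing.c: once a first call has run the global position to the end, a later call without a reset checks nothing and so cannot report a failure.

```c
#include <stdbool.h>  // for bool, true, false
#include <stddef.h>   // for size_t

#define demo_size 8u  // hypothetical small size, for illustration only

static unsigned char demo_data[demo_size];
static size_t demo_pos = 0u;  // global loop variable

// Form that depends on demo_pos still having its initial value of 0.
static bool check_without_reset(void)
{
    bool failed = false;
    while (!failed && demo_pos < demo_size) {
        failed = (demo_data[demo_pos] != (unsigned char)(demo_pos + 1u));
        demo_pos++;
    }
    return failed;
}

// Adjusted form: reset the global position on entry to every call.
static bool check_with_reset(void)
{
    demo_pos = 0u;
    return check_without_reset();
}
```

This is why an extra call before the fork appeared to prevent the failure: the later call had nothing left to examine.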

I've noted code adjustments below (3 lines). I get the failures
with them as well.

> Another point is the size of the region matters: <= 14K Bytes
> fails and > 14K Bytes works, as far as I have tested.
> 
> 
> # more swap_testing.c
> // swap_testing.c
> 
> // Built via (c++ was clang++ 4.0 in my case):
> //
> // cc -g -std=c11 -Wpedantic swap_testing.c
> // -O0 and -O2 also gets the problem.
> 
> #include <unistd.h>    // for fork(), sleep(.)
> #include <sys/types.h> // for pid_t
> #include <sys/wait.h>  // for wait(.)
> 
> extern void test_setup(void); // Sets up the memory byte pattern.
> extern void test_check(void); // Tests the memory byte pattern.
> 
> int main(void)
> {
>  test_setup();
  test_check(); // This test passes.
> 
>  pid_t pid = fork();
>  int wait_status = 0;
> 
>  if (0<pid) { wait(&wait_status); }
> 
>  if (-1!=wait_status && 0<=pid)
>  {
>  if (0==pid)
>  {
>  sleep(60);
> 
>  // During this manually force this process to
>  // swap out. I use something like:
> 
>  // stress -m 1 --vm-bytes 1800M
> 
>  // in another shell and ^C'ing it after top
>  // shows the swapped status desired. 1800M
>  // just happened to work on the Pine64+ 2GB
>  // that I was using.
>  }
> 
>  test_check();
>  }
> }
> 
> // The memory and test code follows.
> 
> #include <stdbool.h> // for bool, true, false
> #include <stddef.h>  // for size_t, NULL
> #include <stdlib.h>  // for malloc(.), free(.)
> 
> #include <signal.h>  // for raise(.), SIGABRT
> 
> #define region_size (14u*1024u)
>  // Bad dyn_region pattern, parent and child
>  // processes:
>  //  256u, 4u*1024u, 8u*1024u, 9u*1024u,
>  // 12u*1024u, 14u*1024u
> 
>  // Works:
>  // 14u*1024u+1u, 15u*1024u, 16u*1024u,
>  // 32u*1024u, 256u*1024u*1024u
> 
> typedef volatile unsigned char value_type;
> 
> struct region_struct { value_type array[region_size]; };
> typedef struct region_struct region;
> 
> static region gbl_region;
> static region * volatile dyn_region = NULL;
> 
> static value_type value(size_t v) { return (value_type)v; }
> 
> void test_setup(void) {
>  dyn_region = malloc(sizeof(region));
>  if (!dyn_region) raise(SIGABRT);
> 
>  for(size_t i=0u; i<region_size; i++) {
>  (*dyn_region).array[i] = gbl_region.array[i] = value(i);
>  }
> }
> 
> static volatile bool gbl_failed = false; // Until potentially disproved
> static volatile size_t gbl_pos = 0u;
> 
> static volatile bool dyn_failed = false; // Until potentially disproved
> static volatile size_t dyn_pos = 0u;
> 
> void test_check(void) {
  gbl_pos = 0u;
>  while (!gbl_failed && gbl_pos<region_size) {
>  gbl_failed = (value(gbl_pos) != gbl_region.array[gbl_pos]);
>  gbl_pos++;
>  }
> 
  dyn_pos = 0u;
>  while (!dyn_failed && dyn_pos<region_size) {
>  dyn_failed = (value(dyn_pos) != (*dyn_region).array[dyn_pos]);
>  // Note: When the memory pattern fails, this is the case that
>  //   records the failure.
>  dyn_pos++;
>  }
> 
>  if (gbl_failed) raise(SIGABRT);
>  if (dyn_failed) raise(SIGABRT); // lldb reports this line for the __raise call.
>  // when it fails (both parent and child processes).
> }

I'm not bothering to redo the details below for the
line number variations.

> Other details from lldb 

[Bug 213903] Kernel crashes from turnstile_broadcast (/usr/src/sys/kern/subr_turnstile.c:837)

2017-03-14 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=213903

--- Comment #17 from Cassiano Peixoto  ---
(In reply to Mateusz Guzik from comment #16)
Hi Mateusz,

Sorry, but I can't try this patch; I had to roll back to the old kernel to avoid
crashes. It's a production server and I can't let it go down. :(

-- 
You are receiving this mail because:
You are on the CC list for the bug.


Re: moutnroot failing on zpools in Azure after upgrade from 10 to 11 due to lack of waiting for da0

2017-03-14 Thread Pete French
> Are you sure the above transcript is right?  There are three reasons
> I'm asking.  First, you'll see the "Root mount waiting" message,
> which means the root mount code is, well, waiting for storvsc, exactly
> as expected.  Second - there is no "Trying to mount root".  But most
> of all - for some reason the "Mounting failed" is shown _before_ the
> "Root mount waiting", and I have no idea how this could ever happen.

OK, that's interesting, and kind of worrying! I believe it's correct - I
have put the full transcript up here for you so you can see all of it:

https://www.twisted.org.uk/~pete/914893a3-249e-4a91-851c-f467fc185eec.txt

I am assuming that Azure's capturing of the output is correct.

-pete.




Re: amd64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context)

2017-03-14 Thread Mark Millard
[Another correction I'm afraid --about alternative program variations
this time.]

On 2017-Mar-13, at 11:52 PM, Mark Millard  wrote:

> I'm still at a loss about how to figure out what stages are messed
> up. (Memory coherency? Some memory not swapped out? Bad data swapped
> out? Wrong data swapped in?)
> 
> But at least I've found a much smaller/simpler example that demonstrates
> a problem in my Pine64+ 2GB context.
> 
> The Pine64+ 2GB is the only amd64 context that I have access to.

Someday I'll learn to type arm64 the first time instead of amd64.

> The following program fails its check for data
> having its expected byte pattern in dynamically
> allocated memory after a fork/swap-out/swap-in
> sequence.
> 
> I'll note that the program sleeps for 60s after
> forking to give time to do something else to
> cause the parent and child processes to swap
> out (RES=0 as seen in top).

The following about the extra test_check() was
wrong.

> Note the source code line:
> 
>   // test_check(); // Adding this line prevents failure.
> 
> It seems that accessing the region contents before forking
> and swapping avoids the problem. But there is a problem
> if the region was only written-to before the fork/swap.

This was because I'd carelessly moved some loop variables to
globals in a way that depended on the initialization of the
globals and the extra call changed those values.

I've noted code adjustments below (3 lines). I get the failures
with them as well.

> Another point is the size of the region matters: <= 14K Bytes
> fails and > 14K Bytes works, as far as I have tested.
> 
> 
> # more swap_testing.c
> // swap_testing.c
> 
> // Built via (c++ was clang++ 4.0 in my case):
> //
> // cc -g -std=c11 -Wpedantic swap_testing.c
> // -O0 and -O2 also gets the problem.
> 
> #include <unistd.h>    // for fork(), sleep(.)
> #include <sys/types.h> // for pid_t
> #include <sys/wait.h>  // for wait(.)
> 
> extern void test_setup(void); // Sets up the memory byte pattern.
> extern void test_check(void); // Tests the memory byte pattern.
> 
> int main(void)
> {
>   test_setup();
   test_check(); // This test passes.
> 
>   pid_t pid = fork();
>   int wait_status = 0;
> 
>   if (0<pid) { wait(&wait_status); }
> 
>   if (-1!=wait_status && 0<=pid)
>   {
>   if (0==pid)
>   {
>   sleep(60);
> 
>   // During this manually force this process to
>   // swap out. I use something like:
> 
>   // stress -m 1 --vm-bytes 1800M
> 
>   // in another shell and ^C'ing it after top
>   // shows the swapped status desired. 1800M
>   // just happened to work on the Pine64+ 2GB
>   // that I was using.
>   }
> 
>   test_check();
>   }
> }
> 
> // The memory and test code follows.
> 
> #include <stdbool.h> // for bool, true, false
> #include <stddef.h>  // for size_t, NULL
> #include <stdlib.h>  // for malloc(.), free(.)
> 
> #include <signal.h>  // for raise(.), SIGABRT
> 
> #define region_size (14u*1024u)
>   // Bad dyn_region pattern, parent and child
>   // processes:
>   //  256u, 4u*1024u, 8u*1024u, 9u*1024u,
>   // 12u*1024u, 14u*1024u
> 
>   // Works:
>   // 14u*1024u+1u, 15u*1024u, 16u*1024u,
>   // 32u*1024u, 256u*1024u*1024u
> 
> typedef volatile unsigned char value_type;
> 
> struct region_struct { value_type array[region_size]; };
> typedef struct region_struct region;
> 
> static region gbl_region;
> static region * volatile dyn_region = NULL;
> 
> static value_type value(size_t v) { return (value_type)v; }
> 
> void test_setup(void) {
>   dyn_region = malloc(sizeof(region));
>   if (!dyn_region) raise(SIGABRT);
> 
>   for(size_t i=0u; i<region_size; i++) {
>   (*dyn_region).array[i] = gbl_region.array[i] = value(i);
>   }
> }
> 
> static volatile bool gbl_failed = false; // Until potentially disproved
> static volatile size_t gbl_pos = 0u;
> 
> static volatile bool dyn_failed = false; // Until potentially disproved
> static volatile size_t dyn_pos = 0u;
> 
> void test_check(void) {
   gbl_pos = 0u;
>   while (!gbl_failed && gbl_pos<region_size) {
>   gbl_failed = (value(gbl_pos) != gbl_region.array[gbl_pos]);
>   gbl_pos++;
>   }
> 
   dyn_pos = 0u;
>   while (!dyn_failed && dyn_pos<region_size) {
>   dyn_failed = (value(dyn_pos) != (*dyn_region).array[dyn_pos]);
>   // Note: When the memory pattern fails, this is the case that
>   //   records the failure.
>   dyn_pos++;
>   }
> 
>   if (gbl_failed) raise(SIGABRT);
>   if (dyn_failed) raise(SIGABRT); // lldb reports this line for the __raise call.
>   // when it fails (both parent and child processes).
> }

I'm not bothering to redo the details below for the
line number variations.

> Other details from lldb (not using -O2 so things are
> simpler, not presented in the order examined):
> 
> # lldb a.out -c 

Re: amd64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context)

2017-03-14 Thread Mark Millard

On 2017-Mar-13, at 11:52 PM, Mark Millard  wrote:

> I'm still at a loss about how to figure out what stages are messed
> up. (Memory coherency? Some memory not swapped out? Bad data swapped
> out? Wrong data swapped in?)
> 
> But at least I've found a much smaller/simpler example that demonstrates
> a problem in my Pine64+ 2GB context.
> 
> The Pine64+ 2GB is the only amd64 context that I have access to.

Someday I'll learn to type arm64 the first time instead of amd64.

> The following program fails its check for data
> having its expected byte pattern in dynamically
> allocated memory after a fork/swap-out/swap-in
> sequence.
> 
> I'll note that the program sleeps for 60s after
> forking to give time to do something else to
> cause the parent and child processes to swap
> out (RES=0 as seen in top).
> 
> Note the source code line:
> 
>// test_check(); // Adding this line prevents failure.
> 
> It seems that accessing the region contents before forking
> and swapping avoids the problem. But there is a problem
> if the region was only written-to before the fork/swap.
> 
> Another point is the size of the region matters: <= 14K Bytes
> fails and > 14K Bytes works, as far as I have tested.
> 
> 
> # more swap_testing.c
> // swap_testing.c
> 
> // Built via (c++ was clang++ 4.0 in my case):
> //
> // cc -g -std=c11 -Wpedantic swap_testing.c
> // -O0 and -O2 also gets the problem.
> 
> #include <unistd.h>    // for fork(), sleep(.)
> #include <sys/types.h> // for pid_t
> #include <sys/wait.h>  // for wait(.)
> 
> extern void test_setup(void); // Sets up the memory byte pattern.
> extern void test_check(void); // Tests the memory byte pattern.
> 
> int main(void)
> {
>test_setup();
>// test_check(); // Adding this line prevents failure.
> 
>pid_t pid = fork();
>int wait_status = 0;
> 
>if (0<pid) { wait(&wait_status); }
> 
>if (-1!=wait_status && 0<=pid)
>{
>if (0==pid)
>{
>sleep(60);
> 
>// During this manually force this process to
>// swap out. I use something like:
> 
>// stress -m 1 --vm-bytes 1800M
> 
>// in another shell and ^C'ing it after top
>// shows the swapped status desired. 1800M
>// just happened to work on the Pine64+ 2GB
>// that I was using.
>}
> 
>test_check();
>}
> }
> 
> // The memory and test code follows.
> 
> #include <stdbool.h> // for bool, true, false
> #include <stddef.h>  // for size_t, NULL
> #include <stdlib.h>  // for malloc(.), free(.)
> 
> #include <signal.h>  // for raise(.), SIGABRT
> 
> #define region_size (14u*1024u)
>// Bad dyn_region pattern, parent and child
>// processes:
>//  256u, 4u*1024u, 8u*1024u, 9u*1024u,
>// 12u*1024u, 14u*1024u
> 
>// Works:
>// 14u*1024u+1u, 15u*1024u, 16u*1024u,
>// 32u*1024u, 256u*1024u*1024u
> 
> typedef volatile unsigned char value_type;
> 
> struct region_struct { value_type array[region_size]; };
> typedef struct region_struct region;
> 
> static region gbl_region;
> static region * volatile dyn_region = NULL;
> 
> static value_type value(size_t v) { return (value_type)v; }
> 
> void test_setup(void) {
>dyn_region = malloc(sizeof(region));
>if (!dyn_region) raise(SIGABRT);
> 
>for(size_t i=0u; i<region_size; i++) {
>(*dyn_region).array[i] = gbl_region.array[i] = value(i);
>}
> }
> 
> static volatile bool gbl_failed = false; // Until potentially disproved
> static volatile size_t gbl_pos = 0u;
> 
> static volatile bool dyn_failed = false; // Until potentially disproved
> static volatile size_t dyn_pos = 0u;
> 
> void test_check(void) {
>while (!gbl_failed && gbl_pos<region_size) {
>gbl_failed = (value(gbl_pos) != gbl_region.array[gbl_pos]);
>gbl_pos++;
>}
> 
>while (!dyn_failed && dyn_pos<region_size) {
>dyn_failed = (value(dyn_pos) != (*dyn_region).array[dyn_pos]);
>// Note: When the memory pattern fails, this is the case that
>//   records the failure.
>dyn_pos++;
>}
> 
>if (gbl_failed) raise(SIGABRT);
>if (dyn_failed) raise(SIGABRT); // lldb reports this line for the __raise call.
>// when it fails (both parent and child processes).
> }
> 
> 
> Other details from lldb (not using -O2 so things are
> simpler, not presented in the order examined):
> 
> # lldb a.out -c /var/crash/a.out.11575.core
> (lldb) target create "a.out" --core "/var/crash/a.out.11575.core"
> Core file '/var/crash/a.out.11575.core' (aarch64) was loaded.
> (lldb) bt
> * thread #1, name = 'a.out', stop reason = signal SIGABRT
>  * frame #0: 0x40113d38 libc.so.7`_thr_kill + 8
>frame #1: libc.so.7`__raise(s=6) at raise.c:52
>frame #2: a.out`test_check at swap_testing.c:103
>frame #3: a.out`main at 

amd64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context)

2017-03-14 Thread Mark Millard
I'm still at a loss about how to figure out what stages are messed
up. (Memory coherency? Some memory not swapped out? Bad data swapped
out? Wrong data swapped in?)

But at least I've found a much smaller/simpler example that demonstrates
a problem in my Pine64+ 2GB context.

The Pine64+ 2GB is the only amd64 context that I have access to.


The following program fails its check for data
having its expected byte pattern in dynamically
allocated memory after a fork/swap-out/swap-in
sequence.

I'll note that the program sleeps for 60s after
forking to give time to do something else to
cause the parent and child processes to swap
out (RES=0 as seen in top).

Note the source code line:

// test_check(); // Adding this line prevents failure.

It seems that accessing the region contents before forking
and swapping avoids the problem. But there is a problem
if the region was only written-to before the fork/swap.

Another point is the size of the region matters: <= 14K Bytes
fails and > 14K Bytes works, as far as I have tested.


# more swap_testing.c
// swap_testing.c

// Built via (c++ was clang++ 4.0 in my case):
//
// cc -g -std=c11 -Wpedantic swap_testing.c
// -O0 and -O2 also gets the problem.

#include <unistd.h>    // for fork(), sleep(.)
#include <sys/types.h> // for pid_t
#include <sys/wait.h>  // for wait(.)

extern void test_setup(void); // Sets up the memory byte pattern.
extern void test_check(void); // Tests the memory byte pattern.

int main(void)
{
test_setup();
// test_check(); // Adding this line prevents failure.

pid_t pid = fork();
int wait_status = 0;

if (0

   103  if (dyn_failed) raise(SIGABRT); // lldb reports this line for the __raise call.
   104  // when it