Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
A single byte access to a 4 KByte aligned region between the fork and the wait/sleep/swap-out prevents that specific 4 KByte region from having the (bad) zeros. Sounds like a page-sized unit of behavior to me. Details follow.

On 2017-Mar-14, at 3:28 PM, Mark Millard wrote:

> [test_check() between the fork and the wait/sleep prevents the
> failure from occurring. Even a small access to the memory at
> that stage prevents the failure. Details follow.]
>
> On 2017-Mar-14, at 11:07 AM, Mark Millard wrote:
>
>> [This is just a correction to the subject-line text to say arm64
>> instead of amd64.]
>>
>> On 2017-Mar-14, at 12:58 AM, Mark Millard wrote:
>>
>> [Another correction I'm afraid --about alternative program variations
>> this time.]
>>
>> On 2017-Mar-13, at 11:52 PM, Mark Millard wrote:
>>
>>> I'm still at a loss about how to figure out what stages are messed
>>> up. (Memory coherency? Some memory not swapped out? Bad data swapped
>>> out? Wrong data swapped in?)
>>>
>>> But at least I've found a much smaller/simpler example to demonstrate
>>> some problem with in my Pine64+_ 2GB context.
>>>
>>> The Pine64+ 2GB is the only amd64 context that I have access to.
>>
>> Someday I'll learn to type arm64 the first time instead of amd64.
>>
>>> The following program fails its check for data
>>> having its expected byte pattern in dynamically
>>> allocated memory after a fork/swap-out/swap-in
>>> sequence.
>>>
>>> I'll note that the program sleeps for 60s after
>>> forking to give time to do something else to
>>> cause the parent and child processes to swap
>>> out (RES=0 as seen in top).
>>
>> The following about the extra test_check() was
>> wrong.
>>
>>> Note the source code line:
>>>
>>> // test_check(); // Adding this line prevents failure.
>>>
>>> It seem that accessing the region contents before forking
>>> and swapping avoids the problem. But there is a problem
>>> if the region was only written-to before the fork/swap.
>
> There is a place that if a test_check call is put then the
> problem does not happen at any stage: I tried putting a
> call between the fork and the later wait/sleep code:

I changed the byte sequence patterns to avoid zero values, since the bad values are zeros:

static value_type value(size_t v) { return (value_type)((v&0xFEu)|0x1u); }
                  // value now avoids the zero value since the failures
                  // are zeros.

With that I can test accurately which bytes have bad values and which do not. I also changed to:

void partial_test_check(void) {
    if (value(0u)!=gbl_region.array[0]) raise(SIGABRT);
    if (value(0u)!=(*dyn_region).array[0]) raise(SIGABRT);
}

since previously [0] had a zero value and so I'd used [1].

On this basis I'm now using the below. See the comments tied to the partial_test_check() calls:

extern void test_setup(void);         // Sets up the memory byte patterns.
extern void test_check(void);         // Tests the memory byte patterns.
extern void partial_test_check(void); // Tests just [0] of each region
                                      // (gbl_region and dyn_region).

int main(void)
{
    test_setup();
    test_check(); // Before fork() [passes]

    pid_t pid = fork();
    int wait_status = 0;

    // After fork; before wait/sleep/swap-out.

    if (0==pid) partial_test_check();
                // Even the above is sufficient by
                // itself to prevent failure for
                // region_size 1u through
                // 4u*1024u!
                // But 4u*1024u+1u and above fail
                // with this access to memory.
                // The failing test is of
                // (*dyn_region).array[4096u].
                // This test never fails here.

    if (0
> This suggests to me that the small access is forcing one or more things to
> be initialized for memory access that fork is not establishing of itself.
> It appears that if established correctly then the swap-out/swap-in
> sequence would work okay without needing the manual access to the memory.
>
>
> So far via this test I've not seen any evidence
IPFW kernel build failing on 11.0-STABLE
Tried building IPFW into my kernel and it failed midway with this:

cc -c -O2 -pipe -fno-strict-aliasing -g -nostdinc -I. -I/usr/src/sys -I/usr/src/sys/contrib/libfdt -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -MD -MF.depend.ppb_1284.o -MTppb_1284.o -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float -fno-asynchronous-unwind-tables -ffreestanding -fwrapv -fstack-protector -gdwarf-2 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -D__printf__=__freebsd_kprintf__ -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas -Wno-error-tautological-compare -Wno-error-empty-body -Wno-error-parentheses-equality -Wno-error-unused-function -Wno-error-pointer-sign -Wno-error-shift-negative-value -mno-aes -mno-avx -std=iso9899:1999 -Werror /usr/src/sys/dev/ppbus/ppb_1284.c
/usr/src/sys/dev/ppbus/ppb_1284.c:296:46: error: implicit conversion from 'int' to 'char' changes value from 144 to -112 [-Werror,-Wconstant-conversion]
        if ((error = do_peripheral_wait(bus, SELECT | nBUSY, 0))) {
/usr/src/sys/dev/ppbus/ppb_1284.c:785:48: error: implicit conversion from 'int' to 'char' changes value from 240 to -16 [-Werror,-Wconstant-conversion]
        if (do_1284_wait(bus, nACK | SELECT | PERROR | nBUSY,
/usr/src/sys/dev/ppbus/ppb_1284.c:786:29: error: implicit conversion from 'int' to 'char' changes value from 240 to -16 [-Werror,-Wconstant-conversion]
                nACK | SELECT | PERROR | nBUSY)) {
/usr/src/sys/dev/ppbus/ppb_1284.c:841:37: error: implicit conversion from 'int' to 'char' changes value from 200 to -56 [-Werror,-Wconstant-conversion]
        if (do_1284_wait(bus, nACK | nBUSY | nFAULT, nFAULT)) {
4 errors generated.
*** Error code 1

Stop.
make[2]: stopped in /usr/obj/usr/src/sys/IPFW
*** Error code 1

Stop.
make[1]: stopped in /usr/src
*** Error code 1

Stop.
make: stopped in /usr/src

[1]  Exit 1  ( make buildkernel KERNCONF=IPFW && make installkernel KERNCONF=IPFW )
root@:/usr/src # uname -a
FreeBSD 11.0-STABLE FreeBSD 11.0-STABLE #0 r314941: Thu Mar 9 19:39:31 UTC 2017 r...@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
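All four errors in the log above have the same shape: an OR of status bits that is larger than 127 is passed where a plain char is expected, and on amd64 plain char is signed. The following is only a sketch of that arithmetic. The EX_ names are hypothetical stand-ins, and their values are assumptions chosen so that the ORs reproduce the 144, 240 and 200 reported by the compiler; they are not copied from the ppbus headers.

```c
#include <assert.h>

/* Hypothetical stand-ins for the status bits named in the error messages.
 * The values are assumptions picked so that the ORs give 144, 240 and 200. */
enum {
    EX_nFAULT = 0x08,
    EX_SELECT = 0x10,
    EX_PERROR = 0x20,
    EX_nACK   = 0x40,
    EX_nBUSY  = 0x80
};

/* Mimics passing the mask through a signed 8-bit char. For values above
 * 127 the result is implementation-defined; common compilers wrap it,
 * which is how 144 turns into -112. */
static int as_signed_char(int mask)
{
    assert(mask >= 0 && mask <= 255);
    signed char c = (signed char)mask;
    return c;
}

/* With an unsigned 8-bit type the same mask is stored unchanged. */
static int as_unsigned_char(int mask)
{
    assert(mask >= 0 && mask <= 255);
    unsigned char c = (unsigned char)mask;
    return c;
}
```

With these assumed values, as_signed_char(EX_SELECT | EX_nBUSY) gives -112 on the usual compilers while as_unsigned_char() keeps 144. An unsigned parameter type would be one possible way to avoid the warning; whether that is the right change for ppb_1284.c is not something this sketch can settle.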
Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
On 2017-Mar-14, at 4:44 PM, Bernd Walter wrote:

> On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
>> [test_check() between the fork and the wait/sleep prevents the
>> failure from occurring. Even a small access to the memory at
>> that stage prevents the failure. Details follow.]
>
> Maybe a stupid question, since you might have written it somewhere.
> What medium do you swap to?
> I've seen broken firmware on microSD cards doing silent data
> corruption for some access patterns.

The root filesystem is on a USB SSD on a powered hub.

Only the kernel is from the microSD card.

I have several examples of the USB SSD model and have never observed such problems in any other context.

The original issue that started this investigation has been reported by several people on the lists:

Failed assertion: "tsd_booted"

on arm64 specifically, no other contexts so far as I know. Earlier I had discovered that:

A) I could use a swap-in to cause the messages from instances of sh or su that had swapped out earlier.

B) The core dumps showed that a large memory region containing the global tsd_booted had all turned into zero bytes. The assert is just exposing one of those zeros. (tsd_booted is from jemalloc, which is in a .so that is loaded.)

This prompted me to look for simpler contexts involving swapping that also show memory corruption.

So far I've only managed to produce corrupted memory when fork and later swapping are both involved. Being a shared library global is not a requirement for the problem, although such contexts can have an issue. I've not made a simpler example of that yet, although I tried.

I have not explored vfork, rfork, or any other alternatives.

So far all failure examples end up with zeroed memory when the memory does not match the original pattern from before the fork. At least that is what the core dumps show for all examples that I've looked at.

See bugzilla 217138 and 217239. In some respects this example is more analogous to the 217239 context as I remember.

My tests on amd64, armv6 (really -mcpu=cortex-a7 so armv7), and powerpc64 have never produced any problems, including never getting the failed assertion. Only arm64. (But I have access to only one arm64 system, a Pine64+ 2GB.)

Prior to this I tracked down a different arm64 problem to the fork_trampoline code (for the child process) modifying a system register but in a place allowing interrupts that could also change the value. Andrew Turner fixed that one at the time.

For this fork/swapping kind of issue I'm not sure that I'll be able to do more than provide the simpler example and the steps that I used. It does not seem likely that I will isolate the internal stage(s) and specific problem(s) at the code level of detail.

But whatever is found needs to be able to explain the contrast with an access after the fork but before the swap preventing the failing behavior. So what I've got so far hopefully does provide some hints to someone.

===
Mark Millard
markmi at dsl-only.net
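The contrast described above (an access after the fork but before the swap prevents the failure) suggests a probe that reads memory one page at a time in the child before it sleeps. What follows is a minimal sketch and not part of swap_testing.c: touch_each_page and its parameters are hypothetical names, and passing 4096 as the page size is an assumption based on the 4 KByte granularity reported elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reads one byte at every page_size step of
 * [base, base+len), then the last byte, and returns the number of reads.
 * The volatile qualifier keeps the compiler from dropping the loads. */
static size_t touch_each_page(const volatile unsigned char *base,
                              size_t len, size_t page_size)
{
    assert(page_size > 0);
    if (len == 0) return 0;

    size_t reads = 0;
    for (size_t off = 0; off < len; off += page_size) {
        (void)base[off];      /* the read itself is the point */
        reads++;
    }

    /* base need not be page aligned, so the range may end in a page that
     * the stepping above never reached; read the final byte as well. */
    (void)base[len - 1];
    return reads + 1;
}
```

In the test program this could be called in the child right after fork() as touch_each_page((*dyn_region).array, region_size, 4096u), before the sleep(60). Skipping some pages on purpose and then checking which ones come back zeroed would show whether the behavior really is per page. It is a diagnostic aid only, not a fix.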
Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
On Tue, Mar 14, 2017 at 03:28:53PM -0700, Mark Millard wrote:
> [test_check() between the fork and the wait/sleep prevents the
> failure from occurring. Even a small access to the memory at
> that stage prevents the failure. Details follow.]

Maybe a stupid question, since you might have written it somewhere.
What medium do you swap to?
I've seen broken firmware on microSD cards doing silent data
corruption for some access patterns.

--
B.Walter http://www.bwct.de
Modbus/TCP Ethernet I/O modules, ARM-based FreeBSD computers, and much more.
Re: arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
[test_check() between the fork and the wait/sleep prevents the failure from occurring. Even a small access to the memory at that stage prevents the failure. Details follow.]

On 2017-Mar-14, at 11:07 AM, Mark Millard wrote:

> [This is just a correction to the subject-line text to say arm64
> instead of amd64.]
>
> On 2017-Mar-14, at 12:58 AM, Mark Millard wrote:
>
> [Another correction I'm afraid --about alternative program variations
> this time.]
>
> On 2017-Mar-13, at 11:52 PM, Mark Millard wrote:
>
>> I'm still at a loss about how to figure out what stages are messed
>> up. (Memory coherency? Some memory not swapped out? Bad data swapped
>> out? Wrong data swapped in?)
>>
>> But at least I've found a much smaller/simpler example to demonstrate
>> some problem with in my Pine64+_ 2GB context.
>>
>> The Pine64+ 2GB is the only amd64 context that I have access to.
>
> Someday I'll learn to type arm64 the first time instead of amd64.
>
>> The following program fails its check for data
>> having its expected byte pattern in dynamically
>> allocated memory after a fork/swap-out/swap-in
>> sequence.
>>
>> I'll note that the program sleeps for 60s after
>> forking to give time to do something else to
>> cause the parent and child processes to swap
>> out (RES=0 as seen in top).
>
> The following about the extra test_check() was
> wrong.
>
>> Note the source code line:
>>
>> // test_check(); // Adding this line prevents failure.
>>
>> It seem that accessing the region contents before forking
>> and swapping avoids the problem. But there is a problem
>> if the region was only written-to before the fork/swap.

There is one place where adding a test_check call means the problem does not happen at any stage. I tried putting a call between the fork and the later wait/sleep code:

int main(void)
{
    test_setup();
    test_check(); // Before fork() [passes]

    pid_t pid = fork();
    int wait_status = 0;

    // test_check(); // After fork(); before wait/sleep.
                     // If used it prevents failure later!

    if (0
arm64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context) [Corrected subject: arm64!]
[This is just a correction to the subject-line text to say arm64 instead of amd64.]

On 2017-Mar-14, at 12:58 AM, Mark Millard wrote:

[Another correction I'm afraid --about alternative program variations this time.]

On 2017-Mar-13, at 11:52 PM, Mark Millard wrote:

> I'm still at a loss about how to figure out what stages are messed
> up. (Memory coherency? Some memory not swapped out? Bad data swapped
> out? Wrong data swapped in?)
>
> But at least I've found a much smaller/simpler example to demonstrate
> some problem with in my Pine64+_ 2GB context.
>
> The Pine64+ 2GB is the only amd64 context that I have access to.

Someday I'll learn to type arm64 the first time instead of amd64.

> The following program fails its check for data
> having its expected byte pattern in dynamically
> allocated memory after a fork/swap-out/swap-in
> sequence.
>
> I'll note that the program sleeps for 60s after
> forking to give time to do something else to
> cause the parent and child processes to swap
> out (RES=0 as seen in top).

The following about the extra test_check() was wrong.

> Note the source code line:
>
> // test_check(); // Adding this line prevents failure.
>
> It seem that accessing the region contents before forking
> and swapping avoids the problem. But there is a problem
> if the region was only written-to before the fork/swap.

This was because I'd carelessly moved some loop variables to globals in a way that depended on the initialization of the globals, and the extra call changed those values. I've noted code adjustments below (3 lines). I get the failures with them as well.

> Another point is the size of the region matters: <= 14K Bytes
> fails and > 14K Bytes works for as much has I have tested.
>
>
> # more swap_testing.c
> // swap_testing.c
>
> // Built via (c++ was clang++ 4.0 in my case):
> //
> // cc -g -std=c11 -Wpedantic swap_testing.c
> // -O0 and -O2 also gets the problem.
>
> #include <unistd.h> // for fork(), sleep(.)
> #include <sys/types.h> // for pid_t
> #include <sys/wait.h> // for wait(.)
>
> extern void test_setup(void); // Sets up the memory byte pattern.
> extern void test_check(void); // Tests the memory byte pattern.
>
> int main(void)
> {
>     test_setup();
    test_check(); // This test passes.
>
>     pid_t pid = fork();
>     int wait_status = 0;;
>
>     if (0<pid) { wait(&wait_status); }
>
>     if (-1!=wait_status && 0<=pid)
>     {
>         if (0==pid)
>         {
>             sleep(60);
>
>             // During this manually force this process to
>             // swap out. I use something like:
>
>             // stress -m 1 --vm-bytes 1800M
>
>             // in another shell and ^C'ing it after top
>             // shows the swapped status desired. 1800M
>             // just happened to work on the Pine64+ 2GB
>             // that I was using.
>         }
>
>         test_check();
>     }
> }
>
> // The memory and test code follows.
>
> #include <stdbool.h> // for bool, true, false
> #include <stddef.h> // for size_t, NULL
> #include <stdlib.h> // for malloc(.), free(.)
>
> #include <signal.h> // for raise(.), SIGABRT
>
> #define region_size (14u*1024u)
>     // Bad dyn_region pattern, parent and child
>     // processes:
>     // 256u, 4u*1024u, 8u*1024u, 9u*1024u,
>     // 12u*1024u, 14u*1024u
>
>     // Works:
>     // 14u*1024u+1u, 15u*1024u, 16u*1024u,
>     // 32u*1024u, 256u*1024u*1024u
>
> typedef volatile unsigned char value_type;
>
> struct region_struct { value_type array[region_size]; };
> typedef struct region_struct region;
>
> static region gbl_region;
> static region * volatile dyn_region = NULL;
>
> static value_type value(size_t v) { return (value_type)v; }
>
> void test_setup(void) {
>     dyn_region = malloc(sizeof(region));
>     if (!dyn_region) raise(SIGABRT);
>
>     for(size_t i=0u; i<region_size; i++) {
>         (*dyn_region).array[i] = gbl_region.array[i] = value(i);
>     }
> }
>
> static volatile bool gbl_failed = false; // Until potentially disproved
> static volatile size_t gbl_pos = 0u;
>
> static volatile bool dyn_failed = false; // Until potentially disproved
> static volatile size_t dyn_pos = 0u;
>
> void test_check(void) {
    gbl_pos = 0u;
>     while (!gbl_failed && gbl_pos<region_size) {
>         gbl_failed = (value(gbl_pos) != gbl_region.array[gbl_pos]);
>         gbl_pos++;
>     }
>
    dyn_pos = 0u;
>     while (!dyn_failed && dyn_pos<region_size) {
>         dyn_failed = (value(dyn_pos) != (*dyn_region).array[dyn_pos]);
>         // Note: When the memory pattern fails this case is that
>         // records the failure.
>         dyn_pos++;
>     }
>
>     if (gbl_failed) raise(SIGABRT);
>     if (dyn_failed) raise(SIGABRT); // lldb reports this line for the __raise call.
>                                     // when it fails (both parent and child processes).
> }

I'm not bothering to redo the details below for the line number variations.

> Other details from lldb
[Bug 213903] Kernel crashes from turnstile_broadcast (/usr/src/sys/kern/subr_turnstile.c:837)
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=213903

--- Comment #17 from Cassiano Peixoto ---
(In reply to Mateusz Guzik from comment #16)

Hi Mateusz,

Sorry, but I can't try this patch. I had to roll back to the old kernel to avoid crashes. It's a production server and I can't let it go down. :(
Re: mountroot failing on zpools in Azure after upgrade from 10 to 11 due to lack of waiting for da0
> Are you sure the above transcript is right? There are three reasons
> I'm asking. First, you'll see the "Root mount waiting" message,
> which means the root mount code is, well, waiting for storvsc, exactly
> as expected. Second - there is no "Trying to mount root". But most
> of all - for some reason the "Mounting failed" is shown _before_ the
> "Root mount waiting", and I have no idea how this could ever happen.

OK, that's interesting, and kind of worrying! I believe it's correct. I have put the full transcript up here so you can see all of it:

https://www.twisted.org.uk/~pete/914893a3-249e-4a91-851c-f467fc185eec.txt

I am assuming that Azure's capturing of the output is correct.

-pete.
Re: amd64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context)
[Another correction I'm afraid --about alternative program variations this time.]

On 2017-Mar-13, at 11:52 PM, Mark Millard wrote:

> I'm still at a loss about how to figure out what stages are messed
> up. (Memory coherency? Some memory not swapped out? Bad data swapped
> out? Wrong data swapped in?)
>
> But at least I've found a much smaller/simpler example to demonstrate
> some problem with in my Pine64+_ 2GB context.
>
> The Pine64+ 2GB is the only amd64 context that I have access to.

Someday I'll learn to type arm64 the first time instead of amd64.

> The following program fails its check for data
> having its expected byte pattern in dynamically
> allocated memory after a fork/swap-out/swap-in
> sequence.
>
> I'll note that the program sleeps for 60s after
> forking to give time to do something else to
> cause the parent and child processes to swap
> out (RES=0 as seen in top).

The following about the extra test_check() was wrong.

> Note the source code line:
>
> // test_check(); // Adding this line prevents failure.
>
> It seem that accessing the region contents before forking
> and swapping avoids the problem. But there is a problem
> if the region was only written-to before the fork/swap.

This was because I'd carelessly moved some loop variables to globals in a way that depended on the initialization of the globals, and the extra call changed those values. I've noted code adjustments below (3 lines). I get the failures with them as well.

> Another point is the size of the region matters: <= 14K Bytes
> fails and > 14K Bytes works for as much has I have tested.
>
>
> # more swap_testing.c
> // swap_testing.c
>
> // Built via (c++ was clang++ 4.0 in my case):
> //
> // cc -g -std=c11 -Wpedantic swap_testing.c
> // -O0 and -O2 also gets the problem.
>
> #include <unistd.h> // for fork(), sleep(.)
> #include <sys/types.h> // for pid_t
> #include <sys/wait.h> // for wait(.)
>
> extern void test_setup(void); // Sets up the memory byte pattern.
> extern void test_check(void); // Tests the memory byte pattern.
>
> int main(void)
> {
>     test_setup();
    test_check(); // This test passes.
>
>     pid_t pid = fork();
>     int wait_status = 0;;
>
>     if (0<pid) { wait(&wait_status); }
>
>     if (-1!=wait_status && 0<=pid)
>     {
>         if (0==pid)
>         {
>             sleep(60);
>
>             // During this manually force this process to
>             // swap out. I use something like:
>
>             // stress -m 1 --vm-bytes 1800M
>
>             // in another shell and ^C'ing it after top
>             // shows the swapped status desired. 1800M
>             // just happened to work on the Pine64+ 2GB
>             // that I was using.
>         }
>
>         test_check();
>     }
> }
>
> // The memory and test code follows.
>
> #include <stdbool.h> // for bool, true, false
> #include <stddef.h> // for size_t, NULL
> #include <stdlib.h> // for malloc(.), free(.)
>
> #include <signal.h> // for raise(.), SIGABRT
>
> #define region_size (14u*1024u)
>     // Bad dyn_region pattern, parent and child
>     // processes:
>     // 256u, 4u*1024u, 8u*1024u, 9u*1024u,
>     // 12u*1024u, 14u*1024u
>
>     // Works:
>     // 14u*1024u+1u, 15u*1024u, 16u*1024u,
>     // 32u*1024u, 256u*1024u*1024u
>
> typedef volatile unsigned char value_type;
>
> struct region_struct { value_type array[region_size]; };
> typedef struct region_struct region;
>
> static region gbl_region;
> static region * volatile dyn_region = NULL;
>
> static value_type value(size_t v) { return (value_type)v; }
>
> void test_setup(void) {
>     dyn_region = malloc(sizeof(region));
>     if (!dyn_region) raise(SIGABRT);
>
>     for(size_t i=0u; i<region_size; i++) {
>         (*dyn_region).array[i] = gbl_region.array[i] = value(i);
>     }
> }
>
> static volatile bool gbl_failed = false; // Until potentially disproved
> static volatile size_t gbl_pos = 0u;
>
> static volatile bool dyn_failed = false; // Until potentially disproved
> static volatile size_t dyn_pos = 0u;
>
> void test_check(void) {
    gbl_pos = 0u;
>     while (!gbl_failed && gbl_pos<region_size) {
>         gbl_failed = (value(gbl_pos) != gbl_region.array[gbl_pos]);
>         gbl_pos++;
>     }
>
    dyn_pos = 0u;
>     while (!dyn_failed && dyn_pos<region_size) {
>         dyn_failed = (value(dyn_pos) != (*dyn_region).array[dyn_pos]);
>         // Note: When the memory pattern fails this case is that
>         // records the failure.
>         dyn_pos++;
>     }
>
>     if (gbl_failed) raise(SIGABRT);
>     if (dyn_failed) raise(SIGABRT); // lldb reports this line for the __raise call.
>                                     // when it fails (both parent and child processes).
> }

I'm not bothering to redo the details below for the line number variations.

> Other details from lldb (not using -O2 so things are
> simpler, not presented in the order examined):
>
> # lldb a.out -c
Re: amd64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context)
On 2017-Mar-13, at 11:52 PM, Mark Millard wrote:

> I'm still at a loss about how to figure out what stages are messed
> up. (Memory coherency? Some memory not swapped out? Bad data swapped
> out? Wrong data swapped in?)
>
> But at least I've found a much smaller/simpler example to demonstrate
> some problem with in my Pine64+_ 2GB context.
>
> The Pine64+ 2GB is the only amd64 context that I have access to.

Someday I'll learn to type arm64 the first time instead of amd64.

> The following program fails its check for data
> having its expected byte pattern in dynamically
> allocated memory after a fork/swap-out/swap-in
> sequence.
>
> I'll note that the program sleeps for 60s after
> forking to give time to do something else to
> cause the parent and child processes to swap
> out (RES=0 as seen in top).
>
> Note the source code line:
>
>    // test_check(); // Adding this line prevents failure.
>
> It seem that accessing the region contents before forking
> and swapping avoids the problem. But there is a problem
> if the region was only written-to before the fork/swap.
>
> Another point is the size of the region matters: <= 14K Bytes
> fails and > 14K Bytes works for as much has I have tested.
>
>
> # more swap_testing.c
> // swap_testing.c
>
> // Built via (c++ was clang++ 4.0 in my case):
> //
> // cc -g -std=c11 -Wpedantic swap_testing.c
> // -O0 and -O2 also gets the problem.
>
> #include <unistd.h> // for fork(), sleep(.)
> #include <sys/types.h> // for pid_t
> #include <sys/wait.h> // for wait(.)
>
> extern void test_setup(void); // Sets up the memory byte pattern.
> extern void test_check(void); // Tests the memory byte pattern.
>
> int main(void)
> {
>     test_setup();
>     // test_check(); // Adding this line prevents failure.
>
>     pid_t pid = fork();
>     int wait_status = 0;;
>
>     if (0<pid) { wait(&wait_status); }
>
>     if (-1!=wait_status && 0<=pid)
>     {
>         if (0==pid)
>         {
>             sleep(60);
>
>             // During this manually force this process to
>             // swap out. I use something like:
>
>             // stress -m 1 --vm-bytes 1800M
>
>             // in another shell and ^C'ing it after top
>             // shows the swapped status desired. 1800M
>             // just happened to work on the Pine64+ 2GB
>             // that I was using.
>         }
>
>         test_check();
>     }
> }
>
> // The memory and test code follows.
>
> #include <stdbool.h> // for bool, true, false
> #include <stddef.h> // for size_t, NULL
> #include <stdlib.h> // for malloc(.), free(.)
>
> #include <signal.h> // for raise(.), SIGABRT
>
> #define region_size (14u*1024u)
>     // Bad dyn_region pattern, parent and child
>     // processes:
>     // 256u, 4u*1024u, 8u*1024u, 9u*1024u,
>     // 12u*1024u, 14u*1024u
>
>     // Works:
>     // 14u*1024u+1u, 15u*1024u, 16u*1024u,
>     // 32u*1024u, 256u*1024u*1024u
>
> typedef volatile unsigned char value_type;
>
> struct region_struct { value_type array[region_size]; };
> typedef struct region_struct region;
>
> static region gbl_region;
> static region * volatile dyn_region = NULL;
>
> static value_type value(size_t v) { return (value_type)v; }
>
> void test_setup(void) {
>     dyn_region = malloc(sizeof(region));
>     if (!dyn_region) raise(SIGABRT);
>
>     for(size_t i=0u; i<region_size; i++) {
>         (*dyn_region).array[i] = gbl_region.array[i] = value(i);
>     }
> }
>
> static volatile bool gbl_failed = false; // Until potentially disproved
> static volatile size_t gbl_pos = 0u;
>
> static volatile bool dyn_failed = false; // Until potentially disproved
> static volatile size_t dyn_pos = 0u;
>
> void test_check(void) {
>     while (!gbl_failed && gbl_pos<region_size) {
>         gbl_failed = (value(gbl_pos) != gbl_region.array[gbl_pos]);
>         gbl_pos++;
>     }
>
>     while (!dyn_failed && dyn_pos<region_size) {
>         dyn_failed = (value(dyn_pos) != (*dyn_region).array[dyn_pos]);
>         // Note: When the memory pattern fails this case is that
>         // records the failure.
>         dyn_pos++;
>     }
>
>     if (gbl_failed) raise(SIGABRT);
>     if (dyn_failed) raise(SIGABRT); // lldb reports this line for the __raise call.
>                                     // when it fails (both parent and child processes).
> }
>
>
> Other details from lldb (not using -O2 so things are
> simpler, not presented in the order examined):
>
> # lldb a.out -c /var/crash/a.out.11575.core
> (lldb) target create "a.out" --core "/var/crash/a.out.11575.core"
> Core file '/var/crash/a.out.11575.core' (aarch64) was loaded.
> (lldb) bt
> * thread #1, name = 'a.out', stop reason = signal SIGABRT
>   * frame #0: 0x40113d38 libc.so.7`_thr_kill + 8
>     frame #1: libc.so.7`__raise(s=6) at raise.c:52
>     frame #2: a.out`test_check at swap_testing.c:103
>     frame #3: a.out`main at
amd64 fork/swap data corruptions: A ~110 line C program demonstrating an example (Pine64+ 2GB context)
I'm still at a loss about how to figure out what stages are messed up. (Memory coherency? Some memory not swapped out? Bad data swapped out? Wrong data swapped in?)

But at least I've found a much smaller/simpler example to demonstrate some problem with in my Pine64+ 2GB context.

The Pine64+ 2GB is the only amd64 context that I have access to.

The following program fails its check for data having its expected byte pattern in dynamically allocated memory after a fork/swap-out/swap-in sequence.

I'll note that the program sleeps for 60s after forking to give time to do something else to cause the parent and child processes to swap out (RES=0 as seen in top).

Note the source code line:

    // test_check(); // Adding this line prevents failure.

It seems that accessing the region contents before forking and swapping avoids the problem. But there is a problem if the region was only written to before the fork/swap.

Another point is that the size of the region matters: <= 14K Bytes fails and > 14K Bytes works for as much as I have tested.

# more swap_testing.c
// swap_testing.c

// Built via (c++ was clang++ 4.0 in my case):
//
// cc -g -std=c11 -Wpedantic swap_testing.c
// -O0 and -O2 also gets the problem.

#include <unistd.h> // for fork(), sleep(.)
#include <sys/types.h> // for pid_t
#include <sys/wait.h> // for wait(.)

extern void test_setup(void); // Sets up the memory byte pattern.
extern void test_check(void); // Tests the memory byte pattern.

int main(void)
{
    test_setup();
    // test_check(); // Adding this line prevents failure.

    pid_t pid = fork();
    int wait_status = 0;

    if (0
103 if (dyn_failed) raise(SIGABRT); // lldb reports this line for the __raise call.
104 // when it