[ ... thread cleanup ... ]
Don't hesitate to ask if you need me to debug or even ftrace
something.
Thanks Stéphane. I haven't seen that crash yet (still running tests
for 2 consecutive days now).
Can you please try the following patch, which works on top of mine,
and enable ftrace before running balance:
Debug patch: https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor
Enable ftrace:
$ echo > /sys/kernel/debug/tracing/trace
$ echo "nop" > /sys/kernel/debug/tracing/current_tracer
$ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb # if
you can use larger buffer size, even better
$ echo > /sys/kernel/debug/tracing/set_ftrace_filter
$ echo 1 > /sys/kernel/debug/tracing/tracing_on
$ run balance... wait until it finishes with IO error or the
patch's printk message shows up in dmesg/syslog
$ echo 0 > /sys/kernel/debug/tracing/tracing_on
$ cat /sys/kernel/debug/tracing/trace > some_file.txt
Then send is some_file.txt for debugging, hopefully it will give
some
useful information. Note that it might produce tons of messages,
depending on how long it takes for you to hit the BUG_ON.
Thanks a lot for this.
I'm compiling it now (using your v2 of the friendpaste diff).
I took the liberty to add a tracing_off() right before the return
-EIO
so that the trace tail ends exactly at the right place.
Last time I tried to use ftrace to diagnose the bug we're trying to
fix, the system crashes so hard that usually it's complicated to get
the trace contents written somewhere before the system is unusable.
But I'll eventually work around it by using
/sys/kernel/debug/tracing/trace_pipe to send the trace live to
another
machine over the LAN.
This series of bugs are so easy to trigger on my system that we'll
hopefully get something useful out of the trace. I guess that's a
good
thing !
So, this time it took a little over an hour to get the crash, but it
did
reach the -EIO condition eventually.
The ftrace log (2M gzipped) is available here :
http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz
The associated kernel log is as follows :
[ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than
120
seconds.
[ 2880.178600] Not tainted 4.3.0-rc6p7463161+ #3
[ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp)
score 246
or sacrifice child
[ 3088.829435] Killed process 9449 (df-complex2simp)
total-vm:964732kB,
anon-rss:943764kB, file-rss:0kB
[ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than
120
seconds.
[ 3600.197657] Not tainted 4.3.0-rc6p7463161+ #3
[ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than
120
seconds.
[ 3840.204180] Not tainted 4.3.0-rc6p7463161+ #3
[ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp)
score 227
or sacrifice child
[ 3993.671989] Killed process 11357 (df-complex2simp)
total-vm:891608kB,
anon-rss:870704kB, file-rss:60kB
[ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than
120
seconds.
[ 4080.210336] Not tainted 4.3.0-rc6p7463161+ #3
[ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than
120
seconds.
[ 4320.215662] Not tainted 4.3.0-rc6p7463161+ #3
[ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than
120
seconds.
[ 4560.221146] Not tainted 4.3.0-rc6p7463161+ #3
[ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than
120
seconds.
[ 4800.226898] Not tainted 4.3.0-rc6p7463161+ #3
[ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp)
score 207
or sacrifice child
[ 4890.116138] Killed process 13377 (df-complex2simp)
total-vm:834976kB,
anon-rss:793272kB, file-rss:48kB
[ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp)
score 201
or sacrifice child
[ 5785.793586] Killed process 15285 (df-complex2simp)
total-vm:802208kB,
anon-rss:772172kB, file-rss:4kB
[ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than
120
seconds.
[ 6480.269738] Not tainted 4.3.0-rc6p7463161+ #3
[ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576,
ref_mod 2,
seq 0 action 1
[ 7081.967784] BTRFS: error (device dm-3) in
btrfs_run_delayed_refs:2872:
errno=-5 IO failure
The OOM conditions are unrelated, this is an rrdtool cron script that
apparently takes too much memory from time to time, I'll disable it
for
further tests to get noise out of the way.
If I can be of some more help, just ask.