Le 2015-10-23 12:11, Filipe Manana a écrit :
On Fri, Oct 23, 2015 at 12:03 AM, Filipe Manana <fdman...@kernel.org> wrote:
On Thu, Oct 22, 2015 at 11:38 PM, Stéphane Lesimple
<stephane_bt...@lesimple.fr> wrote:
[ ... thread cleanup ... ]

Don't hesitate to ask if you need me to debug or even ftrace something.


Thanks Stéphane. I haven't seen that crash yet (still running tests
for 2 consecutive days now).
Can you please try the following patch, which works on top of mine,
and enable ftrace before running balance:

Debug patch:  https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor

Enable ftrace:

    $ echo > /sys/kernel/debug/tracing/trace
    $ echo "nop" > /sys/kernel/debug/tracing/current_tracer
    $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb   # if
you can use larger buffer size, even better
    $ echo > /sys/kernel/debug/tracing/set_ftrace_filter
    $ echo 1 > /sys/kernel/debug/tracing/tracing_on

    $ run balance... wait until it finishes with IO error or the
patch's printk message shows up in dmesg/syslog

    $ echo 0 > /sys/kernel/debug/tracing/tracing_on

    $ cat /sys/kernel/debug/tracing/trace > some_file.txt

Then send is some_file.txt for debugging, hopefully it will give some
useful information. Note that it might produce tons of messages,
depending on how long it takes for you to hit the BUG_ON.

Thanks a lot for this.


I'm compiling it now (using your v2 of the friendpaste diff).

I took the liberty to add a tracing_off() right before the return -EIO
so that the trace tail ends exactly at the right place.

Last time I tried to use ftrace to diagnose the bug we're trying to
fix, the system crashes so hard that usually it's complicated to get
the trace contents written somewhere before the system is unusable.
But I'll eventually work around it by using
/sys/kernel/debug/tracing/trace_pipe to send the trace live to another
machine over the LAN.

This series of bugs are so easy to trigger on my system that we'll
hopefully get something useful out of the trace. I guess that's a good
thing !


So, this time it took a little over an hour to get the crash, but it did
reach the -EIO condition eventually.
The ftrace log (2M gzipped) is available here :
http://www.speed47.net/tmp2/btrfs-4.3rc6p7463161-ftrace1.log.gz

The associated kernel log is as follows :

[ 2880.178589] INFO: task btrfs-transacti:7358 blocked for more than 120
seconds.
[ 2880.178600]       Not tainted 4.3.0-rc6p7463161+ #3
[ 2880.178603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 3088.829429] Out of memory: Kill process 9449 (df-complex2simp) score 246
or sacrifice child
[ 3088.829435] Killed process 9449 (df-complex2simp) total-vm:964732kB,
anon-rss:943764kB, file-rss:0kB
[ 3600.197642] INFO: task btrfs-transacti:7358 blocked for more than 120
seconds.
[ 3600.197657]       Not tainted 4.3.0-rc6p7463161+ #3
[ 3600.197660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 3840.204146] INFO: task btrfs-transacti:7358 blocked for more than 120
seconds.
[ 3840.204180]       Not tainted 4.3.0-rc6p7463161+ #3
[ 3840.204219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 3993.671982] Out of memory: Kill process 11357 (df-complex2simp) score 227
or sacrifice child
[ 3993.671989] Killed process 11357 (df-complex2simp) total-vm:891608kB,
anon-rss:870704kB, file-rss:60kB
[ 4080.210324] INFO: task btrfs-transacti:7358 blocked for more than 120
seconds.
[ 4080.210336]       Not tainted 4.3.0-rc6p7463161+ #3
[ 4080.210339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 4320.215635] INFO: task btrfs-transacti:7358 blocked for more than 120
seconds.
[ 4320.215662]       Not tainted 4.3.0-rc6p7463161+ #3
[ 4320.215667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 4560.221119] INFO: task btrfs-transacti:7358 blocked for more than 120
seconds.
[ 4560.221146]       Not tainted 4.3.0-rc6p7463161+ #3
[ 4560.221148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 4800.226884] INFO: task btrfs-transacti:7358 blocked for more than 120
seconds.
[ 4800.226898]       Not tainted 4.3.0-rc6p7463161+ #3
[ 4800.226902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 4890.116131] Out of memory: Kill process 13377 (df-complex2simp) score 207
or sacrifice child
[ 4890.116138] Killed process 13377 (df-complex2simp) total-vm:834976kB,
anon-rss:793272kB, file-rss:48kB
[ 5785.793580] Out of memory: Kill process 15285 (df-complex2simp) score 201
or sacrifice child
[ 5785.793586] Killed process 15285 (df-complex2simp) total-vm:802208kB,
anon-rss:772172kB, file-rss:4kB
[ 6480.269728] INFO: task btrfs-transacti:7358 blocked for more than 120
seconds.
[ 6480.269738]       Not tainted 4.3.0-rc6p7463161+ #3
[ 6480.269740] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 7081.967354] BTRFS: here, ref_mod != 1, bytenr 12090260504576, ref_mod 2,
seq 0 action 1
[ 7081.967784] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:2872:
errno=-5 IO failure

The OOM conditions are unrelated, this is an rrdtool cron script that
apparently takes too much memory from time to time, I'll disable it for
further tests to get noise out of the way.

If I can be of some more help, just ask.

That's great Stéphane.
I'll take a look tomorrow morning at the logs and then get back to you.

Thanks a lot for that.

Stéphane, can you do the balance again but with the following new debug patch:

https://friendpaste.com/3NXzDFGcK4Hxi0yiypMgRO

(I added your tracing_off(); call)

It applies on top of my fix.
Also, are you by chance using qgroups (btrfs quotas)? If you are I
think I know what's the problem, but either way please run with this
new debug patch, so that I can confirm my ideas.

I am indeed using qgroups (use case for me is to be able to tell how much real space are taking snapshots).

I'm applying your patch and resuming the balance, will keep you posted as usual!

Many thanks, much appreciated!

And thanks for taking the time to look into this :)

--
Stéphane.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to