On Thu, Nov 16, 2017 at 06:27:44PM +0100, Holger Hoffstätte wrote:
> On 11/16/17 18:07, Marc MERLIN wrote:
> > Sorry, was missing the kernel number in the subject, just fixed that.
> > 
> > On Thu, Nov 16, 2017 at 09:04:45AM -0800, Marc MERLIN wrote:
> >> My server now reboots every 20mn or so, with this.
> >> Sadly another BUG_ON() and it won't even tell me which filesystem
> >> it's on
> >>
> >> static inline u32 btrfs_extent_inline_ref_size(int type)
> >> {
> >>    if (type == BTRFS_TREE_BLOCK_REF_KEY ||
> >>        type == BTRFS_SHARED_BLOCK_REF_KEY)
> >>            return sizeof(struct btrfs_extent_inline_ref);
> >>    if (type == BTRFS_SHARED_DATA_REF_KEY)
> >>            return sizeof(struct btrfs_shared_data_ref) +
> >>                   sizeof(struct btrfs_extent_inline_ref);
> >>    if (type == BTRFS_EXTENT_DATA_REF_KEY)
> >>            return sizeof(struct btrfs_extent_data_ref) +
> >>                   offsetof(struct btrfs_extent_inline_ref, offset);
> >>    BUG();
> >>    return 0;
> >> }
> 
> This BUG() was recently removed and seems to be caused by some kind
> of persistent corruption, which is seen as invalid inline extent.
> See [1], [2] for details. Maybe you can backport them?
> Alternatively just give 4.14 a whirl, it's great.
> 
> -h
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=167ce953ca55bdee20fe56c3c0fa51002435f745
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4335958de2a43c6790c7f6aa0682aa7189983fa4

First thanks a lot for the quick reply, it was super timely considering
my server was rebooting every 20mn :)
I've now been running 4.14 for a couple of hours, and things seem ok
btrfs-wise.

So, just so that I understand:
1) I do have some kind of FS problem/corruption (minor? major?)

2) it started crashing 4.9.36 and then 4.13 today, every 20mn, probably due to
some background cleaner process that kept starting and hitting the problem spot

3) 4.14 does not crash anymore, but it doesn't even report any problem either.
Does it mean the error that crashed the old kernel is minor enough that the
new kernel doesn't bother even logging it?

4) I just ran scrub on the filesystem and it ran fine.

Sadly, while the BUG_ON was another one that failed to say which
mountpoint was affected, through painful trial and error, I think I
found out that it was affecting the root filesystem.
Doing a check or check --repair on that FS will be a major pain (I need rescue
media with the right versions of dmcrypt, bcache, btrfs kernel, and btrfs progs).

I'm assuming that running btrfs check --force on a mounted filesystem
that is being used is not going to give useful results, unless I leave
the FS read only. Correct?


As for 4.14, the serial console code seems broken though, I can't get login or
bash to work on it anymore:
[ 2786.305004] INFO: task login:5636 blocked for more than 120 seconds.
[ 2786.324648]       Tainted: G     U  W       4.14.0-amd64-stkreg-sysrq-20171018 #1
[ 2786.347692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2786.371742] login           D    0  5636      1 0xa0020006
[ 2786.388826] Call Trace:
[ 2786.396756]  __schedule+0x4b3/0x5bd
[ 2786.408077]  schedule+0x89/0x9a
[ 2786.418070]  schedule_timeout+0x43/0x101
[ 2786.430728]  ? default_wake_function+0x12/0x14
[ 2786.444620]  ? woken_wake_function+0x11/0x13
[ 2786.457967]  ldsem_down_write+0xe0/0x1a8
[ 2786.470293]  ? ldsem_down_write+0xe0/0x1a8
[ 2786.483143]  ? __wake_up_common_lock+0xa6/0xcf
[ 2786.497039]  tty_ldisc_lock+0x16/0x30
[ 2786.508587]  ? tty_ldisc_lock+0x16/0x30
[ 2786.520655]  tty_ldisc_hangup+0xbb/0x170
[ 2786.533000]  __tty_hangup+0x15f/0x21d
[ 2786.544541]  tty_vhangup_session+0x13/0x15
[ 2786.557388]  disassociate_ctty+0x51/0x209
[ 2786.570004]  do_exit+0x43a/0x923
[ 2786.580262]  ? recalc_sigpending_tsk+0x42/0x49
[ 2786.594120]  do_group_exit+0x6c/0xa5
[ 2786.605419]  get_signal+0x46b/0x4b3
[ 2786.616464]  do_signal+0x37/0x5ed
[ 2786.626969]  ? list_add+0x34/0x34
[ 2786.637474]  ? C_SYSC_wait4+0x49/0x99
[ 2786.649099]  ? handle_mm_fault+0x10f/0x17f
[ 2786.661968]  prepare_exit_to_usermode+0x94/0xef
[ 2786.676115]  syscall_return_slowpath+0xb9/0xd9
[ 2786.690035]  do_fast_syscall_32+0xc3/0xfe
[ 2786.702897]  entry_SYSENTER_compat+0x4c/0x5b
[ 2786.716272] RIP: 0023:0xf7f45c29
[ 2786.726496] RSP: 002b:00000000ffb5d0f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000072
[ 2786.749827] RAX: fffffffffffffe00 RBX: 00000000ffffffff RCX: 0000000000000000
[ 2786.772104] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000080504ec
[ 2786.794087] RBP: 00000000ffb5f638 R08: 0000000000000000 R09: 0000000000000000
[ 2786.794088] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2786.794088] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

[ 2665.988277] INFO: task bash:5685 blocked for more than 120 seconds.
[ 2665.988278]       Tainted: G     U  W       4.14.0-amd64-stkreg-sysrq-20171018 #1
[ 2665.988279] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2665.988281] bash            D    0  5685   5636 0xa0020086
[ 2665.988284] Call Trace:
[ 2665.988288]  __schedule+0x4b3/0x5bd
[ 2665.988291]  schedule+0x89/0x9a
[ 2665.988293]  schedule_preempt_disabled+0x15/0x1e
[ 2665.988294]  __mutex_lock.isra.1+0x16d/0x2e0
[ 2665.988298]  __mutex_lock_slowpath+0x13/0x15
[ 2665.988300]  ? __mutex_lock_slowpath+0x13/0x15
[ 2665.988301]  mutex_lock+0x2a/0x2d
[ 2665.988304]  tty_lock+0x31/0x3c
[ 2665.988306]  tty_release+0x48/0x53c
[ 2665.988310]  __fput+0xf0/0x190
[ 2665.988312]  ____fput+0xe/0x10
[ 2665.988314]  task_work_run+0x79/0x8c
[ 2665.988317]  do_exit+0x447/0x923
[ 2665.988320]  ? recalc_sigpending_tsk+0x42/0x49
[ 2665.988322]  do_group_exit+0x6c/0xa5
[ 2665.988323]  get_signal+0x46b/0x4b3
[ 2665.988327]  do_signal+0x37/0x5ed
[ 2665.988329]  ? group_send_sig_info+0x4e/0x56
[ 2665.988331]  ? SYSC_kill+0xa8/0x1b1
[ 2665.988333]  ? do_sigaction+0xbe/0x18b
[ 2665.988335]  ? __audit_syscall_entry+0xc2/0xe6
[ 2665.988338]  prepare_exit_to_usermode+0x94/0xef
[ 2665.988341]  syscall_return_slowpath+0xb9/0xd9
[ 2665.988343]  do_fast_syscall_32+0xc3/0xfe
[ 2665.988345]  entry_SYSENTER_compat+0x4c/0x5b
[ 2665.988347] RIP: 0023:0xf7f24c29
[ 2665.988348] RSP: 002b:00000000ffccf9ec EFLAGS: 00000206 ORIG_RAX: 0000000000000025
[ 2665.988350] RAX: 0000000000000000 RBX: 0000000000001635 RCX: 0000000000000001
[ 2665.988351] RDX: 0000000000000001 RSI: 00000000080a0310 RDI: 0000000000000000
[ 2665.988351] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 2665.988352] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 2665.988353] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901
