On Thu, Nov 16, 2017 at 01:45:51PM -0800, Marc MERLIN wrote:
> On Thu, Nov 16, 2017 at 06:27:44PM +0100, Holger Hoffstätte wrote:
> > On 11/16/17 18:07, Marc MERLIN wrote:
> > > Sorry, was missing the kernel number in the subject, just fixed that.
> > > 
> > > On Thu, Nov 16, 2017 at 09:04:45AM -0800, Marc MERLIN wrote:
> > >> My server now reboots every 20mn or so, with this.
> > >> Sadly another BUG_ON() and it won't even tell me which filesystem
> > >> it's on
> > >>
> > >> static inline u32 btrfs_extent_inline_ref_size(int type)
> > >> {
> > >>  if (type == BTRFS_TREE_BLOCK_REF_KEY ||
> > >>      type == BTRFS_SHARED_BLOCK_REF_KEY)
> > >>          return sizeof(struct btrfs_extent_inline_ref);
> > >>  if (type == BTRFS_SHARED_DATA_REF_KEY)
> > >>          return sizeof(struct btrfs_shared_data_ref) +
> > >>                 sizeof(struct btrfs_extent_inline_ref);
> > >>  if (type == BTRFS_EXTENT_DATA_REF_KEY)
> > >>          return sizeof(struct btrfs_extent_data_ref) +
> > >>                 offsetof(struct btrfs_extent_inline_ref, offset);
> > >>  BUG();
> > >>  return 0;
> > >> }
> > 
> > This BUG() was recently removed and seems to be caused by some kind
> > of persistent corruption, which is seen as invalid inline extent.
> > See [1], [2] for details. Maybe you can backport them?
> > Alternatively just give 4.14 a whirl, it's great.
> > 
> > -h
> > 
> > [1] 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=167ce953ca55bdee20fe56c3c0fa51002435f745
> > [2] 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4335958de2a43c6790c7f6aa0682aa7189983fa4
> 
> First thanks a lot for the quick reply, it was super timely considering
> my server was rebooting every 20mn :)
> I've now been running 4.14 for a couple of hours, and things seem ok
> btrfs-wise.
> 
> So, just so that I understand:
> 1) I do have some kind of FS problem/corruption (minor? major?)
>

As far as I understand, an extent inline ref mismatch can come from a
runtime memory error: btrfs ends up carrying the wrong extent inline
ref info in memory, calculates the checksum of the metadata block over
it, and then writes the block to disk. The next time you read it, the
checksum says OK, but the extent inline ref is invalid.  TBH, I
couldn't think of another reason.
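
To make that scenario concrete, here is a minimal userspace sketch. It
is not btrfs code: the toy_* names, the block layout and the ref type
values are all made up for illustration, and it uses Adler-32 instead
of crc32c only to keep it short. It shows that a checksum computed
after the in-memory damage still verifies on read, while a separate
validity check on the ref type fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ref type values for this toy; not the kernel's constants. */
enum { TOY_TREE_BLOCK_REF = 1, TOY_SHARED_DATA_REF = 2 };

/* Toy "metadata block": a checksum followed by the bytes it covers. */
struct toy_block {
	uint32_t csum;
	uint8_t  ref_type;
	uint8_t  payload[15];
};

/* Adler-32, chosen for brevity only (btrfs uses crc32c). */
static uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t a = 1, b = 0;

	for (size_t i = 0; i < len; i++) {
		a = (a + data[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* Checksum everything that follows the csum field. */
static uint32_t toy_block_csum(const struct toy_block *blk)
{
	const size_t off = offsetof(struct toy_block, ref_type);

	return toy_csum((const uint8_t *)blk + off, sizeof(*blk) - off);
}

/* Zero the block and set the ref type it is supposed to carry. */
static void toy_block_init(struct toy_block *blk, uint8_t ref_type)
{
	assert(blk != NULL);
	memset(blk, 0, sizeof(*blk));
	blk->ref_type = ref_type;
}

/* "Write" path: the checksum covers whatever is in memory right now. */
static void toy_block_seal(struct toy_block *blk)
{
	blk->csum = toy_block_csum(blk);
}

/* "Read" path, step 1: checksum verification. */
static int toy_block_csum_ok(const struct toy_block *blk)
{
	return blk->csum == toy_block_csum(blk);
}

/* "Read" path, step 2: is the ref type one we know about? */
static int toy_ref_type_valid(uint8_t type)
{
	return type == TOY_TREE_BLOCK_REF || type == TOY_SHARED_DATA_REF;
}
```

Flipping a bit in ref_type after toy_block_init() but before
toy_block_seal() gives a block for which toy_block_csum_ok() returns 1
while toy_ref_type_valid() returns 0.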

> 2) it started crashing 4.9.36 and then 4.13 today, every 20mn, probably
> due to some background cleaner process that kept starting and hitting
> the problem spot
> 
> 3) 4.14 does not crash anymore, but it doesn't even report any problem
> either. Does it mean the error that crashed the old kernel is minor
> enough that the new kernel doesn't bother even logging it?
>

If scrub doesn't report any problems, I'd say we're probably safe
now; otherwise you would at least see some errors in dmesg, and
some operations from userspace would be aborted.
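
As a rough illustration of "error in dmesg plus an aborted operation"
instead of a BUG(), here is a userspace sketch. It is not the actual
patch from [1] or [2]: the toy_* names and the sizes are invented, and
fprintf() merely stands in for a kernel log message. The size lookup
returns 0 for an unknown type, and the caller reports it and fails
with -EIO rather than crashing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical ref types and sizes; not the kernel's values. */
enum { TOY_TREE_BLOCK_REF = 1, TOY_SHARED_DATA_REF = 2 };

/* Return the size of an inline ref, or 0 if the type is unknown. */
static unsigned int toy_inline_ref_size(int type)
{
	switch (type) {
	case TOY_TREE_BLOCK_REF:
		return 9;	/* arbitrary size for the sketch */
	case TOY_SHARED_DATA_REF:
		return 13;	/* arbitrary size for the sketch */
	default:
		return 0;	/* invalid type: let the caller decide */
	}
}

/*
 * Walk a list of ref types and add up their sizes. On an invalid type,
 * log it and abort the operation with -EIO instead of crashing.
 */
static int toy_walk_inline_refs(const int *types, size_t n, size_t *total)
{
	assert(types != NULL && total != NULL);
	*total = 0;
	for (size_t i = 0; i < n; i++) {
		unsigned int sz = toy_inline_ref_size(types[i]);

		if (sz == 0) {
			fprintf(stderr, "toy: invalid inline ref type %d at index %zu\n",
				types[i], i);
			return -EIO;
		}
		*total += sz;
	}
	return 0;
}
```

With this shape, a corrupted entry costs one failed operation and one
log line, and the rest of the system keeps running.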

No idea why the hang below occurs.

Thanks,

-liubo

> 4) I just ran scrub on the filesystem and it ran fine.
> 
> Sadly, while the BUG_ON was another one that failed to say which
> mountpoint was affected, through painful trial and error, I think I
> found out that it was affecting the root filesystem.
> Doing a check or check --repair on that FS will be a major pain (need a
> rescue media with the right version of dmcrypt, bcache, btrfs kernel,
> and btrfs progs)
> 
> I'm assuming that running btrfs check --force on a mounted filesystem
> that is being used is not going to give useful results, unless I leave
> the FS read only. Correct?
> 
> 
> As for 4.14, the serial console code seems broken though, I can't get
> login or bash to work anymore on them:
> [ 2786.305004] INFO: task login:5636 blocked for more than 120 seconds.
> [ 2786.324648]       Tainted: G     U  W       4.14.0-amd64-stkreg-sysrq-20171018 #1
> [ 2786.347692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2786.371742] login           D    0  5636      1 0xa0020006
> [ 2786.388826] Call Trace:
> [ 2786.396756]  __schedule+0x4b3/0x5bd
> [ 2786.408077]  schedule+0x89/0x9a
> [ 2786.418070]  schedule_timeout+0x43/0x101
> [ 2786.430728]  ? default_wake_function+0x12/0x14
> [ 2786.444620]  ? woken_wake_function+0x11/0x13
> [ 2786.457967]  ldsem_down_write+0xe0/0x1a8
> [ 2786.470293]  ? ldsem_down_write+0xe0/0x1a8
> [ 2786.483143]  ? __wake_up_common_lock+0xa6/0xcf
> [ 2786.497039]  tty_ldisc_lock+0x16/0x30
> [ 2786.508587]  ? tty_ldisc_lock+0x16/0x30
> [ 2786.520655]  tty_ldisc_hangup+0xbb/0x170
> [ 2786.533000]  __tty_hangup+0x15f/0x21d
> [ 2786.544541]  tty_vhangup_session+0x13/0x15
> [ 2786.557388]  disassociate_ctty+0x51/0x209
> [ 2786.570004]  do_exit+0x43a/0x923
> [ 2786.580262]  ? recalc_sigpending_tsk+0x42/0x49
> [ 2786.594120]  do_group_exit+0x6c/0xa5
> [ 2786.605419]  get_signal+0x46b/0x4b3
> [ 2786.616464]  do_signal+0x37/0x5ed
> [ 2786.626969]  ? list_add+0x34/0x34
> [ 2786.637474]  ? C_SYSC_wait4+0x49/0x99
> [ 2786.649099]  ? handle_mm_fault+0x10f/0x17f
> [ 2786.661968]  prepare_exit_to_usermode+0x94/0xef
> [ 2786.676115]  syscall_return_slowpath+0xb9/0xd9
> [ 2786.690035]  do_fast_syscall_32+0xc3/0xfe
> [ 2786.702897]  entry_SYSENTER_compat+0x4c/0x5b
> [ 2786.716272] RIP: 0023:0xf7f45c29
> [ 2786.726496] RSP: 002b:00000000ffb5d0f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000072
> [ 2786.749827] RAX: fffffffffffffe00 RBX: 00000000ffffffff RCX: 0000000000000000
> [ 2786.772104] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000080504ec
> [ 2786.794087] RBP: 00000000ffb5f638 R08: 0000000000000000 R09: 0000000000000000
> [ 2786.794088] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 2786.794088] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> 
> [ 2665.988277] INFO: task bash:5685 blocked for more than 120 seconds.
> [ 2665.988278]       Tainted: G     U  W       4.14.0-amd64-stkreg-sysrq-20171018 #1
> [ 2665.988279] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2665.988281] bash            D    0  5685   5636 0xa0020086
> [ 2665.988284] Call Trace:
> [ 2665.988288]  __schedule+0x4b3/0x5bd
> [ 2665.988291]  schedule+0x89/0x9a
> [ 2665.988293]  schedule_preempt_disabled+0x15/0x1e
> [ 2665.988294]  __mutex_lock.isra.1+0x16d/0x2e0
> [ 2665.988298]  __mutex_lock_slowpath+0x13/0x15
> [ 2665.988300]  ? __mutex_lock_slowpath+0x13/0x15
> [ 2665.988301]  mutex_lock+0x2a/0x2d
> [ 2665.988304]  tty_lock+0x31/0x3c
> [ 2665.988306]  tty_release+0x48/0x53c
> [ 2665.988310]  __fput+0xf0/0x190
> [ 2665.988312]  ____fput+0xe/0x10
> [ 2665.988314]  task_work_run+0x79/0x8c
> [ 2665.988317]  do_exit+0x447/0x923
> [ 2665.988320]  ? recalc_sigpending_tsk+0x42/0x49
> [ 2665.988322]  do_group_exit+0x6c/0xa5
> [ 2665.988323]  get_signal+0x46b/0x4b3
> [ 2665.988327]  do_signal+0x37/0x5ed
> [ 2665.988329]  ? group_send_sig_info+0x4e/0x56
> [ 2665.988331]  ? SYSC_kill+0xa8/0x1b1
> [ 2665.988333]  ? do_sigaction+0xbe/0x18b
> [ 2665.988335]  ? __audit_syscall_entry+0xc2/0xe6
> [ 2665.988338]  prepare_exit_to_usermode+0x94/0xef
> [ 2665.988341]  syscall_return_slowpath+0xb9/0xd9
> [ 2665.988343]  do_fast_syscall_32+0xc3/0xfe
> [ 2665.988345]  entry_SYSENTER_compat+0x4c/0x5b
> [ 2665.988347] RIP: 0023:0xf7f24c29
> [ 2665.988348] RSP: 002b:00000000ffccf9ec EFLAGS: 00000206 ORIG_RAX: 0000000000000025
> [ 2665.988350] RAX: 0000000000000000 RBX: 0000000000001635 RCX: 0000000000000001
> [ 2665.988351] RDX: 0000000000000001 RSI: 00000000080a0310 RDI: 0000000000000000
> [ 2665.988351] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
> [ 2665.988352] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 2665.988353] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                       .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html