Hello,
We recently changed our SSD, and after a power cycle under load we found
corruption in many files on the NILFS partition. Some I/O was in progress
when the corruption appeared (possibly at the moment of the power cycle).
I checked the SSD with SMART tools and no errors seem to have been logged.
Here is the dmesg(1) output from a mount followed by a read of one of the
affected files:
NILFS nilfs_fill_super: start(silent=0)
NILFS warning: mounting unchecked fs
NILFS(recovery) nilfs_search_super_root: found super root: segnum=1824,
seq=2164205, pseg_start=3737132, pseg_offset=1613
NILFS: recovery complete.
segctord starting. Construction interval = 5 seconds, CP frequency < 30
seconds
NILFS warning: mounting fs with errors
NILFS nilfs_fill_super: mounted filesystem
attempt to access beyond end of device
sda3: rw=0, want=9331952664445036944, limit=49930020
attempt to access beyond end of device
sda3: rw=0, want=8178943302301875000, limit=49930020
attempt to access beyond end of device
sda3: rw=0, want=11216631730677685184, limit=49930020
attempt to access beyond end of device
sda3: rw=0, want=7566444304283562424, limit=49930020
attempt to access beyond end of device
sda3: rw=0, want=7161651204113109256, limit=49930020
attempt to access beyond end of device
sda3: rw=0, want=5463845364605981576, limit=49930020
attempt to access beyond end of device
sda3: rw=0, want=8888728157704767904, limit=49930020
attempt to access beyond end of device
sda3: rw=0, want=9331952664445036944, limit=49930020
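Out of curiosity I did a quick arithmetic check on those numbers (a throwaway
userspace sketch of my own, nothing taken from the nilfs sources): the reported
limit of 49930020 512-byte sectors corresponds exactly to the 25564170240-byte
device size shown by nilfs-tune below, while the requested sectors are on the
order of 2^63, so they look like garbage block pointers rather than reads that
merely run slightly past the end of the device.

/* Throwaway sanity check on the "beyond end of device" numbers from the
 * dmesg output above; this is my own sketch, not nilfs code. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t limit = 49930020ULL;       /* 512-byte sectors, from dmesg */
    const uint64_t dev_size = 25564170240ULL; /* bytes, from nilfs-tune */
    const uint64_t want[] = {                 /* a few of the "want" sectors */
        9331952664445036944ULL,
        8178943302301875000ULL,
        11216631730677685184ULL,
    };

    /* The limit is in 512-byte sectors; it should match the device size. */
    printf("limit * 512 = %llu, device size = %llu\n",
           (unsigned long long)(limit * 512), (unsigned long long)dev_size);

    /* The requested sectors are nowhere near the end of a ~25 GB device. */
    for (size_t i = 0; i < sizeof(want) / sizeof(want[0]); i++)
        printf("want = %llu  (%.2e times the device limit)\n",
               (unsigned long long)want[i],
               (double)want[i] / (double)limit);
    return 0;
}

Compiled and run, that prints limit * 512 = 25564170240, which matches the
device size exactly, so the limit itself looks sane and only the requested
block addresses are nonsense.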
The output of nilfs-tune is:
nilfs-tune -l /dev/sda3
nilfs-tune 2.1.0
Filesystem volume name: /writable
Filesystem UUID: 11d71018-2c18-42ad-a842-f475e6b1c449
Filesystem magic number: 0x3434
Filesystem revision #: 2.0
Filesystem features: (none)
Filesystem state: invalid or mounted,error
Filesystem OS type: Linux
Block size: 4096
Filesystem created: Mon Jul 11 17:21:39 2011
Last mount time: Tue Sep 11 10:41:22 2012
Last write time: Tue Sep 11 10:41:22 2012
Mount count: 972
Maximum mount count: 50
Reserve blocks uid: 0 (user root)
Reserve blocks gid: 0 (group root)
First inode: 11
Inode size: 128
DAT entry size: 32
Checkpoint size: 192
Segment usage size: 16
Number of segments: 3047
Device size: 25564170240
First data block: 1
# of blocks per segment: 2048
Reserved segments %: 5
Last checkpoint #: 2555913
Last block address: 3737132
Last sequence #: 2164205
Free blocks count: 5777408
Commit interval: 0
# of blks to create seg: 0
CRC seed: 0xb9934a73
CRC check sum: 0x1f5cb561
CRC check data size: 0x00000118
The problem first appeared a few days ago, possibly at that power cycle, and
it seems to have been growing since. The first error in /var/log/messages
(incidentally, even the messages.1 file was corrupted in the middle) was for
this directory, which now gives an I/O error on any readdir:
NILFS error (device sda3): nilfs_check_page: bad entry in directory
#10778: rec_len is smaller than minimal - offset=0,
inode=733085696, rec_len=0, name_len=193
NILFS error (device sda3): nilfs_readdir: bad page in #10778
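In case it is useful, here is a rough userspace re-creation of the
directory-entry checks that nilfs_check_page appears to be applying; it covers
both the "rec_len is smaller than minimal" error above and the "directory
entry across blocks" error quoted further below. The struct layout and the
rounding macro are my reading of nilfs2_fs.h, so treat the details as
approximate:

/* Rough userspace re-creation of the directory-entry sanity checks; the
 * layout and rec_len rounding follow my reading of nilfs2_fs.h and may
 * differ in detail from the kernel I am actually running. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE        4096        /* from the nilfs-tune output above */
#define NILFS_DIR_PAD     8
#define NILFS_DIR_ROUND   (NILFS_DIR_PAD - 1)
#define NILFS_DIR_REC_LEN(name_len) \
    (((name_len) + 12 + NILFS_DIR_ROUND) & ~NILFS_DIR_ROUND)

struct nilfs_dir_entry {              /* on-disk directory entry */
    uint64_t inode;                   /* inode number */
    uint16_t rec_len;                 /* length of this record */
    uint8_t  name_len;                /* length of the file name */
    uint8_t  file_type;
    /* name[] follows on disk */
};

/* Return NULL if the entry at 'offset' looks sane, else a reason string. */
static const char *check_entry(const struct nilfs_dir_entry *de,
                               unsigned int offset)
{
    if (de->rec_len < NILFS_DIR_REC_LEN(1))
        return "rec_len is smaller than minimal";
    if (de->rec_len & (NILFS_DIR_PAD - 1))
        return "unaligned directory entry";
    if (de->rec_len < NILFS_DIR_REC_LEN(de->name_len))
        return "rec_len is too small for name_len";
    if (offset + de->rec_len > BLOCK_SIZE)
        return "directory entry across blocks";
    return NULL;
}

int main(void)
{
    /* The two broken entries reported for directory #10778. */
    struct nilfs_dir_entry bad1 = { .inode = 733085696,
                                    .rec_len = 0, .name_len = 193 };
    struct nilfs_dir_entry bad2 = { .inode = 1346725220,
                                    .rec_len = 24320, .name_len = 90 };
    const char *err;

    err = check_entry(&bad1, 0);
    printf("entry 1: %s\n", err ? err : "ok");
    err = check_entry(&bad2, 0);
    printf("entry 2: %s\n", err ? err : "ok");
    return 0;
}

Both broken entries fail for exactly the reason the kernel prints: rec_len=0
is below the minimum record length, and rec_len=24320 cannot fit inside a
4096-byte directory block. To me that suggests the directory blocks themselves
contain garbage rather than the checker misfiring.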
The same directory later produced the same error again, and later still this:
NILFS error (device sda3): nilfs_check_page: bad entry in directory
#10778: directory entry across blocks - offset=0, inode=1346725220,
rec_len=24320, name_len=90
NILFS error (device sda3): nilfs_readdir: bad page in #10778
[<c04c2fbc>] nilfs_btree_do_lookup+0xa9/0x234
[<c04c2fdf>] nilfs_btree_do_lookup+0xcc/0x234
[<c04c441d>] nilfs_btree_lookup_contig+0x54/0x349
[<f88634d8>] scsi_done+0x0/0x16 [scsi_mod]
[<f88df964>] ata_scsi_translate+0x107/0x12c [libata]
[<f88634d8>] scsi_done+0x0/0x16 [scsi_mod]
[<f88e20ae>] ata_scsi_queuecmd+0x18f/0x1ac [libata]
[<f88e20c3>] ata_scsi_queuecmd+0x1a4/0x1ac [libata]
[<c04f6ca4>] elv_next_request+0x127/0x134
[<c04c29a3>] nilfs_bmap_lookup_contig+0x31/0x43
[<c04bd214>] nilfs_get_block+0xb9/0x227
[<c04f6d78>] elv_insert+0xc7/0x160
[<c0495970>] do_mpage_readpage+0x2a4/0x5fd
[<c04bd15b>] nilfs_get_block+0x0/0x227
[<c0458ba8>] find_lock_page+0x1a/0x7e
[<c045b314>] find_or_create_page+0x31/0x88
[<c04c0a62>] __nilfs_get_page_block+0x70/0x8a
[<c04c1171>] nilfs_grab_buffer+0x53/0x11a
[<c0458d64>] add_to_page_cache+0x91/0xa2
[<c0495da9>] mpage_readpages+0x82/0xb6
[<c04bd15b>] nilfs_get_block+0x0/0x227
[<c045d2c9>] __alloc_pages+0x69/0x2cf
[<c04bc651>] nilfs_readpages+0x0/0x15
[<c045e800>] __do_page_cache_readahead+0x11d/0x183
[<c04bd15b>] nilfs_get_block+0x0/0x227
[<c045e8ac>] blockable_page_cache_readahead+0x46/0x99
[<c045ea3f>] page_cache_readahead+0xb3/0x178
[<c0459270>] do_generic_mapping_read+0xb8/0x380
[<c0459daa>] __generic_file_aio_read+0x16a/0x1a3
[<c045887d>] file_read_actor+0x0/0xd5
[<c0459e1e>] generic_file_aio_read+0x3b/0x42
[<c0475b83>] do_sync_read+0xb6/0xf1
[<c0476cbb>] file_move+0x27/0x32
[<c043607b>] autoremove_wake_function+0x0/0x2d
[<c0475acd>] do_sync_read+0x0/0xf1
[<c047645c>] vfs_read+0x9f/0x141
[<c04768aa>] sys_read+0x3c/0x63
[<c0404f17>] syscall_call+0x7/0xb
=======================
NILFS: btree level mismatch: 36 != 1
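The "btree level mismatch: 36 != 1" line at the end of that trace looks like
another failed consistency check to me: as far as I can tell, while the lookup
walks down the bmap btree it compares the level stored in each on-disk node
header with the level it expects at that depth, and here the node read from
disk claims level 36 where a level-1 node was expected. A minimal illustration
of that kind of check (the structure and names below are mine, not the
kernel's):

/* Minimal illustration of a btree-descent consistency check of the kind
 * that appears to print "btree level mismatch"; the structure and names
 * are mine, not taken from the nilfs sources. */
#include <stdio.h>
#include <stdint.h>

struct btree_node_header {
    uint8_t  level;        /* depth of this node within the tree */
    uint16_t nchildren;
};

static int node_level_ok(const struct btree_node_header *node,
                         int expected_level)
{
    if (node->level != expected_level) {
        fprintf(stderr, "btree level mismatch: %d != %d\n",
                node->level, expected_level);
        return 0;
    }
    return 1;
}

int main(void)
{
    /* The values from the trace above: the node read from disk says
     * level 36 where the walk expected a level-1 node. */
    struct btree_node_header corrupt = { .level = 36, .nchildren = 0 };

    return node_level_ok(&corrupt, 1) ? 0 : 1;
}

So as far as I can see, both the directory errors and the btree error boil
down to on-disk metadata failing basic self-consistency checks.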
Since then we have seen corruption in many more files and directories on the
NILFS partition, with a variety of different errors and stack traces.
Has anybody seen these errors and worked around them? If so, could you please
let me know how? Any thoughts on whether this is an SSD issue or a nilfs bug?
If it is a nilfs bug, has it been fixed in a newer version of the kernel
module? Thanks a lot.
Zahid