Bug#714295: linux-image-3.2.0-4-686-pae: ext4 on top of raid1. Frequent and serious data corruption issues

2013-06-28 Thread Martin Braure de Calignon
On Friday 28 June 2013 at 02:31 +0100, Ben Hutchings wrote:
  [...]

 I still suspect a hardware problem, just because ext4 is the default
 filesystem for 'wheezy' and no-one else reported this yet.

That sounds like a reasonable assumption, yeah.

 [...]
 So maybe the second controller (or its driver) is faulty.  Can you try
 using the RAID-1 disks connected to the first controller, with nothing
 connected to the second controller? [...]

So, as planned, I unplugged the working non-RAID1 disks from their
controller and connected the ext4 RAID1 disk and the ext3 RAID1 disk to it
(yeah, these are two powerful RAID1 arrays with only 1 device each ;) for
testing purposes).
I also reseated each PCI card and reconnected the video card fan, which
had been unplugged (yeah, disconnecting it a few years ago to limit the
noise level was a bad idea).

I ran all the tests I could think of to try to overheat the system (same as
yesterday):
* 4 parallel runs of dd if=/dev/urandom | gzip > /dev/null to load the CPU
* a massive copy from one disk to the other
* deleting duplicates between two directories (with many duplicates)

All of that in parallel. Everything seems to work fine. No corruption and no
CPU overheating messages (yesterday I still had some, even after removing
the CPU overclock).
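
A copy test like the one above only shows that nothing complained; comparing checksums makes silent corruption visible. The following is a minimal sketch, not the procedure used in this thread. The names sha256_of and copy_and_verify and the file name stress.bin are hypothetical, and two temporary directories stand in for mount points on the two disks. A file read back right after being written may be served from the page cache, so a real test would need to drop caches or remount first.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src_dir, dst_dir, size_bytes=4 * 1024 * 1024):
    """Write random data under src_dir, copy it to dst_dir, compare digests."""
    src = os.path.join(src_dir, "stress.bin")  # hypothetical file name
    dst = os.path.join(dst_dir, "stress.bin")
    with open(src, "wb") as handle:
        handle.write(os.urandom(size_bytes))
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)


if __name__ == "__main__":
    # Two temporary directories stand in for the two disks (assumption).
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        print("copy intact:", copy_and_verify(a, b))
```

Running it in a loop while the CPU load and the duplicate deletion are active would approximate the combined test described above.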

However, I still have two disks and a controller that I would love to
use. How can we go further on this issue?

It seems there's a relation between the PCI card and the CPU messages
(but the video card fan could be related too)...
For the record, the PCI card is supposed to be (and was) faster than the
working one (SATA-II). Could there be too much traffic on the PCI bus?
Or maybe the card is doing something the driver doesn't expect?

Here's the lspci - for this card (if I'm not wrong):

02:09.0 SATA controller: Initio Corporation INI-1623 PCI SATA-II Controller (rev 02) (prog-if 00 [Vendor specific])
        Subsystem: Initio Corporation Device 1626
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 32, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 17
        Region 0: I/O ports at 9000 [size=256]
        Region 1: Memory at ef022000 (32-bit, non-prefetchable) [size=4K]
        [virtual] Expansion ROM at 8000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk+ DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: sata_inic162x

(and the website is: http://www.initio.com/Html/INIC-1623TA2.asp)
It was a cheap one, I have to admit.

The working card is:
02:0b.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
        Subsystem: Promise Technology, Inc. PDC40718 (SATA 300 TX4)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 88 (1000ns min, 4500ns max), Cache Line Size: 4 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: I/O ports at 9c00 [size=128]
        Region 2: I/O ports at a000 [size=256]
        Region 3: Memory at ef021000 (32-bit, non-prefetchable) [size=4K]
        Region 4: Memory at ef00 (32-bit, non-prefetchable) [size=128K]
        [virtual] Expansion ROM at 8004 [disabled] [size=32K]
        Capabilities: [60] Power Management version 2
                Flags: PMEClk- DSI+ D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: sata_promise
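
When comparing controllers, the most useful field in the listings above is the "Kernel driver in use" line. As a rough sketch (the function name drivers_by_slot and the shortened SAMPLE text are illustrative, not part of the original report), that mapping could be pulled out of saved lspci output like this:

```python
import re


def drivers_by_slot(lspci_text):
    """Map each PCI slot (e.g. '02:09.0') to its 'Kernel driver in use' value."""
    drivers = {}
    slot = None
    for line in lspci_text.splitlines():
        # A device header starts at column 0 with bus:device.function.
        header = re.match(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", line)
        if header:
            slot = header.group(1)
            continue
        found = re.search(r"Kernel driver in use:\s*(\S+)", line)
        if found and slot is not None:
            drivers[slot] = found.group(1)
    return drivers


# Shortened copy of the two listings above (most fields omitted).
SAMPLE = """\
02:09.0 SATA controller: Initio Corporation INI-1623 PCI SATA-II Controller (rev 02)
\tKernel driver in use: sata_inic162x
02:0b.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
\tKernel driver in use: sata_promise
"""

if __name__ == "__main__":
    for slot, driver in drivers_by_slot(SAMPLE).items():
        print(slot, driver)
```

For this machine it separates the suspect controller (sata_inic162x) from the working one (sata_promise).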



How can I help take this further?

Martin.

-- 
Martin Braure de Calignon



2013-06-27 Thread Martin Braure de Calignon
Package: src:linux
Version: 3.2.46-1
Severity: critical
Tags: upstream
Justification: causes serious data loss

Hi,

I'm experiencing frequent data corruption on my raid1 ext4 fs.
The error is not always the same.

I first thought it was due to an earlier resize of the FS.
Several times I got messages about a huge number of multiply-claimed blocks 
in inodes.
fsck.ext4 never completed fully and always ended with a message saying the FS 
still had errors.
I was unable to copy all the files to another FS to save them.
So I ended up running badblocks (that's where I was dumb: I chose a 
non-data-preserving mode).
However, badblocks completed without errors.
So in the end I lost some data, but I don't know whether that's due to the bug 
or to the badblocks check. Feel free to readjust the severity.

Since then, I have bought two brand-new disks, created a completely new ext4 
FS, and copied over the files that I had successfully recovered.
Then I ran fsck.ext4 on the FS... it seemed to be almost working.
I remounted /dev/md0... And each time I start using the system 
seriously, I get new errors, like the ones I had today
(I was just copying files onto it; see the log below).

I'm now going to move my remaining data to an ext3 FS.
However, I'm leaving one of my disks as it is so that we can debug it :)

I don't really know if it could be related to 
https://lkml.org/lkml/2012/10/23/690 or 
http://www.phoronix.com/scan.php?page=news_item&px=MTIxNDQ.

Let me know how I can help more.

[1374596.124050] CPU0: Core temperature above threshold, cpu clock throttled 
(total events = 5921)
[1374596.125031] CPU0: Core temperature/speed normal
[1374825.38] [Hardware Error]: Machine check events logged
[1374896.128061] CPU0: Core temperature above threshold, cpu clock throttled 
(total events = 8319)
[1374896.129092] CPU0: Core temperature/speed normal
[1374975.48] [Hardware Error]: Machine check events logged
[1375196.180050] CPU0: Core temperature above threshold, cpu clock throttled 
(total events = 13674)
[1375196.180960] CPU0: Core temperature/speed normal
[1375200.37] [Hardware Error]: Machine check events logged
[1375588.276057] CPU0: Core temperature above threshold, cpu clock throttled 
(total events = 16548)
[1375588.276909] CPU0: Core temperature/speed normal
[1375725.40] [Hardware Error]: Machine check events logged
[1375888.280046] CPU0: Core temperature above threshold, cpu clock throttled 
(total events = 23033)
[1375888.281129] CPU0: Core temperature/speed normal
[1376175.52] [Hardware Error]: Machine check events logged
[1436849.120036] EXT4-fs (md0): error count: 6
[1436849.120044] EXT4-fs (md0): initial error at 1371763084: 
htree_dirblock_to_tree:587: inode 20971803: block 83894316
[1436849.120054] EXT4-fs (md0): last error at 1371765809: 
htree_dirblock_to_tree:587: inode 41813096: block 167256110
[1446656.923648] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode 
#52698372: block 210773049: comm smbd: bad entry in directory: directory entry 
across blocks - offset=1052(9244), inode=1949184565, rec_len=29816, name_len=24
[1446692.729694] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode 
#59514149: block 238036040: comm smbd: bad entry in directory: directory entry 
across blocks - offset=1416(17800), inode=59514502, rec_len=29816, name_len=0
[1446733.264104] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode 
#51126667: block 204481618: comm find: bad entry in directory: directory entry 
across blocks - offset=2588(14876), inode=1949185078, rec_len=29816, 
name_len=232
[1446733.360346] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode 
#52168818: block 208675851: comm smbd: bad entry in directory: directory entry 
across blocks - offset=2588(6684), inode=1949185587, rec_len=29816, name_len=46
[1446738.872118] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode 
#52168818: block 208675851: comm find: bad entry in directory: directory entry 
across blocks - offset=2588(6684), inode=1949185587, rec_len=29816, name_len=46
[1446741.036110] EXT4-fs error (device md0): ext4_lookup:1050: inode #52436122: 
comm smbd: deleted inode referenced: 52436608
[1446741.048313] EXT4-fs error (device md0): ext4_lookup:1050: inode #52436122: 
comm smbd: deleted inode referenced: 52436607
[1446743.740081] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode 
#59514149: block 238036040: comm find: bad entry in directory: directory entry 
across blocks - offset=1416(17800), inode=59514502, rec_len=29816, name_len=0
[1446745.511445] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode 
#52698372: block 210773049: comm find: bad entry in directory: directory entry 
across blocks - offset=1052(9244), inode=1949184565, rec_len=29816, name_len=24
[1446770.143364] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode 
#51126667: block 204481618: comm find: bad entry in directory: directory entry 
across blocks 
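
To see whether the same inodes and blocks keep recurring in a log like the one above, the error lines can be grouped mechanically. This is only a sketch under stated assumptions: the regular expression was written against the two message shapes visible in this report (htree_dirblock_to_tree and ext4_lookup) and may not match other EXT4-fs error formats; parse_ext4_errors and SAMPLE are illustrative names.

```python
import re
from collections import Counter

# Matches the two message shapes seen in this report; "block N:" is optional.
ERROR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+EXT4-fs error \(device (?P<dev>\w+)\): "
    r"(?P<func>\w+):(?P<line>\d+): inode #(?P<inode>\d+): "
    r"(?:block (?P<block>\d+): )?comm (?P<comm>[^:]+):"
)


def parse_ext4_errors(log_text):
    """Return one dict per EXT4-fs error found in (possibly line-wrapped) dmesg text."""
    # The mail client wrapped long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [match.groupdict() for match in ERROR_RE.finditer(flat)]


# Three entries copied from the log above, including the mail line wrapping.
SAMPLE = """
[1446656.923648] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode
#52698372: block 210773049: comm smbd: bad entry in directory: directory entry
across blocks - offset=1052(9244), inode=1949184565, rec_len=29816, name_len=24
[1446741.036110] EXT4-fs error (device md0): ext4_lookup:1050: inode #52436122:
comm smbd: deleted inode referenced: 52436608
[1446745.511445] EXT4-fs error (device md0): htree_dirblock_to_tree:587: inode
#52698372: block 210773049: comm find: bad entry in directory: directory entry
across blocks - offset=1052(9244), inode=1949184565, rec_len=29816, name_len=24
"""

if __name__ == "__main__":
    errors = parse_ext4_errors(SAMPLE)
    print(Counter(e["func"] for e in errors))
    print(Counter(e["inode"] for e in errors))
```

In the sample, inode 52698372 and block 210773049 are reported by both smbd and find, which suggests the damage is in the stored directory block, not in whichever process happens to read it.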


2013-06-27 Thread Ben Hutchings
Control: tag -1 moreinfo

On Thu, Jun 27, 2013 at 06:06:06PM +0200, Martin Braure de Calignon wrote:
 Package: src:linux
 Version: 3.2.46-1
 Severity: critical
 Tags: upstream
 Justification: causes serious data loss
 
 Hi,
 
 I'm experiencing frequent data corruption on my raid1 ext4 fs.
 The error is not always the same.
[...]
 [1374596.124050] CPU0: Core temperature above threshold, cpu clock throttled 
 (total events = 5921)
 [1374596.125031] CPU0: Core temperature/speed normal
 [1374825.38] [Hardware Error]: Machine check events logged
[...]

This looks like a hardware fault.  If you install mcelog it will
decode these MCEs (machine check events) into /var/log/messages
which may provide a clue about what's going wrong.

Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
  - Albert Camus


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org




2013-06-27 Thread Martin Braure de Calignon
On Thursday 27 June 2013 at 17:51 +0100, Ben Hutchings wrote:
 [...]
  [1374596.124050] CPU0: Core temperature above threshold, cpu clock 
  throttled (total events = 5921)
  [1374596.125031] CPU0: Core temperature/speed normal
  [1374825.38] [Hardware Error]: Machine check events logged
 [...]
 
 This looks like a hardware fault.  If you install mcelog it will
 decode these MCEs (machine check events) into /var/log/messages
 which may provide a clue about what's going wrong.

Thank you Ben,

So you think that this temperature stuff could be related to my ext4
problems?

Maybe I should just clean the fan and add another one; it's a pretty
old PC.

Here's the /var/log/mcelog content:


mcelog: failed to prefill DIMM database from DMI data
Kernel does not support page offline interface
mcelog: mcelog read: No such device
Hardware event. This is not a software error.
MCE 0
CPU 0 THERMAL EVENT TSC 89103c9e1e5ac 
TIME 1372136129 Tue Jun 25 06:55:29 2013
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 3 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 1
CPU 0 THERMAL EVENT TSC 89103c9ff355e 
TIME 1372136129 Tue Jun 25 06:55:29 2013
Processor 0 below trip temperature. Throttling disabled
STATUS 2 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 2
CPU 0 THERMAL EVENT TSC 89186467abf3c 
TIME 1372136429 Tue Jun 25 07:00:29 2013
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 3 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 3
CPU 0 THERMAL EVENT TSC 89186469938b8 
TIME 1372136429 Tue Jun 25 07:00:29 2013
Processor 0 below trip temperature. Throttling disabled
STATUS 2 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 4
CPU 0 THERMAL EVENT TSC 89208caa630a8 
TIME 1372136729 Tue Jun 25 07:05:29 2013
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 3 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 5
CPU 0 THERMAL EVENT TSC 89208cac53248 
TIME 1372136729 Tue Jun 25 07:05:29 2013
Processor 0 below trip temperature. Throttling disabled
STATUS 2 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 6
CPU 0 THERMAL EVENT TSC 91f72ff7d78b0 
TIME 1372219966 Wed Jun 26 06:12:46 2013
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 3 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 7
CPU 0 THERMAL EVENT TSC 91f72ff983458 
TIME 1372219966 Wed Jun 26 06:12:46 2013
Processor 0 below trip temperature. Throttling disabled
STATUS 2 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 8
CPU 0 THERMAL EVENT TSC 91ff57ab10e84 
TIME 1372220266 Wed Jun 26 06:17:46 2013
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 3 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 9
CPU 0 THERMAL EVENT TSC 91ff57acd320e 
TIME 1372220266 Wed Jun 26 06:17:46 2013
Processor 0 below trip temperature. Throttling disabled
STATUS 2 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 10
CPU 0 THERMAL EVENT TSC 92077fb398eac 
TIME 1372220566 Wed Jun 26 06:22:46 2013
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 3 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 11
CPU 0 THERMAL EVENT TSC 92077fb523454 
TIME 1372220566 Wed Jun 26 06:22:46 2013
Processor 0 below trip temperature. Throttling disabled
STATUS 2 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a software error.
MCE 12
CPU 0 THERMAL EVENT TSC 9212284292302 
TIME 1372220958 Wed Jun 26 06:29:18 2013
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 3 MCGSTATUS 0
MCGCAP c0204 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 15 Model 1
Hardware event. This is not a 
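
The mcelog output above is repetitive enough that a short summary is easier to read than the raw records. The sketch below counts throttle-on and throttle-off records and lists the gaps between episodes. The function names and SAMPLE are illustrative; only the "TIME" lines and the two "trip temperature" phrases are taken from the log above.

```python
import re


def summarize_thermal_events(mcelog_text):
    """Count throttle-on and throttle-off records and collect their TIME values."""
    heated = len(re.findall(r"heated above trip temperature", mcelog_text))
    cooled = len(re.findall(r"below trip temperature", mcelog_text))
    times = [int(t) for t in re.findall(r"^TIME (\d+)", mcelog_text, re.MULTILINE)]
    return {"heated": heated, "cooled": cooled, "times": times}


def gaps_between_episodes(times):
    """Return the differences, in seconds, between consecutive distinct TIME values."""
    distinct = sorted(set(times))
    return [later - earlier for earlier, later in zip(distinct, distinct[1:])]


# The TIME and message lines of MCE 0, 1 and 2 from the log above.
SAMPLE = """\
TIME 1372136129 Tue Jun 25 06:55:29 2013
Processor 0 heated above trip temperature. Throttling enabled.
TIME 1372136129 Tue Jun 25 06:55:29 2013
Processor 0 below trip temperature. Throttling disabled
TIME 1372136429 Tue Jun 25 07:00:29 2013
Processor 0 heated above trip temperature. Throttling enabled.
"""

if __name__ == "__main__":
    summary = summarize_thermal_events(SAMPLE)
    print(summary["heated"], summary["cooled"])
    print(gaps_between_episodes(summary["times"]))
```

Most episodes in the full log are spaced exactly 300 seconds apart, while the kernel's "total events" counter climbs by thousands in between. That pattern may reflect rate-limited logging, so the true throttling frequency could be higher; this is an interpretation, not something the log states.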


2013-06-27 Thread Ben Hutchings
On Thu, Jun 27, 2013 at 07:14:57PM +0200, Martin Braure de Calignon wrote:
[...]
 So you think that this temperature stuff could be related to my ext4
 problems?
[...]

Potentially.  As I understand it, current Intel desktop and mobile
processors may not be able to run all cores at full speed continuously
with a standard cooler.  The embedded controller can quickly adjust
the CPU frequency and voltage to keep it from overheating, but it
still depends on a properly functioning cooler.  If it does overheat
then this can certainly result in data corruption.

Are you overclocking the processor?

Ben. 

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
  - Albert Camus



2013-06-27 Thread Martin Braure de Calignon
On Thursday 27 June 2013 at 18:31 +0100, Ben Hutchings wrote:
 On Thu, Jun 27, 2013 at 07:14:57PM +0200, Martin Braure de Calignon wrote:
 [...]
  So you think that this temperature stuff could be related to my ext4
  problems?
 [...]

 Are you overclocking the processor?
First of all, thank you for your support.

Well, I was unsure, so I opened the computer and removed the CPU fan.
The processor is a 1.7 GHz model, but in the BIOS the CPU clock was set to
110 where it should be 100 for 1.7 GHz.
So yes, it was overclocked. It no longer is.
Second problem: my CPU fan was controlled by a potentiometer that was
not turned to its maximum. I removed it and connected the CPU fan
directly to the mainboard. I'm also not sure there was enough thermal
paste between the processor and the fan, so that remains something to
look into if the problem reappears.
Third problem: the PC was full of dust. I cleaned it.
Fourth problem [1]: it seems (on purpose, in the Linux source code) that my
processor is not happy with the ondemand cpufreq governor, and it was
automatically switched to performance.
I modified the default configuration so that it uses powersave
mode...
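
For reference, the size of that overclock can be worked out from the two BIOS values. This sketch assumes the BIOS "CPU clock" is the base clock in MHz and that the multiplier is fixed, neither of which the message states explicitly; effective_clock_mhz is an illustrative name.

```python
def effective_clock_mhz(rated_mhz, rated_base_mhz, actual_base_mhz):
    """Scale the rated core clock by the ratio implied by a fixed multiplier."""
    multiplier = rated_mhz / rated_base_mhz  # 1700 / 100 gives a multiplier of 17
    return multiplier * actual_base_mhz


if __name__ == "__main__":
    actual = effective_clock_mhz(1700, 100, 110)
    overclock_percent = round((actual / 1700 - 1) * 100, 1)
    print(actual, "MHz,", overclock_percent, "% above the rated clock")
```

Under those assumptions the CPU was running at about 1.87 GHz, roughly 10% above its rated speed.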


For my original email, before rebooting, I ran dmesg -T... The time
between the CPU-related errors and the ext4 errors is about 17 hours, so
I'm still not sure they're related.
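
The 17-hour figure can be checked against the seconds-since-boot timestamps in the original report. The three values below are copied from that dmesg excerpt: the last "Machine check events logged" line (whose fractional part looks truncated in the archive), the "error count" summary line, and the first individual EXT4-fs error. The helper name hours_between is illustrative.

```python
def hours_between(earlier_s, later_s):
    """Return the gap between two dmesg timestamps (seconds since boot) in hours."""
    return (later_s - earlier_s) / 3600.0


LAST_MCE_S = 1376175.52             # last "Machine check events logged"
EXT4_SUMMARY_S = 1436849.120036     # "EXT4-fs (md0): error count: 6"
FIRST_EXT4_ERROR_S = 1446656.923648 # first "EXT4-fs error (device md0)" entry

if __name__ == "__main__":
    print(round(hours_between(LAST_MCE_S, EXT4_SUMMARY_S), 1))
    print(round(hours_between(LAST_MCE_S, FIRST_EXT4_ERROR_S), 1))
```

The gap comes out at just under 17 hours to the summary line and a little under 20 hours to the first individual error, which matches the estimate above.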

I just rebooted with the new CPU frequencies and tried everything. I
copied a big directory from my (new) ext3 partition to the raid1 ext4
device (which currently has only 1 device in it ;)):
no problem for about 15 minutes...
So I tried something else. I'm not sure, but I've noticed that the
problem occurs more often when I'm deleting files, so I ran fdupes on
another big directory of that raid1 ext4 device.
And boom [2]!




[1]: when I do:
$cpufreq-set -g ondemand
  [3759.734304] ondemand governor failed, too long transition
latency of HW, fall back to performance governor
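
The kernel message in [1] refers to the hardware's frequency transition latency, which cpufreq exposes in sysfs next to the active governor. The sketch below reads both files from a cpufreq directory. The 10 ms limit is an assumption about the ondemand governor in kernels of this era, not something stated in the thread, and read_cpufreq is an illustrative name. On a real system the directory would typically be /sys/devices/system/cpu/cpu0/cpufreq; the demo uses a temporary directory with made-up values.

```python
import os
import tempfile

# Assumption: ondemand refuses hardware whose transition latency exceeds 10 ms.
ASSUMED_ONDEMAND_LIMIT_NS = 10 * 1000 * 1000


def read_cpufreq(cpufreq_dir):
    """Read the governor and transition latency from a cpufreq sysfs directory."""
    def read(name):
        with open(os.path.join(cpufreq_dir, name)) as handle:
            return handle.read().strip()

    latency_ns = int(read("cpuinfo_transition_latency"))
    return {
        "governor": read("scaling_governor"),
        "latency_ns": latency_ns,
        "ondemand_plausible": latency_ns <= ASSUMED_ONDEMAND_LIMIT_NS,
    }


if __name__ == "__main__":
    # Demonstration with made-up values in a fake directory.
    with tempfile.TemporaryDirectory() as fake:
        with open(os.path.join(fake, "scaling_governor"), "w") as handle:
            handle.write("performance\n")
        with open(os.path.join(fake, "cpuinfo_transition_latency"), "w") as handle:
            handle.write("10000001\n")
        print(read_cpufreq(fake))
```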

[2]: 
[...] // the lines before are just the system boot, then the manual mount
of the RAID devices and the other devices. I ran a forced fsck.ext4
right before.

[ 1618.164009] EXT4-fs (md0): mounted filesystem with journalled data
mode. Opts: data=journal
[ 3341.908050] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003610
[ 3342.327029] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003610
[ 3342.456047] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003604
[ 3342.607818] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003604
[ 3342.616049] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003597
[ 3342.616049] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003597
[ 3342.624584] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003587
[ 3342.628041] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003587
[ 3342.628055] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003607
[ 3342.628055] EXT4-fs error (device md0): ext4_lookup:1050: inode
#54003553: comm fdupes: deleted inode referenced: 54003607
[ 3480.764050] EXT4-fs error (device md0): ext4_lookup:1050: inode
#55972184: comm fdupes: deleted inode referenced: 55972483
[ 3480.764050] EXT4-fs error (device md0): ext4_lookup:1050: inode
#55972184: comm fdupes: deleted inode referenced: 55972483
[ 3480.764050] EXT4-fs error (device md0): ext4_lookup:1050: inode
#55972184: comm fdupes: deleted inode referenced: 55972492
[ 3480.764050] EXT4-fs error (device md0): ext4_lookup:1050: inode
#55972184: comm fdupes: deleted inode referenced: 55972492


Any clue? :/
A few other comments: all my other filesystems are ext4 (except the new
ext3 one I created this afternoon) and I have never had a problem with them.
My PC originally had only an IDE bus, so I added two PCI controllers to
get SATA.
The first two disks, on the first controller (a big LVM), have never had a
problem with ext4.
The other two disks, on the second controller (a different brand), make up
the RAID1 device.



2013-06-27 Thread Ben Hutchings
On Fri, 2013-06-28 at 01:28 +0200, Martin Braure de Calignon wrote:
 On Thursday 27 June 2013 at 18:31 +0100, Ben Hutchings wrote:
  On Thu, Jun 27, 2013 at 07:14:57PM +0200, Martin Braure de Calignon wrote:
  [...]
   So you think that this temperature stuff could be related to my ext4
   problems?
  [...]
 
  Are you overclocking the processor?
 First of all, thank you for your support.
 
 Well, I was unsure, so I opened the computer and removed the CPU fan.
 The processor is a 1.7 GHz model, but in the BIOS the CPU clock was set to
 110 where it should be 100 for 1.7 GHz.
 So yes, it was overclocked. It no longer is.
 Second problem: my CPU fan was controlled by a potentiometer that was
 not turned to its maximum. I removed it and connected the CPU fan
 directly to the mainboard. I'm also not sure there was enough thermal
 paste between the processor and the fan, so that remains something to
 look into if the problem reappears.
 Third problem: the PC was full of dust. I cleaned it.
 Fourth problem [1]: it seems (on purpose, in the Linux source code) that my
 processor is not happy with the ondemand cpufreq governor, and it was
 automatically switched to performance.

Right, I guess this is an old Pentium 4 which can't change frequency
very quickly.

[...]
 I just rebooted with the new CPU frequencies and tried everything. I
 copied a big directory from my (new) ext3 partition to the raid1 ext4
 device (which currently has only 1 device in it ;)):
 no problem for about 15 minutes...
 So I tried something else. I'm not sure, but I've noticed that the
 problem occurs more often when I'm deleting files, so I ran fdupes on
 another big directory of that raid1 ext4 device.
 And boom [2]!
[...]
 [2]: 
 [...] // the lines before are just the system boot, then the manual mount
 of the RAID devices and the other devices. I ran a forced fsck.ext4
 right before.
 
 [ 1618.164009] EXT4-fs (md0): mounted filesystem with journalled data
 mode. Opts: data=journal
 [ 3341.908050] EXT4-fs error (device md0): ext4_lookup:1050: inode
 #54003553: comm fdupes: deleted inode referenced: 54003610
 [ 3342.327029] EXT4-fs error (device md0): ext4_lookup:1050: inode
 #54003553: comm fdupes: deleted inode referenced: 54003610
 [ 3342.456047] EXT4-fs error (device md0): ext4_lookup:1050: inode
 #54003553: comm fdupes: deleted inode referenced: 54003604
[...]
 any clue? :/

I still suspect a hardware problem, just because ext4 is the default
filesystem for 'wheezy' and no-one else reported this yet.

 A few other comments: all my other filesystems are ext4 (except the new
 ext3 one I created this afternoon) and I have never had a problem with them.
 My PC originally had only an IDE bus, so I added two PCI controllers to
 get SATA.
 The first two disks, on the first controller (a big LVM), have never had a
 problem with ext4.
 The other two disks, on the second controller (a different brand), make up
 the RAID1 device.

So maybe the second controller (or its driver) is faulty.  Can you try
using the RAID-1 disks connected to the first controller, with nothing
connected to the second controller?  Or do you need the first two disks
in order to boot?

Ben.

-- 
Ben Hutchings
Knowledge is power.  France is bacon.



2013-06-27 Thread Martin Braure de Calignon
On Friday 28 June 2013 at 02:31 +0100, Ben Hutchings wrote:
 [...]
 Can you try
 using the RAID-1 disks connected to the first controller, with nothing
 connected to the second controller?  Or do you need the first two disks
 in order to boot?

I can try that tomorrow, yeah :). I don't need them in order to boot.
Thanks!

