Re: ad0: FAILURE - WRITE_DMA

2004-10-29 Thread Neil Hoggarth
On 8th October, Mikhail P. [EMAIL PROTECTED] reported the error:

ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
LBA=268435455

On Sun, 10 Oct 2004, Søren Schmidt wrote:

> so that leaves the disks for scrutiny. One thing to try is change the
> tripping point where we switch from 28bit mode to 48 bit mode, could be
> a 1 off error in the firmware...

This sounds very possible to me. I have been experiencing the same
error, on a system that I've been trying to set up using 5.3-RC1 and
a new 160 GB SATA drive. My hardware is:

atapci0: <SiI 3112 SATA150 controller> port
0xb000-0xb00f,0xac00-0xac03,0xa800-0xa807,0xa400-0xa403,0xa000-0xa007 mem
0xdf081000-0xdf0811ff irq 18 at device 11.0 on pci1
ad4: 152627MB <ST3160023AS/3.18> [310101/16/63] at ata2-master SATA150

(I notice that Mikhail and I both have Seagate drives ...).

I had problems with a filesystem on a partition which crossed the
LBA=268435455 threshold. After googling and reading this thread and
Søren's posting, I tried removing the filesystem and making a little
1000 sector partition which straddled the lba48 transition sector - I was
able to get read and write failure messages of the above form
reproducibly, by dd-ing between the test partition and /dev/zero.

I edited the /usr/src/sys/dev/ata/ata-lowlevel.c file and reduced the
48-bit trigger level by one:

--- ata-lowlevel.c.orig Fri Oct 29 12:06:09 2004
+++ ata-lowlevel.c  Fri Oct 29 12:05:38 2004
@@ -700,7 +700,7 @@
     ATA_IDX_OUTB(atadev->channel, ATA_ALTSTAT, ATA_A_4BIT);

     /* only use 48bit addressing if needed (avoid bugs and overhead) */
-    if ((lba > 268435455 || count > 256) && atadev->param &&
+    if ((lba > 268435454 || count > 256) && atadev->param &&
         atadev->param->support.command2 & ATA_SUPPORT_ADDRESS48) {

         /* translate command into 48bit version */

and built a new kernel (I'm using the stock GENERIC configuration).

The resulting kernel was able to dd to and from the test partition
without error. I've now created a new filesystem that uses this part
of the disk and restored the contents from backup, and have been
actively using the filesystem for the last day without observing any
further problems.
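
For reference, the threshold arithmetic behind this change can be sketched as follows. 268435455 is 2^28 - 1, the largest sector number a 28-bit LBA command can address. The helper names below (use_lba48_stock, use_lba48_patched) are made up for illustration and are not part of the ata driver; they only mirror the comparison in the diff above and leave out the check that the drive advertises 48-bit support.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest sector number addressable with a 28-bit LBA: 2^28 - 1. */
#define LBA28_MAX ((UINT64_C(1) << 28) - 1)   /* 268435455 */

/* Hypothetical helper: the decision made by the unpatched comparison. */
bool use_lba48_stock(uint64_t lba, unsigned int count)
{
    return lba > LBA28_MAX || count > 256;
}

/* Hypothetical helper: the decision after lowering the threshold by one. */
bool use_lba48_patched(uint64_t lba, unsigned int count)
{
    return lba > LBA28_MAX - 1 || count > 256;
}
```

With these definitions, a single-sector request at LBA 268435455 stays on the 28-bit path in the stock comparison and takes the 48-bit path after the change, which is the one sector where the two kernels behave differently.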

Regards,
-- 
Neil HoggarthDepartmental Computing Manager
[EMAIL PROTECTED]   Laboratory of Physiology
http://www.physiol.ox.ac.uk/~njh/  University of Oxford, UK
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ad0: FAILURE - WRITE_DMA

2004-10-29 Thread soralx

> This sounds very possible to me. I have been experiencing the same
> error, on a system that I've been trying to set up using 5.3-RC1 and
> a new 160 GB SATA drive. My hardware is:
>
> atapci0: <SiI 3112 SATA150 controller> port
> 0xb000-0xb00f,0xac00-0xac03,0xa800-0xa807,0xa400-0xa403,0xa000-0xa007 mem
> 0xdf081000-0xdf0811ff irq 18 at device 11.0 on pci1
> ad4: 152627MB <ST3160023AS/3.18> [310101/16/63] at ata2-master SATA150
>
> (I notice that Mikhail and I both have Seagate drives ...).
>
> I had problems with a filesystem on a partition which crossed the
> LBA=268435455 threshold. After googling and reading this thread and
> Søren's posting, I tried removing the filesystem and making a little
> 1000 sector partition which straddled the lba48 transition sector - I was
> able to get read and write failure messages of the above form
> reproducibly, by dd-ing between the test partition and /dev/zero.

The same problem with similar IDE Seagate HDD: 

ad0: <ST3160023A/3.06> ATA-6 disk at ata0-master
ad0: 152627MB (312581808 sectors), 310101 C, 16 H, 63 S, 512 B
[...]
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
LBA=268435455

It has 312581808 sectors, but reads fail at LBA >= 268435455:

bash-2.05b# dd if=/dev/ad0 of=/dev/null bs=512 skip=268435453
dd: /dev/ad0: Input/output error
2+0 records in
2+0 records out
1024 bytes transferred in 0.163827 secs (6250 bytes/sec)

bash-2.05b# dd if=/dev/ad0 of=/dev/null bs=512 skip=268435454
dd: /dev/ad0: Input/output error
1+0 records in
1+0 records out
512 bytes transferred in 0.156888 secs (3263 bytes/sec)

bash-2.05b# dd if=/dev/ad0 of=/dev/null bs=512 skip=268435455
dd: /dev/ad0: Input/output error
0+0 records in
0+0 records out
0 bytes transferred in 0.149888 secs (0 bytes/sec)
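
To put those sector numbers in context, here is a small sketch that converts them to byte offsets and sizes, assuming the 512-byte sectors reported in the probe line above. The function names are illustrative only.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u  /* bytes per sector, as reported by the drive probe */

/* Byte offset at which a given sector starts. */
uint64_t lba_to_byte_offset(uint64_t lba)
{
    return lba * SECTOR_SIZE;
}

/* Capacity in whole binary megabytes, the unit used in the probe line. */
uint64_t sectors_to_mb(uint64_t sectors)
{
    return sectors * SECTOR_SIZE / (1024u * 1024u);
}
```

sectors_to_mb(312581808) gives 152627, matching the probe line, and lba_to_byte_offset(268435455) is 2^37 - 512 bytes, so the first failing sector is the last one below the 128 GiB mark.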


Decreasing the 48-bit LBA threshold by 1 really helped:

bash-2.05b# dd if=/dev/ad0 bs=512 skip=312581808
0+0 records in
0+0 records out
0 bytes transferred in 0.88 secs (0 bytes/sec)

bash-2.05b# dd if=/dev/ad0 bs=512 skip=312581807
1+0 records in
1+0 records out
512 bytes transferred in 0.019809 secs (25847 bytes/sec)

Timestamp: 0x41826DE9
[SorAlx]  http://cydem.org.ua/
ridin' VN1500-B2



Re: ad0: FAILURE - WRITE_DMA

2004-10-29 Thread Mikhail P.
On Friday 29 October 2004 16:44, [EMAIL PROTECTED] wrote:
> The same problem with similar IDE Seagate HDD:
>
> ad0: <ST3160023A/3.06> ATA-6 disk at ata0-master
> ad0: 152627MB (312581808 sectors), 310101 C, 16 H, 63 S, 512 B
> [...]
> ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
> LBA=268435455

Perhaps it is specific to Seagate drives on FreeBSD 5. The same drives work
with FreeBSD 4 without a glitch.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-29 Thread Mikhail P.
On Friday 29 October 2004 16:50, Mikhail P. wrote:
> Perhaps it is only Seagate - FreeBSD5-related. Same drives, but with
> FreeBSD4 do work well together without a glitch.

Actually, not only Seagates. Something similar happened to me with a 200GB
Western Digital drive on FreeBSD-5.3.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-29 Thread Nguyen Tam Chinh
On Fri, 29 Oct 2004, Mikhail P. wrote:

> On Friday 29 October 2004 16:50, Mikhail P. wrote:
> > Perhaps it is only Seagate - FreeBSD5-related. Same drives, but with
> > FreeBSD4 do work well together without a glitch.
>
> Actually not only seagates.. similar happened on a 200GB Western Digital drive
> to me, FreeBSD-5.3.


In FreeBSD 5.3b7 I have the same problem with a Maxtor 120GB IDE drive:
ad2: 117246MB <Maxtor 6Y120L0/YAR41BW0> [238216/16/63] at ata1-master UDMA66

ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=14301663
ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=14301663
ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=14301663
ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=14301663
ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=14301663
ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=14301663
ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=160532482
ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=209834594
ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=218490706
ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=211340046
ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=209834594
ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=163587418
ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=209834786
ad2: TIMEOUT - READ_DMA retrying (2 retries left) LBA=17312287

-
With best regards,  |The Power to Serve
Nguyen Tam Chinh|  http://www.FreeBSD.org
Loc: sp.cs.msu.ru   |
http://chinhngt.svmgu.com   |  http://www.gnu.org/copyleft/copyleft.html
Tel: +7 905 7814187 |


Re: ad0: FAILURE - WRITE_DMA

2004-10-29 Thread Aaron Glenn
Add Western Digital Raptors to the list as well. However I have not
had a problem since 5.3-BETA3.

aaron.glenn

On Fri, 29 Oct 2004 16:57:33 +, Mikhail P. [EMAIL PROTECTED] wrote:
> On Friday 29 October 2004 16:50, Mikhail P. wrote:
> > Perhaps it is only Seagate - FreeBSD5-related. Same drives, but with
> > FreeBSD4 do work well together without a glitch.
>
> Actually not only seagates.. similar happened on a 200GB Western Digital drive
> to me, FreeBSD-5.3.
>
> regards,
> M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-28 Thread Mikhail P.
On Sunday 10 October 2004 08:59, Søren Schmidt wrote:
> There is definitely something fishy here, since I don't have either the
> disks nor any VIA chips here in the lab I cannot do any testing here.
> However I don't know of any problems with the VIA chips in this regard,
> so that leaves the disks for scrutiny. One thing to try is change the
> tripping point where we switch from 28bit mode to 48 bit mode, could be
> a 1 off error in the firmware...

I apologize for bumping that old thread..
I have received both 200G drives (the ones that were giving me adX: FAILURE - 
WRITE_DMA on 5.2.1 system).
I have plugged both drives into running 4.10 system, re-formatted them to UFS1 
from sysinstall. After filling those drives with 180G of data each (files 
ranging in size from 10k to 1G), I put a lot of load on them (e.g. transferred 
data between other drives in the system, deleted random files, dd, etc) and 
those adX failures did not appear anymore (in fact, I'm running those drives 
on the file server for 5 days now, and there is no single failure/timeout so 
far - system has been very stable all the time on FreeBSD-4.10)

On the side note - I did changes to the tripping point as suggested above and 
re-compiled kernel on 5.2.1 running system - disk operations dramatically 
decreased as expected, but number of timeouts decreased too (per dmesg - 
one-two timeouts in 3-4 days).

I should probably also note another interesting thing - on another system with 
4 hard drives (20G, 60G, 120G, 200G) where I ran RELENG_5 for the past week, 
timeouts and failures were appearing randomly under heavy disk writes.
That system had a mix of filesystems - primary 20G drive had UFS2, and the 
rest of the drives were UFS1 (as they hold data, and I ran 4.7 on that system 
half a year ago) - data transfer between interfaces was horrible, less than 
8-10mb/sec, even when system was IDLE.
After re-installing system to 4.10 (no changes to hardware/etc - all remained 
the same apart from OS), I don't see timeouts/errors anymore, and speed of 
transfers between the drives got back to 20-25mb/sec, that's including that 
system isn't IDLE.

There is also a third system with 2 x 200G ide drives and FBSD-5.2.1. Today, I 
had to transfer approx. 160G of data from one of the drives to another system 
via NFS, and unfortunately some files could not be transferred due to the same 
ad1 failures as above. I'm going to mount the drive read-only to finish the 
transfer.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-14 Thread Martin Nilsson
Martin Nilsson wrote / skrev:
> Something is rotten with ATA on 5.x (or I have a rotten motherboard!)
> I have an E7320 Lindenhurst VS 6300ESB box with 2*3GHz EM64T Xeons and
> 2*80GB Seagate SATA disks. Sometimes when booting the whole ATA/SATA
> system hangs after two READ_DMA or WRITE_DMA timeout errors. This seems
> to be more common when running as AMD64 than i386. I can't remember any
> hangs after the machine has been up nicely for a couple of min.
Today when starting the box with i386 RELENG_5 I got the following:
ad4: TIMEOUT - WRITE_DMA LBA=4798015
ad4: TIMEOUT - WRITE_DMA LBA=146847331
panic: initiate_write_inodeblock_ufs2: already started
After a reboot & fsck it works nice!
A verbose dmesg (from a good boot) is here: 
http://www.gneto.com/FreeBSD/i386-dmesg.boot

I really don't know what to do with this box, maybe put regular ATA or SCSI 
disks in it?

/Martin


Re: ad0: FAILURE - WRITE_DMA

2004-10-14 Thread Mikhail P.
On Thursday 14 October 2004 19:59, Martin Nilsson wrote:
> I really don't know what to do with this box, maybe put regular ATA or SCSI
> disks in it?

Well, there are no problems with SCSI to my knowledge. 5.3 and 5.2.1 work well 
on my SCSI servers; only the ATA driver has problems.
It would be sad to still have these problems when 5.3 goes -STABLE. On the 
other hand, I expect more people will hit the problem and send more 
debugging information, so that it gets solved quicker.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-13 Thread Mikhail P.
On Sunday 10 October 2004 23:30, Mikhail P. wrote:
> On Saturday 09 October 2004 17:01, Mikhail P. wrote:
> > I also got another message off-list, where author suggested to play with
> > UDMA values. I switched from UDMA100 to UDMA66. System's uptime is 12
> > hours, and no timeouts so far.. but I'm quite sure they will get back in
> > few days.
>
> 1.5 days of uptime, running in UDMA66 changes nothing. Still getting

Well, now those timeouts popped up on 5.3-BETA7 system with 4 IDE drives.. 
They start appearing with high disk activity.
System had FreeBSD-4.7 prior to that, and has been rock solid for almost a 
year. Drives have no problems, that's for sure (4.7 did not show up any 
timeouts, with uptime for months)..

I don't know what to think - is ATA driver horribly broken in 5.x?

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-13 Thread Søren Schmidt
Mikhail P. wrote:
> On Sunday 10 October 2004 23:30, Mikhail P. wrote:
> > On Saturday 09 October 2004 17:01, Mikhail P. wrote:
> > > I also got another message off-list, where author suggested to play with
> > > UDMA values. I switched from UDMA100 to UDMA66. System's uptime is 12
> > > hours, and no timeouts so far.. but I'm quite sure they will get back in
> > > few days.
> > 1.5 days of uptime, running in UDMA66 changes nothing. Still getting
>
> Well, now those timeouts popped up on 5.3-BETA7 system with 4 IDE drives..
> They start appearing with high disk activity.
> System had FreeBSD-4.7 prior to that, and has been rock solid for almost a
> year. Drives have no problems, that's for sure (4.7 did not show up any
> timeouts, with uptime for months)..
>
> I don't know what to think - is ATA driver horribly broken in 5.x?
Well, that's not up to me to judge I guess, but have you tried to change 
the tripping point for using 48bit addressing as I suggested earlier?
I can't reproduce this problem with any of the shelfmeters of ATA gear I 
have here, so your help is needed or it will stay horribly broken :)

--
-Søren


Re: ad0: FAILURE - WRITE_DMA

2004-10-13 Thread Martin Nilsson
Mikhail P. wrote:
> Well, now those timeouts popped up on 5.3-BETA7 system with 4 IDE drives..
> They start appearing with high disk activity.
> System had FreeBSD-4.7 prior to that, and has been rock solid for almost a
> year. Drives have no problems, that's for sure (4.7 did not show up any
> timeouts, with uptime for months)..
>
> I don't know what to think - is ATA driver horribly broken in 5.x?
Something is rotten with ATA on 5.x (or I have a rotten motherboard!)
I have an E7320 Lindenhurst VS ICH5R box with 2*3GHz EM64T Xeons and 
2*80GB Seagate SATA disks. Sometimes when booting the whole ATA/SATA 
system hangs after two READ_DMA or WRITE_DMA timeout errors. This seems 
to be more common when running as AMD64 than i386. I can't remember any 
hangs after the machine has been up nicely for a couple of min.

The 1U box is so noisy that I can't be in the apartment at the same time 
without going crazy, this and that I can't reproduce it reliably 
effectively prevents most debugging attempts.

/Martin


Re: ad0: FAILURE - WRITE_DMA

2004-10-13 Thread Mikhail P.
On Wednesday 13 October 2004 13:51, Søren Schmidt wrote:
> Well, that's not up to me to judge I guess, but have you tried to change
> the tripping point for using 48bit addressing as I suggested earlier?

How would one do that? In the BIOS?
Forgive my ignorance.

> I can't reproduce this problem with any of the shelfmeters of ATA gear I
> have here, so your help is needed or it will stay horribly broken :)

The 5.3-BETA7 box I was referring to is a completely different machine from
the one I posted about initially (2 x 200GB IDE).
This machine has 4 IDE drives:
20GB Seagate
60GB IBM
120GB WDC
200GB WDC

and it has a P4 motherboard (the CPU is a 1.5GHz P4).

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-13 Thread Aaron Glenn
I used to get that error prior to 5.3-BETA3 (5.2.1-RELEASE, and all
previous 5.3-BETA's). Randomly after reboot the machine would spew
about 100 of these and then hardlock. I've got two identical boxes
running BETA3 and BETA7 without any issues.  Intel 6300ESB controller
and Western Digital Enterprise Serial ATA Raptor drives are the
hardware.

I thought about posting the issue, but decided against it since it was
BETA 1 or BETA 2 and 5.2.1 was, honestly, nothing but pure crap.

Regards,
aaron.glenn

On Sun, 10 Oct 2004 23:30:26 +, Mikhail P. [EMAIL PROTECTED] wrote:
> ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
> LBA=268435455
> ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
> LBA=268435455


Re: ad0: FAILURE - WRITE_DMA

2004-10-10 Thread Mikhail P.
On Saturday 09 October 2004 17:01, Mikhail P. wrote:
> I also got another message off-list, where author suggested to play with
> UDMA values. I switched from UDMA100 to UDMA66. System's uptime is 12
> hours, and no timeouts so far.. but I'm quite sure they will get back in
> few days.

1.5 days of uptime, running in UDMA66 changes nothing. Still getting

ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
LBA=268435455
ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
LBA=268435455

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Dag-Erling Smørgrav
Mikhail P. [EMAIL PROTECTED] writes:
> I reloaded OS on the new drives, then restored all data from the old drives.
> All seemed to be fine for 2 months now... but today I woke up, and noticed
> these messages again.

A lot of them, or just one or two?  Some ATA drives will spin down at
regular intervals to recalibrate, and you'll get a harmless timeout if
you try to write to the disk while it's doing that.

DES
-- 
Dag-Erling Smørgrav - [EMAIL PROTECTED]


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Mikhail P.
On Saturday 09 October 2004 15:01, Dag-Erling Smørgrav wrote:
> Mikhail P. [EMAIL PROTECTED] writes:
> > I reloaded OS on the new drives, then restored all data from the old
> > drives. All seemed to be fine for 2 months now... but today I woke up,
> > and noticed these messages again.
>
> A lot of them, or just one or two?  Some ATA drives will spin down at
> regular intervals to recalibrate, and you'll get a harmless timeout if
> you try to write to the disk while it's doing that.

Unfortunately, all the drives (so far - four 200GB drives).
I'm having the previous two drives shipped here within two weeks.
Most likely these drives aren't corrupted actually.. will stress them locally 
here.


> DES

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Dmitry Morozovsky
On Sat, 9 Oct 2004, Mikhail P. wrote:

MP> > > I reloaded OS on the new drives, then restored all data from the old
MP> > > drives. All seemed to be fine for 2 months now... but today I woke up,
MP> > > and noticed these messages again.
MP> >
MP> > A lot of them, or just one or two?  Some ATA drives will spin down at
MP> > regular intervals to recalibrate, and you'll get a harmless timeout if
MP> > you try to write to the disk while it's doing that.
MP>
MP> Unfortunately, all the drives (so far - four 200GB drives).
MP> I'm having the previous two drives shipped here within two weeks.
MP> Most likely these drives aren't corrupted actually.. will stress them locally
MP> here.

Well, I suppose Dag-Erling means 'a lot of errors' as opposed to one or two
arising sporadically...

Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]

*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- [EMAIL PROTECTED] ***



Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Dag-Erling Smørgrav
Mikhail P. [EMAIL PROTECTED] writes:
> On Saturday 09 October 2004 15:01, Dag-Erling Smørgrav wrote:
> > A lot of them, or just one or two?  Some ATA drives will spin down at
> > regular intervals to recalibrate, and you'll get a harmless timeout if
> > you try to write to the disk while it's doing that.
> Unfortunately, all the drives (so far - four 200GB drives).

I meant a lot of timeouts, not a lot of drives.  If you only get
one or two timeouts per drive at regular intervals (say, once a
month), they're just recalibrating and there's nothing to worry about.

BTW, are you using ataidle or anything similar?

DES
-- 
Dag-Erling Smørgrav - [EMAIL PROTECTED]


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Mikhail P.
On Saturday 09 October 2004 16:23, Dag-Erling Smørgrav wrote:
> Mikhail P. [EMAIL PROTECTED] writes:
> > On Saturday 09 October 2004 15:01, Dag-Erling Smørgrav wrote:
> > > A lot of them, or just one or two?  Some ATA drives will spin down at
> > > regular intervals to recalibrate, and you'll get a harmless timeout if
> > > you try to write to the disk while it's doing that.
> >
> > Unfortunately, all the drives (so far - four 200GB drives).
>
> I meant a lot of timeouts, not a lot of drives.  If you only get
> one or two timeouts per drive at regular intervals (say, once a
> month), they're just recalibrating and there's nothing to worry about.


Well, there is no pattern. Often it just happens by itself - the system runs
fine for 3-10 days (no warnings, no timeouts), and after that I start seeing
lots of these. To be more exact: I have a user whose home dir is /home/user,
and the user uses FTP to upload/download files under that directory. Let's
say he has 5k files in total (ranging in size from 1kb to 20mb). When the
user tries to access certain files (either to continue an upload or to
continue a download), the system spews lots of these timeouts and an
input/output error occurs. For example, yesterday it showed 360 of these
messages during a 12 hour period, and unfortunately the system locked up
while I was sleeping - the last message in /var/log/messages was about an
ad0 failure.
I'm not exactly sure on which files it timed out yesterday, but I do know
under which directory it happened - the directory has 20k files in it (not
in a single dir, but including subdirs). Maybe someone knows a quick way I
could open every file under that directory - this could probably help to
identify exactly on which files the timeouts happened.

Before replacing the drives, I had that server up for 120 days, and it did 
spew these messages (more and more with every day, started on about 90th day 
of uptime count). After rebooting system, it asked for fsck, which I did run, 
but it showed some softupdates inconsistencies, and refused to mount /home in 
rw.

By the way, I just ran fsck on rw mounted /home (that's where those timeouts 
occurred yesterday), and I have attached its output.

I also got another message off-list, where author suggested to play with UDMA 
values. I switched from UDMA100 to UDMA66. System's uptime is 12 hours, and 
no timeouts so far.. but I'm quite sure they will get back in few days.

> BTW, are you using ataidle or anything similar?

nope, nothing.


> DES

regards,
M.
[EMAIL PROTECTED]:/usr/local/etc/rc.d fsck /home
** /dev/ad0s1g (NO WRITE)
** Last Mounted on /home
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
LINK COUNT FILE I=8715003  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715004  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715005  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715006  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715007  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715008  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715009  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715010  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715016  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715017  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715080  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715086  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715087  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715093  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715094  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715100  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715101  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715107  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715129  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715142  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715143  

Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Søren Schmidt
Mikhail P. wrote:
> Hi,
> This question probably has been discussed numerous times, but I'm somewhat
> unsure what really causes ATA failures..
>
> I have pretty basic server here which has two IDE drives - each is 200GB.
> System is FreeBSD-5.2.1-p9
> That server has been setup about 9 months ago, and just about 3 months ago my
> logs quickly filled up with:
> ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
> LBA=268435455
Hmm, that means that the drive couldn't find the sector you asked for.
Now, what has me wondering is that it is the exact sector where we 
switch to 48bit addressing mode. Anyhow, I've just checked on the old 
Maxtor preproduction 48bit reference drive I have here and it crosses 
the limit with no problems.
What controller are you using? Not all of them support 48bit mode correctly.
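
As a side note on reading the log line quoted above: the status and error values are hexadecimal bit masks from the ATA status and error registers. A minimal sketch of the decoding follows; the macro and function names are illustrative and are not the identifiers used in the ata driver.

```c
#include <stdint.h>

/* ATA status register bits that appear in the quoted message. */
#define ST_ERROR 0x01u   /* an error occurred; details are in the error register */
#define ST_DSC   0x10u   /* device seek complete */
#define ST_READY 0x40u   /* device ready */

/* ATA error register bit that appears in the quoted message. */
#define ER_ID_NOT_FOUND 0x10u   /* requested sector ID not found */

/* Nonzero if the status/error pair means "sector not found". */
int is_id_not_found(uint8_t status, uint8_t error)
{
    return (status & ST_ERROR) != 0 && (error & ER_ID_NOT_FOUND) != 0;
}
```

Here status 0x51 is ST_READY | ST_DSC | ST_ERROR and error 0x10 is ER_ID_NOT_FOUND, so is_id_not_found(0x51, 0x10) is true: the drive was ready, but rejected the address it was given.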

--
-Søren


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Mikhail P.
On Saturday 09 October 2004 18:26, Søren Schmidt wrote:
> Hmm, that means that the drive couldn't find the sector you asked for.
> Now, what has me wondering is that it is the exact sector where we
> switch to 48bit addressing mode. Anyhow, I've just checked on the old
> Maxtor preproduction 48bit reference drive I have here and it crosses
> the limit with no problems.
> What controller are you using? Not all of them support 48bit mode correctly.

It's a VIA motherboard (not sure about the model name).

Here's the info regarding the ATA controller from dmesg:
atapci0: <VIA 8235 UDMA133 controller> port 0xac00-0xac0f at device 17.1 on pci0

I will be able to test the drives (the ones which I thought of as failed) on 
another board within 10 days or so.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Dag-Erling Smørgrav
Mikhail P. [EMAIL PROTECTED] writes:
> Well, there is no pattern.  [...]

Could be bad cables, could be bad drives.  Environmental factors are a
more likely cause, though.  Are all the failing disks in the same
machine?  If they're in separate machines, are those rack-mount, or
are they standing on a table or shelf?  If a shelf, what kind?  What's
the ambient temperature in the machine room?

DES
-- 
Dag-Erling Smørgrav - [EMAIL PROTECTED]


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Mikhail P.
On Saturday 09 October 2004 20:53, Dag-Erling Smørgrav wrote:
> Mikhail P. [EMAIL PROTECTED] writes:
> > Well, there is no pattern.  [...]
>
> Could be bad cables, could be bad drives.  Environmental factors are a
> more likely cause, though.  Are all the failing disks in the same
> machine?  If they're in separate machines, are those rack-mount, or
> are they standing on a table or shelf?  If a shelf, what kind?  What's
> the ambient temperature in the machine room?

Could be cables - I will get a replacement to verify that. I'm less sure it is 
drives. Yes, all 4 drives were in the same machine.
Machine is a regular 2U rackmount chassis (one CPU), with proper airflow. Each 
drive has its individual aluminum fan as well. Chassis sits in a 47U cabinet, 
datacenter environment, with lots of free space around. So I'm quite sure it 
is not cooling/dust issues..
Well, unfortunately, I don't have access to hardware myself, so I can't do any 
hardware related tasks. As said, I will get those two drives shipped to me, 
and will then see myself if it is really hdd issue, or something else..


> DES

regards,
M.


ad0: FAILURE - WRITE_DMA

2004-10-08 Thread Mikhail P.
Hi,

This question probably has been discussed numerous times, but I'm somewhat 
unsure what really causes ATA failures..

I have pretty basic server here which has two IDE drives - each is 200GB. 
System is FreeBSD-5.2.1-p9
That server was set up about 9 months ago, and about 3 months ago my 
logs quickly filled up with:
ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND>
LBA=268435455

Server was still running, but I was unable to write to certain files/folders 
on the drive - whenever I tried to access $HOME/.fetchmailrc, for example, it 
wouldn't read/write the file and system would fire up a message similar to 
above.
After a couple of reboots, I started getting more and more of these, and the
server was unusable, so I had to shut down all services and mount the drives
read-only to back up the data.

At first, I thought, this could be related to poor cooling of the parts, so 
drives could easily overheat in the long run.

After successful backup, I purchased two new drives, with two aluminum drive 
fans. New drives' models were identical to the old ones -
ad0 ST3200822A/3.01 ATA/ATAPI rev 6
which is Seagate's 200GB drive.

I reloaded OS on the new drives, then restored all data from the old drives. 
All seemed to be fine for 2 months now... but today I woke up, and noticed 
these messages again.

So now the whole situation leads me to a question - are there some issues with 
the ATA driver/system [or filesystem?] on FreeBSD-5.2.1? What can I do to 
stop these frequent failures? How do I diagnose the drives (and see whether 
it is really a hardware issue or something else) remotely (I don't have local 
access to the server - it is sitting overseas)?
It seems to me that if I continue running system as now, I will have these 
failed drives every 1-2 months! It does not sound like a normal situation.

I am running FreeBSD-5.2.1-p9, filesystem is UFS2, and all partitions [except 
for /] have softupdates on. Kernel is built on GENERIC, with only added 
ipfw options.


regards,
M.