Re: ad0: FAILURE - WRITE_DMA

2004-10-10 Thread Mikhail P.
On Saturday 09 October 2004 17:01, Mikhail P. wrote:
> I also got another message off-list, where author suggested to play with
> UDMA values. I switched from UDMA100 to UDMA66. System's uptime is 12
> hours, and no timeouts so far.. but I'm quite sure they will get back in
> few days.

After 1.5 days of uptime, running in UDMA66 has changed nothing. I'm still
getting:

ad0: FAILURE - READ_DMA status=51 error=10 LBA=268435455
ad0: FAILURE - READ_DMA status=51 error=10 LBA=268435455

regards,
M.
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: ad0: FAILURE - WRITE_DMA

2004-10-13 Thread Mikhail P.
On Sunday 10 October 2004 23:30, Mikhail P. wrote:
> On Saturday 09 October 2004 17:01, Mikhail P. wrote:
> > I also got another message off-list, where author suggested to play with
> > UDMA values. I switched from UDMA100 to UDMA66. System's uptime is 12
> > hours, and no timeouts so far.. but I'm quite sure they will get back in
> > few days.
>
> 1.5 days of uptime, running in UDMA66 changes nothing. Still getting

Well, now those timeouts have popped up on a 5.3-BETA7 system with 4 IDE
drives. They start appearing under high disk activity.
That system ran FreeBSD-4.7 before, and was rock solid for almost a year. The
drives have no problems, that's for sure (4.7 never showed any timeouts, with
uptimes of months).

I don't know what to think - is the ATA driver horribly broken in 5.x?

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-13 Thread Mikhail P.
On Wednesday 13 October 2004 13:51, Søren Schmidt wrote:
> Well, that's not up to me to judge I guess, but have you tried to change
> the tripping point for using 48bit addressing as I suggested earlier?

How would one do that? In the BIOS?
Forgive my ignorance.

> I can't reproduce this problem with any of the shelfmeters of ATA gear I
> have here, so your help is needed or it will stay horribly broken :)

The 5.3-BETA7 box I was referring to is a completely different machine from
the one in my initial post (2 x 200GB IDE).
This machine has 4 IDE drives -
20GB Seagate
60GB IBM
120GB WDC
200GB WDC

and it has a P4 motherboard (the CPU is a 1.5GHz P4).

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-14 Thread Mikhail P.
On Thursday 14 October 2004 19:59, Martin Nilsson wrote:
> I really don't know what to do with this box, maybe put regular ATA or SCSI
> disks in it?

Well, to my knowledge there are no problems with SCSI: 5.3 and 5.2.1 work
well on my SCSI servers. Only the ATA driver has problems.
It would be sad to still have these problems when 5.3 becomes -STABLE. On the
other hand, I expect more people to hit the problem and send more debugging
information, so that it gets solved quicker.

regards,
M.


ad0: FAILURE - WRITE_DMA

2004-10-08 Thread Mikhail P.
Hi,

This question probably has been discussed numerous times, but I'm somewhat 
unsure what really causes ATA failures..

I have a pretty basic server here with two IDE drives of 200GB each.
The system is FreeBSD-5.2.1-p9.
The server was set up about 9 months ago, and about 3 months ago my logs
quickly filled up with:
ad0: FAILURE - WRITE_DMA status=51 error=10 LBA=268435455

The server was still running, but I was unable to write to certain
files/folders on the drive - whenever I tried to access $HOME/.fetchmailrc,
for example, it wouldn't read/write the file and the system would log a
message similar to the one above.
After a couple of reboots I started getting more and more of these and the
server became unusable, so I had to shut down all services and mount the
drives read-only to back up the data.

At first I thought this could be related to poor cooling, so that the drives
would overheat in the long run.

After successful backup, I purchased two new drives, with two aluminum drive 
fans. New drives' models were identical to the old ones -
ad0  ATA/ATAPI rev 6
which is Seagate's 200GB drive.

I reloaded OS on the new drives, then restored all data from the old drives. 
All seemed to be fine for 2 months now... but today I woke up, and noticed 
these messages again.

So now the whole situation leads me to a question - are there issues with
the ATA driver/system [or filesystem?] on FreeBSD-5.2.1? What can I do to
stop these frequent failures? How do I diagnose the drives remotely (and see
whether it is really a hardware issue or something else)? I don't have local
access to the server - it is sitting overseas.
It seems to me that if I keep running the system as it is now, I will have
failed drives every 1-2 months! That does not sound like a normal situation.

I am running FreeBSD-5.2.1-p9, filesystem is UFS2, and all partitions [except 
for /] have softupdates "on". Kernel is built on GENERIC, with only added 
ipfw options.


regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Mikhail P.
On Saturday 09 October 2004 15:01, Dag-Erling Smørgrav wrote:
> "Mikhail P." <[EMAIL PROTECTED]> writes:
> > I reloaded OS on the new drives, then restored all data from the old
> > drives. All seemed to be fine for 2 months now... but today I woke up,
> > and noticed these messages again.
>
> A lot of them, or just one or two?  Some ATA drives will spin down at
> regular intervals to recalibrate, and you'll get a harmless timeout if
> you try to write to the disk while it's doing that.

Unfortunately, it is all the drives (four 200GB drives so far).
I'm having the previous two drives shipped here within two weeks.
Most likely those drives aren't actually damaged; I will stress them locally
here.

>
> DES

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Mikhail P.
On Saturday 09 October 2004 16:23, Dag-Erling Smørgrav wrote:
> "Mikhail P." <[EMAIL PROTECTED]> writes:
> > On Saturday 09 October 2004 15:01, Dag-Erling Smørgrav wrote:
> > > A lot of them, or just one or two?  Some ATA drives will spin down at
> > > regular intervals to recalibrate, and you'll get a harmless timeout if
> > > you try to write to the disk while it's doing that.
> >
> > Unfortunately, all the drives (so far - four 200GB drives).
>
> I meant "a lot of timeouts", not "a lot of drives".  If you only get
> one or two timeouts per drive at regular intervals (say, once a
> month), they're just recalibrating and there's nothing to worry about.
>

Well, there is no pattern. Often it just happens by itself - the system runs
fine for 3-10 days (no warnings, no timeouts), and then I start seeing lots
of these. To be more exact: say I have a user whose home dir is /home/user,
and who uses FTP to upload/download files under that directory. Let's say he
has 5k files in total (ranging in size from 1kb to 20mb). When the user tries
to access certain files (either to continue an upload or to continue a
download), the system spews lots of these timeouts and basically an
"input/output error" occurs. For example, yesterday it showed 360 of these
messages during a 12 hour period, and unfortunately the system locked up
while I was sleeping - the last message in /var/log/messages was about an ad0
failure.
I'm not exactly sure which files it timed out on yesterday, but I do know
under which directory it happened - that directory has 20k files in it (not
in a single dir, but including subdirs). Maybe someone knows a quick way to
open every file under that directory - this could help to identify exactly
which files the timeouts happened on.
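
One possible way to open every file under a directory is sketched below in
Python. It is only a sketch: the function name find_unreadable and the 1MB
chunk size are made up for illustration, and it only reports errors that the
kernel passes back to the reading process. Files that are already in the
buffer cache may be served without touching the disk at all.

```python
import os
import sys


def find_unreadable(root, chunk_size=1 << 20):
    """Read every regular file under root and collect (path, error) pairs."""
    failures = []

    def on_walk_error(exc):
        # os.walk calls this when a directory itself cannot be listed.
        failures.append((exc.filename, str(exc)))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files such as FIFOs or sockets.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    # Read the whole file so every block is requested.
                    while f.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    for path, err in find_unreadable(sys.argv[1]):
        print(path, err)
```

Run it with the directory as its only argument and compare the reported paths
with the timestamps of the ad0 messages in /var/log/messages.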

Before replacing the drives, I had that server up for 120 days, and it did
spew these messages (more and more every day, starting at about the 90th day
of uptime). After I rebooted the system it asked for fsck, which I ran, but
it showed some softupdates inconsistencies and refused to mount /home rw.

By the way, I just ran fsck on the rw-mounted /home (that's where those
timeouts occurred yesterday), and I have attached its output.

I also got another message off-list, whose author suggested playing with the
UDMA values. I switched from UDMA100 to UDMA66. The system's uptime is 12
hours, with no timeouts so far.. but I'm quite sure they will come back in a
few days.

> BTW, are you using ataidle or anything similar?

nope, nothing.

>
> DES

regards,
M.
[EMAIL PROTECTED]:/usr/local/etc/rc.d> fsck /home
** /dev/ad0s1g (NO WRITE)
** Last Mounted on /home
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
LINK COUNT FILE I=8715003  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715004  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715005  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715006  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715007  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715008  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715009  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715010  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715016  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715017  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715080  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715086  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715087  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715093  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715094  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715100  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715101  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715107  OWNER=noc MODE=0
SIZE=0 MTIME=Oct  9 09:50 2004  COUNT 0 SHOULD BE -1
ADJUST? no

LINK COUNT FILE I=8715129  OWNER=noc MOD

Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Mikhail P.
On Saturday 09 October 2004 18:26, Søren Schmidt wrote:
> Hmm, that means that the drive couldn't find the sector you asked for.
> Now, what has me wondering is that it is the exact sector where we
> switch to 48bit addressing mode. Anyhow, I've just checked on the old
> Maxtor preproduction 48bit reference drive I have here and it crosses
> the limit with no problems.
> What controller are you using? Not all support 48bit mode correctly..
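
For reference, the numbers in the failure message can be decoded with a few
lines of Python. This is a rough sketch that assumes the status= and error=
values are printed in hexadecimal; the bit names follow the usual ATA
register definitions, and the helper name decode is made up.

```python
# ATA status register bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
# ATA error register bits (only the ones relevant here are named).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# Highest sector number that fits in a 28-bit LBA.
MAX_LBA28 = 2**28 - 1


def decode(value, names):
    """Return the names of the bits set in value, highest bit first."""
    out = []
    for bit in (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01):
        if value & bit:
            out.append(names.get(bit, "bit 0x%02x" % bit))
    return out


if __name__ == "__main__":
    print(decode(int("51", 16), STATUS_BITS))  # ['DRDY', 'DSC', 'ERR']
    print(decode(int("10", 16), ERROR_BITS))   # ['IDNF']
    print(268435455 == MAX_LBA28)              # True
```

Read this way, error=10 sets only IDNF ("ID not found"), which fits the
statement that the drive could not find the requested sector, and
LBA=268435455 is the last sector a 28bit command can address.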

It's a VIA motherboard (not sure about the model name).

Here's the info about the ATA controller from dmesg:
atapci0:  port 0xac00-0xac0f at device 17.1 on pci0

I will be able to test the drives (the ones I thought of as "failed") on
another board within 10 days or so.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-09 Thread Mikhail P.
On Saturday 09 October 2004 20:53, Dag-Erling Smørgrav wrote:
> "Mikhail P." <[EMAIL PROTECTED]> writes:
> > Well, there is no pattern.  [...]
>
> Could be bad cables, could be bad drives.  Environmental factors are a
> more likely cause, though.  Are all the failing disks in the same
> machine?  If they're in separate machines, are those rack-mount, or
> are they standing on a table or shelf?  If a shelf, what kind?  What's
> the ambient temperature in the machine room?

Could be cables - I will get a replacement to verify that. I'm less sure it
is the drives. Yes, all 4 drives were in the same machine.
The machine is a regular 2U rackmount chassis (one CPU) with proper airflow.
Each drive has its own aluminum fan as well. The chassis sits in a 47U
cabinet in a datacenter environment, with lots of free space around it. So
I'm quite sure it is not a cooling/dust issue.
Unfortunately, I don't have access to the hardware myself, so I can't do any
hardware-related tasks. As I said, I will get those two drives shipped to me,
and will then see for myself whether it is really an HDD issue or something
else.

>
> DES

regards,
M.


question on vinum

2004-10-26 Thread Mikhail P.
Hello,

On our next file server, I want to make one large FTP area out of 4 drives (so 
that system sees them as one volume). Vinum appears to be exactly what I 
need.

I haven't worked with Vinum before, but I have heard a lot about it. My
question is how to implement the above (unite four drives into a single
volume) using Vinum, and what will happen if, say, one drive in the volume
fails? Do I lose all the data, or can I just unplug the drive and tell Vinum
to use the remaining drives with the data each drive holds? I'm not looking
for a fault-tolerant solution.

From my understanding, in above scenario, Vinum will first fill up the first 
drive, then second, etc.
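
The "first drive, then second" layout can be modelled in a few lines of
Python. This is a sketch of plain concatenation, not of Vinum itself; the
four drive sizes (in GB) and the names locate and lost_range are made up for
illustration.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical drive sizes in GB, concatenated in this order.
SIZES = [200, 200, 200, 200]


def locate(offset, sizes):
    """Map a volume offset to (drive index, offset within that drive)."""
    ends = list(accumulate(sizes))  # end offset of each drive in the volume
    if not 0 <= offset < ends[-1]:
        raise ValueError("offset outside the volume")
    drive = bisect_right(ends, offset)
    start = ends[drive - 1] if drive else 0
    return drive, offset - start


def lost_range(failed, sizes):
    """Return the (start, end) volume range that lived on the failed drive."""
    ends = list(accumulate(sizes))
    start = ends[failed - 1] if failed else 0
    return start, ends[failed]


if __name__ == "__main__":
    print(locate(450, SIZES))    # (2, 50)
    print(lost_range(1, SIZES))  # (200, 400)
```

In this model a failed drive takes a contiguous range of the volume with it.
Whether the rest stays usable depends on what the filesystem stored in that
range, since a filesystem on top of the volume may spread files and metadata
over all four drives.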

I have read the handbook articles, and I have a general understanding of
Vinum. I'm particularly interested to know whether I will still be able to
use the volume if a drive fails.
Some minimal configuration examples would be greatly appreciated!

Server will be running FreeBSD-4.10.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-28 Thread Mikhail P.
On Sunday 10 October 2004 08:59, Søren Schmidt wrote:
> There is definitely something fishy here, since I don't have either the
> disks or any VIA chips here in the lab I cannot do any testing here.
> However I don't know of any problems with the VIA chips in this regard,
> so that leaves the disks for scrutiny. One thing to try is to change the
> tripping point where we switch from 28bit mode to 48bit mode, could be
> an off-by-one error in the firmware...

I apologize for bumping this old thread..
I have received both 200G drives (the ones that were giving me "adX: FAILURE -
WRITE_DMA" on the 5.2.1 system).
I plugged both drives into a running 4.10 system and re-formatted them to
UFS1 from sysinstall. After filling each drive with 180G of data (files
ranging in size from 10k to 1G), I put a lot of load on them (e.g. transferred
data between other drives in the system, deleted random files, "dd", etc.)
and those adX failures did not appear anymore. In fact, I have been running
those drives on the file server for 5 days now without a single
failure/timeout - the system has been very stable the whole time on
FreeBSD-4.10.

On a side note - I changed the tripping point as suggested above and
re-compiled the kernel on the running 5.2.1 system. Disk performance dropped
dramatically, as expected, but the number of timeouts dropped too (per dmesg,
one or two timeouts in 3-4 days).
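
To illustrate what "tripping point" means here, below is a small Python model
of the decision between a 28bit and a 48bit command. It is a hypothetical
sketch, not the FreeBSD ATA driver code: the function use_48bit and its trip
parameter are made up, and only the two limits (28bit LBAs end at 268435455,
and a 28bit command moves at most 256 sectors) are taken from the ATA command
set.

```python
MAX_LBA28 = 2**28 - 1  # 268435455, last sector reachable with a 28bit LBA
MAX_COUNT28 = 256      # most sectors one 28bit command can transfer


def use_48bit(lba, count, trip=MAX_LBA28):
    """Hypothetical rule: True if the request should use a 48bit command.

    The request covers sectors lba .. lba + count - 1. It stays in 28bit
    mode only while its last sector is at or below the tripping point.
    """
    last = lba + count - 1
    return last > trip or count > MAX_COUNT28


if __name__ == "__main__":
    # A one-sector request at the very last 28bit sector.
    print(use_48bit(MAX_LBA28, 1))                      # False
    # The same request with the tripping point lowered by one sector.
    print(use_48bit(MAX_LBA28, 1, trip=MAX_LBA28 - 1))  # True
```

If a drive's firmware were off by one at the boundary, lowering trip by a
sector or more would keep 28bit commands away from LBA 268435455, which is
the sector that shows up in every failure message in this thread.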

I should probably also note another interesting thing - on another system
with 4 hard drives (20G, 60G, 120G, 200G), where I ran RELENG_5 for the past
week, timeouts and failures appeared randomly under heavy disk writes.
That system had a mix of filesystems - the primary 20G drive had UFS2, and
the rest of the drives were UFS1 (as they hold data, and I ran 4.7 on that
system half a year ago). Data transfer between interfaces was horrible, less
than 8-10mb/sec, even when the system was idle.
After re-installing that system with 4.10 (no changes to hardware etc. -
everything but the OS stayed the same), I don't see timeouts/errors anymore,
and the speed of transfers between the drives is back to 20-25mb/sec, even
when the system isn't idle.

There is also a third system with 2 x 200G IDE drives and FreeBSD-5.2.1.
Today I had to transfer approx. 160G of data from one of the drives to
another system via NFS, and unfortunately some files could not be transferred
due to the same kind of ad1 failures as above. I'm going to mount the drive
"ro" to finish the transfer.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-29 Thread Mikhail P.
On Friday 29 October 2004 16:44, [EMAIL PROTECTED] wrote:
> The same problem with similar IDE Seagate HDD:
>
> ad0:  ATA-6 disk at ata0-master
> ad0: 152627MB (312581808 sectors), 310101 C, 16 H, 63 S, 512 B
> [...]
> ad0: FAILURE - READ_DMA status=51 error=10
> LBA=268435455

Perhaps it is specific to Seagate drives with FreeBSD 5. The same drives work
with FreeBSD 4 without a glitch.

regards,
M.


Re: ad0: FAILURE - WRITE_DMA

2004-10-29 Thread Mikhail P.
On Friday 29 October 2004 16:50, Mikhail P. wrote:
> Perhaps it is only Seagate <-> FreeBSD5-related. Same drives, but with
> FreeBSD4 do work well together without a glitch.

Actually, not only Seagates.. something similar happened to me on a 200GB
Western Digital drive with FreeBSD-5.3.

regards,
M.