Re: Data corruption in cd9660 on FreeBSD 4.11?

2005-06-28 Thread Stephen McKay
I haven't finished all the suggested tests, but since I'm taking so long
to do so, I thought I should send what I have so far.

On Saturday, 25th June 2005, Peter Jeremy wrote:

> On Fri, 2005-Jun-24 22:31:06 +1000, Stephen McKay wrote:
>> I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11.
>> ...
>> So, can anyone suggest any more tests I could try?  Or is there a kind of
>> hardware fault that could cause this substitution of whole blocks read from
>> CDs without causing any other problems?

> You might like to post the relevant sections of a verbose boot - the
> ATA and CD probes.

I've appended it to this message, so that the flow is not ruined.

Note that I am not currently using ATAPI-CAM for my tests.  I am using
/dev/acd0a and /dev/acd1a to mount the CDs in the DVD-ROM and DVD-R
respectively.  Also the non-ATA66 cable thing is true; it is a plain
ATA33 cable.

> Are you running the CD/DVD drives in PIO or UDMA modes?

I normally run both DVD drives at UDMA33.  My test runs normally fail
every 2nd or 3rd run.  I've seen it do 5 OK runs in a row once though,
so I don't yet have a very good test.

I tested with PIO4 and ran 12 consecutive test runs without error.  It was
a little slower at 150 seconds per run instead of the normal 135, possibly
because 75% to 80% of the cpu was dedicated to interrupt handling (doing
pio, I assume).

It seems that either DMA or ATA interrupts (or maybe both) are required
to cause the problem.

Also, I tried some tests with the noclusterr mount option on the CD.  The
test ran much slower (approx 232 seconds instead of 135) but I also saw
no failures (with only 6 test runs though as I was pressed for time).
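
To put the small run counts in perspective, here is a rough calculation in Python. It assumes each test run fails independently with a fixed probability (taken from the "every 2nd or 3rd run" estimate above), which is a simplification; the function name is made up for illustration.

```python
def p_all_clean(p_fail: float, runs: int) -> float:
    """Chance of seeing `runs` consecutive clean runs purely by luck,
    assuming every run fails independently with probability p_fail."""
    return (1.0 - p_fail) ** runs


# Failure rates matching "fails every 2nd or 3rd run".
for p_fail in (1 / 2, 1 / 3):
    for runs in (6, 12):
        print(f"p_fail={p_fail:.2f} runs={runs:2d} "
              f"p(all clean)={p_all_clean(p_fail, runs):.4f}")
```

Under these assumptions, 12 clean PIO runs are hard to explain by luck (under 1% even at a one-in-three failure rate), while 6 clean runs at that rate would still happen by chance almost 9% of the time.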

The noclusterr option is interesting because it defeats read clustering
resulting in the ATA driver issuing only 2K reads instead of up to 64K
at a time.  I assume that the 64K reads would require scatter-gather DMA,
so maybe this is relevant to the problem.  Oddly, I noticed that a fixed
value of 65534 is found in atapi-all.c as a request size limit.  No, not
65536 = 2^16, but 2 bytes less.  Puzzling.
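
One possible reading of the 65534 figure, offered as an assumption rather than anything confirmed in this thread: the ATAPI byte count limit is a 16-bit field, and data moves in 16-bit words, so the largest usable even value would be 0xFFFE. The arithmetic, including how many whole 2K sectors fit under that limit, can be checked in Python:

```python
SECTOR = 2048    # CD/DVD data sector size in bytes
LIMIT = 65534    # request size limit seen in atapi-all.c

max_16bit = (1 << 16) - 1          # largest value a 16-bit field can hold
largest_even = max_16bit & ~1      # clear the low bit to get an even count
whole_sectors = LIMIT // SECTOR    # complete 2K sectors under the limit

print(hex(largest_even), largest_even == LIMIT)
print(whole_sectors, whole_sectors * SECTOR)
```

If that reading is right, a single request can carry at most 31 whole sectors (63488 bytes), one sector short of a full 64K cluster.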

> Have you tried anything other than ISO9660 filesystems on a physical CD?

I have not tried anything but cd9660 file systems on CDs and DVDs.  I will
see if I can build a UFS file system to test with, when I get a chance.

> What happens if you just dd the CD-ROM?

When I dd the CD-ROM it seems to work correctly.  I have done this only
infrequently however, so I may just be lucky to not have had a failure.

I've now done 6 consecutive dd reads of my test CD-ROM in UDMA33 mode with
no errors.  It only takes 125 seconds, so it's a bit faster than comparing
directory trees.  Only 6 tests isn't many, so I'll do more later, this time
with other system activity.

> What happens if you use a vnode
> mount (see vnconfig(8)) of an ISO filesystem sitting in a UFS filesystem?

I'll test this when I get a chance.

> Anything unusual in your kernel config file?

Nothing too unusual.  I'm running a uni-processor kernel with HTT disabled.
I skimmed through my config and this is the only interesting thing: HZ=500
I don't think that's too dangerous.  On the other hand, it does increase
the rate of interrupts, and if there's a race somewhere, it may make it
worse.

> Have you tried building a kernel with WITNESS and/or DIAGNOSTIC?

I'm now running with INVARIANTS, INVARIANT_SUPPORT, and DIAGNOSTIC on
4.11.  No change in the failure rate and no significant slowdown either.

> Any chance of you repeating the tests with a 5.x system?  Maybe
> on a spare small partition or using a 5.4-RELEASE disk1 as a live
> filesystem.

I was experimenting with current in late April, so I installed that drive
for testing.  So far, I have not been able to reproduce the failure on
April's current though I've only had time for a quick run of 6 repetitions.

Current takes the same time (135 seconds, on average) to read and compare
the CD.  That seems good, considering all the debugging is still enabled.
I'm pretty sure that ATA MK III is in this kernel.

Sadly, it panics immediately if I run atacontrol mode 1 so I'm just
assuming it is running in DMA mode by the speed of it.  (And I have
hw.ata.atapi_dma=1 in /boot/loader.conf).

That's where I'm up to so far in stress testing.  Right now I'm trying to
understand some unusual looking code in ata_dmasetupd_cb() in 4.11's
ata-dma.c.  The attached comment is "A maximum segment size was specified
for bus_dma_tag_create, but some busdma code does not seem to honor this,
so fix up if needed."  The fix-up code seems to be gone in current, so
it looks suspicious to me.  When I work out what it does, I'll report back.

Stephen.

--

Verbose boot of 4.11-p10 (the ata related parts, at least):

atapci0: Intel ICH5 ATA100 controller port 0xfc00-0xfc0f,0-0x3,0-0x7,0-0x3,0-0x7 irq 0 at device 31.1 on pci0
ata0: iobase=0x01f0 altiobase=0x03f6 bmaddr=0xfc00
ata0: mask=03 ostat0=50 ostat2=00
ata0-master: ATAPI 00 00

Data corruption in cd9660 on FreeBSD 4.11?

2005-06-24 Thread Stephen McKay
Hi!

I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11.
My best theory so far is that cd9660 or perhaps the VFS layer is mishandling
2048 byte buffers (since they are smaller than one virtual memory page),
occasionally writing them to the wrong location in RAM.  Read on for why
I think so.

First up, I don't think this is the usual hardware problem since the machine
has done huge numbers of buildworlds (in 4.x and -current) without any of
the telltale signs (eg bus errors and segmentation violations).  There are
no error messages in /var/log/messages.  Also, it moonlights as a games
machine and plays Doom 3, Battlefield 1942, Neverwinter Nights and so forth
like a champ.  Memory, cpu, video, disk, networking are all just fine 100%
of the time.

The hardware is an ASUS P4P800 mobo (including onboard Marvell Yukon gigabit
ethernet) with a P4 2.8GHz cpu, 1GB RAM, Maxtor 120GB disk, Pioneer 103S
DVD-ROM, LiteOn SOHW-1673S DVD burner in an Antec Sonata case.

Now that I have a DVD burner, I make backups of my main machines (over NFS)
but have found that they often don't verify as 100% correct.  The symptom
is that, for some files, an entire 2048-byte DVD sector is replaced with
different (non-zero) data.  This occurs both when reading with the Pioneer
DVD-ROM and when reading with the LiteOn burner (though I don't test with
the Pioneer much as it is slower).

I emphasise that all burns have been 100% correct (ie the burning process
worked and this can be verified by reading on, say, my iBook), so all of
the hardware seems to be operating correctly (and swiftly, I might add).
The problem is that reading the iso9660 file system is not safe.

After some experimenting, I've found that the problem also occurs when
reading CDs, and I built a test CD (of photos of a recent wedding) and in
testing I read this CD over and over.  I compare the CD with the original
files (via NFS) using diff.  When diff finds a difference, I save copies
of the differing files before they can be flushed from the cache.
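
Once a differing pair of files has been saved, the damaged sectors can be pinned down by comparing the two copies in 2048-byte steps. The following Python sketch shows the idea on in-memory data; the function name is hypothetical, and this is not the tooling actually used in these tests.

```python
BLOCK = 2048  # CD/DVD data sector size in bytes


def differing_blocks(good: bytes, suspect: bytes) -> list:
    """Return the indices of the 2048-byte blocks where the two copies differ."""
    length = max(len(good), len(suspect))
    return [
        offset // BLOCK
        for offset in range(0, length, BLOCK)
        if good[offset:offset + BLOCK] != suspect[offset:offset + BLOCK]
    ]


# Synthetic example: a 4-block file whose third block was replaced.
good = b"".join(bytes([i]) * BLOCK for i in range(4))
suspect = good[:2 * BLOCK] + b"\xff" * BLOCK + good[3 * BLOCK:]
print(differing_blocks(good, suspect))
```

For real files, the two byte strings would come from reading the NFS original and the saved copy from the CD.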

I have calculated checksums for all 2048-byte blocks on the CD, so I can know
if any given block of 2048 bytes came from the CD and if so which file it
came from.  In all cases so far, the 2048 byte error has been a block from
another file, not a random corruption.
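
The block lookup described above can be sketched in Python as follows. This is an illustration only: it assumes the disc contents are available as a raw image in memory, SHA-1 is an arbitrary choice of checksum, and the function names are made up. Mapping a sector number back to a file name would additionally need the ISO 9660 directory data, which is left out here.

```python
import hashlib

BLOCK = 2048  # ISO 9660 logical sector size in bytes


def index_blocks(image: bytes) -> dict:
    """Map the digest of each 2048-byte block to the sector numbers holding it."""
    index = {}
    for offset in range(0, len(image), BLOCK):
        digest = hashlib.sha1(image[offset:offset + BLOCK]).hexdigest()
        index.setdefault(digest, []).append(offset // BLOCK)
    return index


def locate_block(index: dict, block: bytes) -> list:
    """Return the sectors whose contents match `block` (empty list if none)."""
    return index.get(hashlib.sha1(block).hexdigest(), [])


# Tiny synthetic "image" with three distinct sectors.
image = b"A" * BLOCK + b"B" * BLOCK + b"C" * BLOCK
index = index_blocks(image)
print(locate_block(index, b"B" * BLOCK))  # a block that is on the disc
print(locate_block(index, b"Z" * BLOCK))  # a block that is not
```

A bad block that resolves to a sector on the disc came from another part of the CD; one that resolves to nothing is corruption from some other source.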

I am starting to believe that, under high load, the cd9660 file system
code tells the ata driver to put a 2K block in the wrong spot in memory,
leaving some old junk in the gap in the file being read, and blasting some
other 2K block of memory.  It may not be cd9660 code per se that is wrong,
but a bug in the complex buffer handling code (getblk, getnewbuf, allocbuf,
etc).

Why do I believe it is writing to the wrong memory, rather than any number
of other flaws?  In two runs (out of many), unusual things occurred that
are consistent with memory being overwritten, rather than, say, a 2K block
just not being read at all: In one, an innocent sshd core-dumped (which
is something that has never happened except when running my cd9660 tests),
and in another, a previously OK cached NFS file became corrupted.

Explaining that last case further: I had been running a test script that
would mount the CD, compare files, unmount the CD, and repeat.  This meant
that the NFS copy of the files was read over and over and hence became
memory resident (there being enough space in 1GB of RAM for one copy of
the files, plus my normal programs).  Several tests passed without fault
(hence all the NFS files were cached and correct), when suddenly there
were multiple corruptions; call them file A and file B.  File A was the
usual corruption where a 2K block of another file was unexpectedly present
in the copy read from the CD, but in file B it was the NFS file that was
wrong.  In fact it contained the missing block from file A!  In short, the
fully memory resident NFS file B had been corrupted by reading file A from
the CD.

It's been pretty interesting hunting this problem, but now I'm sort of
stuck.  I believe that some 2K reads from DVDs and CDs end up in the wrong
place in RAM, but I can't find where this happens in the code (it's pretty
hard to work out just by reading it), and I can't rule out the possibility
that there's a hardware error here that I've just never run across before.

So, can anyone suggest any more tests I could try?  Or is there a kind of
hardware fault that could cause this substitution of whole blocks read from
CDs without causing any other problems?

And does anyone know of any commits made anywhere in the 5 years since
4.x split off from 5.x that may be relevant?  Yep.  5 years.  I have
started looking, but there's a fair bit of stuff in there...

Stephen.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Data corruption in cd9660 on FreeBSD 4.11?

2005-06-24 Thread Peter Jeremy
On Fri, 2005-Jun-24 22:31:06 +1000, Stephen McKay wrote:
> I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11.
> ...
> So, can anyone suggest any more tests I could try?  Or is there a kind of
> hardware fault that could cause this substitution of whole blocks read from
> CDs without causing any other problems?

You might like to post the relevant sections of a verbose boot - the
ATA and CD probes.

Are you running the CD/DVD drives in PIO or UDMA modes?  In the former,
the CPU is reading the data from the CD and writing it to memory.  In
the latter, the CPU tells the disk controller where to write.  It could
be instructive to change modes and see what happens.

Have you tried anything other than ISO9660 filesystems on a physical CD?
What happens if you just dd the CD-ROM?  What happens if you use a vnode
mount (see vnconfig(8)) of an ISO filesystem sitting in a UFS filesystem?

Anything unusual in your kernel config file?

Have you tried building a kernel with WITNESS and/or DIAGNOSTIC?

Any chance of you repeating the tests with a 5.x system?  Maybe
on a spare small partition or using a 5.4-RELEASE disk1 as a live
filesystem.

-- 
Peter Jeremy