Re: Data corruption in cd9660 on FreeBSD 4.11?
I haven't finished all the suggested tests, but since I'm taking so long to do so, I thought I should send what I have so far.

On Saturday, 25th June 2005, Peter Jeremy wrote:

> On Fri, 2005-Jun-24 22:31:06 +1000, Stephen McKay wrote:
>> I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11.
>> ...
>> So, can anyone suggest any more tests I could try? Or is there a kind of hardware fault that could cause this substitution of whole blocks read from CDs without causing any other problems?
>
> You might like to post the relevant sections of a verbose boot - the ATA and CD probes.

I've appended it to this message, so that the flow is not ruined. Note that I am not currently using ATAPI-CAM for my tests. I am using /dev/acd0a and /dev/acd1a to mount the CDs in the DVD-ROM and DVD-R respectively. Also, the non-ATA66 cable thing is true; it is a plain ATA33 cable.

> Are you running the CD/DVD drives in PIO or UDMA modes?

I normally run both DVD drives at UDMA33. My test runs normally fail every 2nd or 3rd run. I've seen it do 5 OK runs in a row once though, so I don't yet have a very good test.

I tested with PIO4 and ran 12 consecutive test runs without error. It was a little slower at 150 seconds per run instead of the normal 135, possibly because 75% to 80% of the cpu was dedicated to interrupt handling (doing pio, I assume). It seems that either DMA or ATA interrupts (or maybe both) are required to cause the problem.

Also, I tried some tests with the noclusterr mount option on the CD. The test ran much slower (approx 232 seconds instead of 135) but I also saw no failures (with only 6 test runs though, as I was pressed for time). The noclusterr option is interesting because it defeats read clustering, resulting in the ATA driver issuing only 2K reads instead of up to 64K at a time. I assume that the 64K reads would require scatter-gather DMA, so maybe this is relevant to the problem.
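To make the 2K versus 64K difference concrete, here is a rough sketch of how many memory pages a single read buffer can touch. It assumes the 4096-byte page size of i386 and that each page touched may need its own scatter-gather entry because physically contiguous pages are not guaranteed. Both are assumptions of the sketch, not statements about what the 4.11 ATA driver actually does, and `pages_spanned` is a hypothetical helper.

```python
PAGE = 4096    # assumed i386 page size
SECTOR = 2048  # CD/DVD sector size


def pages_spanned(start: int, length: int) -> int:
    """Count the PAGE-sized pages touched by a buffer of `length` bytes at address `start`."""
    first = start // PAGE
    last = (start + length - 1) // PAGE
    return last - first + 1


if __name__ == "__main__":
    for length in (2 * 1024, 64 * 1024):
        for start in (0, SECTOR):
            print(
                f"{length:6d} byte read at offset {start:4d}: "
                f"{length // SECTOR:2d} sectors, "
                f"{pages_spanned(start, length):2d} pages"
            )
```

Under these assumptions a sector-aligned 2K read always fits inside one page, while a 64K read touches 16 or 17 pages depending on alignment, so only the clustered case would need a multi-entry transfer list.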
Oddly, I noticed that a fixed value of 65534 is found in atapi-all.c as a request size limit. No, not 65536 = 2^16, but 2 bytes less. Puzzling.

> Have you tried anything other than ISO9660 filesystems on a physical CD?

I have not tried anything but cd9660 file systems on CDs and DVDs. I will see if I can build a UFS file system to test with, when I get a chance.

> What happens if you just dd the CD-ROM?

When I dd the CD-ROM it seems to work correctly. I have done this only infrequently however, so I may just be lucky to not have had a failure.

I've now done 6 consecutive dd reads of my test CD-ROM in UDMA33 mode with no errors. It only takes 125 seconds, so it's a bit faster than comparing directory trees. Only 6 tests isn't many, so I'll do more later, this time with other system activity.

> What happens if you use a vnode mount (see vnconfig(8)) of an ISO filesystem sitting in a UFS filesystem?

I'll test this when I get a chance.

> Anything unusual in your kernel config file?

Nothing too unusual. I'm running a uni-processor kernel with HTT disabled. I skimmed through my config and this is the only interesting thing:

    HZ=500

I don't think that's too dangerous. On the other hand, it does increase the rate of interrupts, and if there's a race somewhere, it may make it worse.

> Have you tried building a kernel with WITNESS and/or DIAGNOSTIC?

I'm now running with INVARIANTS, INVARIANT_SUPPORT, and DIAGNOSTIC on 4.11. No change in the failure rate and no significant slowdown either.

> Any chance of you repeating the tests with a 5.x system? Maybe on a spare small partition or using a 5.4-RELEASE disk1 as a live filesystem.

I was experimenting with current in late April, so I installed that drive for testing. So far, I have not been able to reproduce the failure on April's current, though I've only had time for a quick run of 6 repetitions. Current takes the same time (135 seconds, on average) to read and compare the CD.
That seems good, considering all the debugging is still enabled. I'm pretty sure that ATA MK III is in this kernel. Sadly, it panics immediately if I run atacontrol mode 1, so I'm just assuming it is running in DMA mode by the speed of it. (And I have hw.ata.atapi_dma=1 in /boot/loader.conf.)

That's where I'm up to so far in stress testing.

Right now I'm trying to understand some unusual looking code in ata_dmasetupd_cb() in 4.11's ata-dma.c. The attached comment is:

    A maximum segment size was specified for bus_dma_tag_create, but
    some busdma code does not seem to honor this, so fix up if needed.

The fix-up code seems to be gone in current, so it looks suspicious to me. When I work out what it does, I'll report back.

Stephen.

--
Verbose boot of 4.11-p10 (the ata related parts, at least):

atapci0: Intel ICH5 ATA100 controller port 0xfc00-0xfc0f,0-0x3,0-0x7,0-0x3,0-0x7 irq 0 at device 31.1 on pci0
ata0: iobase=0x01f0 altiobase=0x03f6 bmaddr=0xfc00
ata0: mask=03 ostat0=50 ostat2=00
ata0-master: ATAPI 00 00
Data corruption in cd9660 on FreeBSD 4.11?
Hi!

I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11. My best theory so far is that cd9660 or perhaps the VFS layer is mishandling 2048 byte buffers (since they are smaller than one virtual memory page), occasionally writing them to the wrong location in RAM. Read on for why I think so.

First up, I don't think this is the usual hardware problem, since the machine has done huge numbers of buildworlds (in 4.x and -current) without any of the telltale signs (eg bus errors and segmentation violations). There are no error messages in /var/log/messages. Also, it moonlights as a games machine and plays Doom 3, Battlefield 1942, Neverwinter Nights and so forth like a champ. Memory, cpu, video, disk, networking are all just fine 100% of the time.

The hardware is an ASUS P4P800 mobo (including onboard Marvell Yukon gigabit ethernet) with a P4 2.8GHz cpu, 1GB RAM, Maxtor 120GB disk, Pioneer 103S DVD-ROM, and LiteOn SOHW-1673S DVD burner in an Antec Sonata case.

Now that I have a DVD burner, I make backups of my main machines (over NFS) but have found that they often don't verify as 100% correct. The symptom is that, for some files, an entire 2048 byte DVD sector is replaced with different (non-zero) data. This occurs both when reading with the Pioneer DVD-ROM and when reading with the LiteOn burner (though I don't test with the Pioneer much as it is slower).

I emphasise that all burns have been 100% correct (ie the burning process worked and this can be verified by reading on, say, my iBook), so all of the hardware seems to be operating correctly (and swiftly, I might add). The problem is that reading the iso9660 file system is not safe.

After some experimenting, I've found that the problem also occurs when reading CDs, so I built a test CD (of photos of a recent wedding) and in testing I read this CD over and over. I compare the CD with the original files (via NFS) using diff.
When diff finds a difference, I save copies of the differing files before they can be flushed from the cache. I have calculated checksums for all 2048 byte blocks on the CD, so I can know if any given block of 2048 bytes came from the CD and if so which file it came from. In all cases so far, the 2048 byte error has been a block from another file, not a random corruption.

I am starting to believe that, under high load, the cd9660 file system code tells the ata driver to put a 2K block in the wrong spot in memory, leaving some old junk in the gap in the file being read, and blasting some other 2K block of memory. It may not be cd9660 code per se that is wrong, but a bug in the complex buffer handling code (getblk, getnewbuf, allocbuf, etc).

Why do I believe it is writing to the wrong memory, rather than any number of other flaws? In two runs (out of many), unusual things occurred that are consistent with memory being overwritten, rather than, say, a 2K block just not being read at all: in one, an innocent sshd core-dumped (which is something that has never happened except when running my cd9660 tests), and in another, a previously OK cached NFS file became corrupted.

Explaining that last case further: I had been running a test script that would mount the CD, compare files, unmount the CD, and repeat. This meant that the NFS copy of the files was read over and over and hence became memory resident (there being enough space in 1GB of RAM for one copy of the files, plus my normal programs). Several tests passed without fault (hence all the NFS files were cached and correct), when suddenly there were multiple corruptions; call them file A and file B. File A was the usual corruption where a 2K block of another file was unexpectedly present in the copy read from the CD, but in file B it was the NFS file that was wrong. In fact it contained the missing block from file A! In short, the fully memory resident NFS file B had been corrupted by reading file A from the CD.
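The per-block checksum lookup described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: it assumes MD5 as the checksum, builds the index from the known-good original files rather than from the raw disc, and zero-pads a short final block. The names `build_index` and `find_foreign_sectors` are hypothetical.

```python
import hashlib

SECTOR = 2048  # CD/DVD sector size in bytes


def sector_digests(data: bytes):
    """Yield (sector number, MD5 hex digest) for each 2048 byte block of data."""
    for offset in range(0, len(data), SECTOR):
        # Assumption: a short final block is zero-padded to a full sector.
        block = data[offset:offset + SECTOR].ljust(SECTOR, b"\0")
        yield offset // SECTOR, hashlib.md5(block).hexdigest()


def build_index(files: dict) -> dict:
    """Map each block digest to the (file name, sector number) pairs that contain it."""
    index = {}
    for name, data in files.items():
        for number, digest in sector_digests(data):
            index.setdefault(digest, []).append((name, number))
    return index


def find_foreign_sectors(good: bytes, suspect: bytes, index: dict) -> list:
    """List (sector number, possible sources) for every block where suspect differs from good."""
    good_digests = dict(sector_digests(good))
    findings = []
    for number, digest in sector_digests(suspect):
        if good_digests.get(number) != digest:
            # An empty source list means the bad block matches no known file.
            findings.append((number, index.get(digest, [])))
    return findings


if __name__ == "__main__":
    # Synthetic stand-ins for two files; real use would read them from disk.
    file_a = bytes([1]) * SECTOR + bytes([2]) * SECTOR + bytes([3]) * SECTOR
    file_b = bytes([9]) * SECTOR + bytes([8]) * SECTOR
    index = build_index({"a.jpg": file_a, "b.jpg": file_b})

    # Simulate the symptom: block 1 of file A replaced by block 0 of file B.
    corrupt_a = file_a[:SECTOR] + file_b[:SECTOR] + file_a[2 * SECTOR:]
    print(find_foreign_sectors(file_a, corrupt_a, index))
```

A bad block that resolves to another file's block is the "block from another file" case; one that resolves to nothing would point at random corruption instead.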
It's been pretty interesting hunting this problem, but now I'm sort of stuck. I believe that some 2K reads from DVDs and CDs end up in the wrong place in RAM, but I can't find where this happens in the code (it's pretty hard to work out just by reading it), and I can't rule out the possibility that there's a hardware error here that I've just never run across before.

So, can anyone suggest any more tests I could try? Or is there a kind of hardware fault that could cause this substitution of whole blocks read from CDs without causing any other problems? And does anyone know of any commits made anywhere in the 5 years since 4.x split off from 5.x that may be relevant? Yep. 5 years. I have started looking, but there's a fair bit of stuff in there...

Stephen.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Data corruption in cd9660 on FreeBSD 4.11?
On Fri, 2005-Jun-24 22:31:06 +1000, Stephen McKay wrote:

> I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11.
> ...
> So, can anyone suggest any more tests I could try? Or is there a kind of hardware fault that could cause this substitution of whole blocks read from CDs without causing any other problems?

You might like to post the relevant sections of a verbose boot - the ATA and CD probes.

Are you running the CD/DVD drives in PIO or UDMA modes? In the former, the CPU is reading the data from the CD and writing it to memory. In the latter, the CPU tells the disk controller where to write. It could be instructive to change modes and see what happens.

Have you tried anything other than ISO9660 filesystems on a physical CD?

What happens if you just dd the CD-ROM?

What happens if you use a vnode mount (see vnconfig(8)) of an ISO filesystem sitting in a UFS filesystem?

Anything unusual in your kernel config file?

Have you tried building a kernel with WITNESS and/or DIAGNOSTIC?

Any chance of you repeating the tests with a 5.x system? Maybe on a spare small partition or using a 5.4-RELEASE disk1 as a live filesystem.

--
Peter Jeremy