This message is from the T13 list server.
Embedded response below:

On 3/21/02, Jim McGrath <[EMAIL PROTECTED]> wrote:

>---------------- Begin Message ----------------
>Date: 3/21/02 5:08 PM
>Received: 3/21/02 5:49 PM
>From: McGrath, Jim, [EMAIL PROTECTED]
>To: 'Harlan Andrews', [EMAIL PROTECTED]
>    McGrath, Jim, [EMAIL PROTECTED]
>    '[EMAIL PROTECTED]', [EMAIL PROTECTED]
>    [EMAIL PROTECTED]
>
>Harlan,
>
>I agree (as I stated in the section of my message that you did not quote).

I quoted the entire message at the bottom.

>However, that is a result of the drive error rate specification. If you are
>not careful you can return data that is in error without an error status
>(what we call a "buffer miscompare").

That is EXACTLY what I'm looking for. By introducing a small number of
EXTRA error bits I simulate a little EXTRA noise. If that is enough to
cause a "buffer miscompare" then I get worried.

Remember, I had to add this test to help troubleshoot a REAL problem on
a REAL drive which returned "buffer miscompares" because of an error in
implementing the ECC algorithm. The drive returned corrupt data even
without any extra noise. However, it took a long time and a lot of
drives to detect them. With the ECC testing, we were able to
demonstrate the problem much more quickly.

>These actually will occur (it is the drive miscorrection error
>specification), but the rate is specified by vendors to be so low that you
>should never see it under normal use.

There had better be some margin. How do the drive suppliers know what
the miscorrection rate will be if they don't know the raw error rate?
This becomes EXTREMELY important with "ECC on the Fly" since many errors
are being routinely corrected. The drive will not know when the raw
error rate gets bad enough that the probability of miscorrection becomes
too high.

>However, none of this has to do with the ATA standard per se.
>The ATA standard is entirely silent (as far as I can see) on the topic of
>defect management, and auto reallocation in particular.

I agree. I wish it were in the standard. It took us THREE generations
of drives to get the suppliers to implement "Transparent
Auto-Relocation". Now, everyone implements it and most hosts require it.

>Indeed, you don't even
>need to do defect management to be ATA compliant (some early ATA drives
>relied on the host to handle defects).

At least those drives ALLOWED the host to relocate bad blocks and to see
how many bad blocks were relocated ;-( That was the rub. The suppliers
did not want their customers to have visibility into the relocated
blocks.

>So you should not start inducing errors via WRITE LONGs and assume the drive
>will somehow sort it all out - at least not for a drive that just obeys the
>normal error rate and ATA standards.

Again, I'm simply looking to see if there is any margin. Correctable
errors happen all the time. What I'm trying to determine is whether
there is any margin left in the "probability of miscorrection".

>Of course a specific product may work
>fine in this case, and you could always specify this behavior in a purchase
>specification (indeed, some customers do put defect management constraints
>into their specifications). But absent that, the ATA standard as written
>does not insure that it will work properly.

Specs about error rates and relocations are left to the suppliers.
However, it would be nice if the ATA committee documented some
legitimate method of testing for error rates "In System". It would be
very nice to be able to compare the error rate performance of a given
drive "outside the host system" with its performance "inside the host
system".

>Running out of spares is actually the least of the worries. Suppose you
>corrupt a lot of sectors, and then read them back (triggering errors)? You
>could trip all sorts of internal (and external) signals in the drive causing
>side effects.
>SMART triggers have been pointed out as one (a READ of a
>sector that was corrupted with a WRITE LONG MUST be logged as an error,
>since the READ reported an error - 8.51.6.8.2.4 of ATA-6). Another could be
>lowering drive performance (i.e. we could try and slow things down in an
>attempt to reduce the number of "excessive" errors we are seeing).
>Basically the drive thinks it's failing, and so may end up doing a number of
>otherwise undesirable things in order to "save" the data.

This is paranoid nonsense. I have been writing bad ECC to EVERY block on
the drive without any of the bad consequences you speculate about.

>This is especially dangerous since a lot of the drive READ/WRITE LONG
>implementations have probably been static for a long time, and drives acting
>smarter in data reliability issues is more recent.

If bad things happen due to a few extra bits of error on a given block
then they will also most likely happen in a noisier environment WITHOUT
writing any extra bad bits. I'm simply trying to simulate the stricter
environment to make sure that some margin remains without data
corruption.

>If you are using READ LONG/WRITE LONG in a controlled testing environment,
>then this is probably not an issue. But using it for a field feature is
>dangerous if you just rely on the ATA standard.

The ECC testing is for drive qualification only. We do not routinely run
it in the field. Also, we never leave the bad ECC on the drive when it
ships.

The point is, ReadLong and WriteLong are STILL in use at several
companies in spite of the fact that they were "obsoleted" from the ATA
standard. The main reason for having an ATA Standard is to document the
existing features. Vendor Unique features are bad news. If a given
function is useful, it should be documented so that drives which choose
to offer that feature can do it in a consistent manner.
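
The margin test described above (add a few extra error bits, then see
whether the read comes back corrected, flagged, or silently wrong) can
be illustrated with a toy model. This is only a sketch under loud
assumptions: it uses an 8-bit extended Hamming code protecting 4 data
bits, not the much stronger sector ECC a real drive uses, and it issues
no READ LONG or WRITE LONG commands. The names encode, decode and
margin_test are hypothetical.

```python
from itertools import combinations

# Toy stand-in for sector ECC (assumption): extended Hamming (8,4).
# Positions 1, 2, 4 hold parity, position 0 holds overall parity.
DATA_POS = (3, 5, 6, 7)


def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of bits)."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POS):
        word[pos] = (nibble >> i) & 1
    syn = 0
    for pos in DATA_POS:
        if word[pos]:
            syn ^= pos
    for p in (1, 2, 4):
        word[p] = 1 if syn & p else 0   # force the syndrome to zero
    word[0] = sum(word[1:]) & 1         # make overall parity even
    return word


def decode(word):
    """Return (data, status); data is None when the error is reported."""
    word = list(word)
    syn = 0
    for pos in range(1, 8):
        if word[pos]:
            syn ^= pos
    if sum(word) & 1:
        # Odd parity: the decoder ASSUMES a single flipped bit and fixes it.
        word[syn] ^= 1                  # syn == 0 means the parity bit itself
        status = "corrected"
    elif syn:
        return None, "uncorrectable"    # double error detected and reported
    else:
        status = "ok"
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return data, status


def margin_test(nibble, extra_errors):
    """Try every pattern of `extra_errors` bit flips and tally the outcomes."""
    clean = encode(nibble)
    tally = {"good": 0, "error_reported": 0, "miscompare": 0}
    for flips in combinations(range(8), extra_errors):
        word = list(clean)
        for pos in flips:
            word[pos] ^= 1
        data, status = decode(word)
        if status == "uncorrectable":
            tally["error_reported"] += 1
        elif data == nibble:
            tally["good"] += 1
        else:
            tally["miscompare"] += 1    # wrong data returned with good status
    return tally
```

With this toy code, margin_test(0b1011, 1) reports every pattern as
good, two flipped bits are always reported as errors, and three flipped
bits are always miscorrected into wrong data with good status. The
margin in this model is the gap between the error count the code
corrects and the count at which it starts returning bad data silently.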
>
>Jim
>
>-----Original Message-----
>From: Harlan Andrews [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, March 21, 2002 4:40 PM
>To: McGrath, Jim; '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
>Subject: RE: [t13] RAID and R/W LONG
>
>>To my knowledge once a drive decides to reallocate, that is a non reversible
>>decision - you just used up a spare sector on the drive. Do that often
>>enough and the drive will fail (there are a limited number of spares).
>
>Jim,
>
>I repeat:
>
>Auto-relocation MUST not take place until valid data is available.
>The non-recovered error should go into the "Pending" list (waiting for a
>write or a recovered read). Then, when the write occurs, the sector
>from the "Pending" list should be tested first before re-assignment.
>WriteLong should NEVER cause re-assignment.
>
>When a "Pending" entry becomes available, there is a TEST of that block
>BEFORE relocation. This prevents the relocation of "good" media.
>
>WriteLong should NEVER cause re-assignment. WriteLong does NOT waste
>spare blocks.
>
>...Harlan
>
>---------------- Begin Forwarded Message ----------------
>Date: 3/21/02 3:06 PM
>Received: 3/21/02 4:05 PM
>From: McGrath, Jim, [EMAIL PROTECTED]
>To: '[EMAIL PROTECTED]', [EMAIL PROTECTED]
>    [EMAIL PROTECTED]
>
>Raymond,
>
>You don't understand how auto reallocate works. It has nothing to do with
>error reporting.
>
>When a drive thinks that the media in question is suspect, it "auto
>reallocates" the data to another portion of media. If the data was
>readable, then the data is moved at that point. If not, then the drive
>remembers that the media is suspect and writes the data to the new section
>of media when it gets the next write command.
>
>The drive's decision may be correlated to reporting an error to the host,
>but may not be.
>As an example, a drive could be performing a background scan of
>the media during idle time, run into that sector, and at that time
>determine that the media is suspect. The key is that none of this is
>standardized.
>
>To my knowledge once a drive decides to reallocate, that is a non reversible
>decision - you just used up a spare sector on the drive. Do that often
>enough and the drive will fail (there are a limited number of spares).
>
>Jim
>
>-----Original Message-----
>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, March 21, 2002 2:12 PM
>To: [EMAIL PROTECTED]
>Subject: RE: [t13] RAID and R/W LONG
>
>Logically, the drive should not auto-reallocate when it encounters a read
>error; otherwise, the host might read junk data and get "good status"
>back. It is not desirable but acceptable to get a read error (that is why
>people use RAID to prevent that), but it is not acceptable that the drive
>output the wrong data and tell the host it is good. This is data corruption
>(instead of data error).
>
>Raymond Liu
>
>-----Original Message-----
>From: McGrath, Jim [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, March 21, 2002 1:40 PM
>To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
>Subject: RE: [t13] RAID and R/W LONG
>
>The issue on auto reallocation may be that some implementations would auto
>reallocate on the subsequent READ of the sector. The drive has no way of
>knowing that this is a "good" sector that you artificially forced an error
>into. In general the details of auto reallocation policy are all vendor
>specific.
>
>Jim
>
>-----Original Message-----
>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, March 21, 2002 12:13 PM
>To: [EMAIL PROTECTED]
>Subject: RE: [t13] RAID and R/W LONG
>
>Creating a false uncorrectable error is only done at the very beginning of
>using the drive as a RAID1 rebuild target drive (and only if necessary, i.e.
>only when the source drive has reported an unrecoverable data block). It
>might affect the statistical data the drive collected a little bit (only the
>drive guys can answer this). Auto-relocation should not be affected because
>this is not a normal write error.
>
>Raymond Liu
>
>-----Original Message-----
>From: Hale Landis [mailto:[EMAIL PROTECTED]]
>Sent: Thursday, March 21, 2002 10:02 AM
>To: T13 List Server
>Subject: [t13] RAID and R/W LONG
>
>On Thu, 21 Mar 2002 09:18:13 -0800, [EMAIL PROTECTED] wrote:
>>[...] you might implement
>>vendor specific commands to "address" that
>>(which will keep the R/W Long
>>still formally in "obsolete" state)?
>
>Raymond, I think I asked a few days ago, but could you explain in
>detail why/how you are using R/W LONG? Do you expect the command to
>actually be passed to a drive behind a RAID controller or is the
>command executed directly and only by the RAID controller? If the
>command is used to create a false uncorrectable error on a real
>drive, how do you then adjust for the possible effects on the drive's
>SMART data or the drive's auto-relocation function?
>
>*** Hale Landis *** www.ata-atapi.com ***
>
>----------------- End Message -----------------
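
The "Pending" list policy quoted earlier in this thread (an unrecovered
read only marks the block as pending, the block is tested on the next
write, and a spare is used only if that test fails) can be sketched as
a toy model. Everything here is an assumption for illustration: the
ToyDrive class, its method names and the bad_media test hook are
hypothetical, and as the thread itself notes, real reallocation policy
is vendor specific.

```python
class ToyDrive:
    """Hypothetical model of the pending-list policy, not real firmware."""

    def __init__(self, spares=2):
        self.spares = spares
        self.pending = set()     # blocks with an unrecovered read error
        self.remapped = set()    # blocks that now live on a spare sector
        self.bad_media = set()   # test hook: blocks whose media is really bad
        self.forced_ecc = set()  # blocks holding deliberately bad ECC

    def write_long(self, lba):
        # Only plants bad ECC; it never touches the pending list or spares.
        self.forced_ecc.add(lba)

    def read(self, lba):
        media_bad = lba in self.bad_media and lba not in self.remapped
        if lba in self.forced_ecc or media_bad:
            self.pending.add(lba)   # remember the block, do not relocate yet
            return "error"
        return "ok"

    def write(self, lba):
        self.forced_ecc.discard(lba)    # new data gets freshly computed ECC
        if lba in self.pending:
            self.pending.discard(lba)
            # Test the block BEFORE relocating; only bad media costs a spare.
            if lba in self.bad_media and lba not in self.remapped:
                if self.spares == 0:
                    raise RuntimeError("out of spare sectors")
                self.spares -= 1
                self.remapped.add(lba)
        return "ok"
```

In this model a block corrupted with write_long fails its next read and
goes pending, but the following normal write finds the media good and
leaves the spare count unchanged. A block listed in bad_media follows
the same path and does consume one spare.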
