On Thu, 24 July 2008, Andy Green wrote:
> Somebody in the thread at some point said:
> | Hi all,
> |
> | I can now reliably reproduce the issue, as dd'ing the mbr back to the
> | card so far restores sane behaviour :
> |
> | If sd_drive is set to "0", then after a resume from "sync && apm -s" the
> | MBR of my 4GB SanDisk is wiped - so far I haven't noticed any other
> | errors, but have not looked very closely.
> ...
> | PS: Can somebody please tell me how to re-initialize the card without
> | going through another suspend/resume cycle ?
> sd_drive setting isn't actually used until next time we access the card,
> so provoking an access will do it, eg, touch /something ; sync.
> But the two explanations for what goes on seem mixed still here, we
> affect sd_drive and we do a suspend.  My guess / hope is that this
> problem is coming from the suspend action alone and the change of
> sd_drive is bogus here.  Maybe you can bang on it a little more trying
> to disprove that hypothesis?
> -Andy

I think this is quite a good clue to what's really happening here, so here is
a quote from OLPC ticket #6532:
 I've spent some time digging deep into the bowels of the VFS and block layer 
and gathering some debug output and have an explanation for the partition 
table corruption: 
 Upon coming out of resume, the SD code, with CONFIG_MMC_UNSAFE_SUSPEND 
enabled, checks to see if there is a card plugged into the system and whether 
that card is the same as the one that was plugged into the system at suspend 
time. This is accomplished by reading the card ID of the device and for some 
reason, very possibly #1339, we fail this detection. In this case, the kernel 
removes the old device from the system and in this execution path, the 
partition information for this device is zeroed. 
 Even though the device is removed, the device is still mounted and upon 
unmount, ext2 syncs the superblock, even if the file system is sync'd 
beforehand. The superblock is block 0 of the partition and the block layer 
adds to this the partition start offset before submitting the write to the 
lower layers. As the partition information has already been zeroed out, we 
end up writing to block 0 of the disk itself, overwriting the partition table 
and the geometry information. I've verified this by both gathering debug 
output and 'dd' + 'hexdump' of corrupted and uncorrupted media. 
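
 The failure mode described above can be illustrated with a toy model. This is
only a sketch and not kernel code: the disk is a plain bytearray, the write is
simplified to a single 512-byte sector, and the names make_disk, has_valid_mbr
and write_partition_sector are hypothetical helpers invented for this
illustration. It shows how a partition-relative write lands on sector 0 of the
disk once the recorded partition start offset has been zeroed.

```python
SECTOR = 512  # bytes per sector


def make_disk(num_sectors, part_start):
    """Build a toy disk image whose MBR describes a single partition."""
    disk = bytearray(num_sectors * SECTOR)
    entry = bytearray(16)
    entry[4] = 0x83  # partition type byte: Linux
    entry[8:12] = part_start.to_bytes(4, "little")  # start LBA
    entry[12:16] = (num_sectors - part_start).to_bytes(4, "little")  # size
    disk[446:462] = entry          # first of the four partition table slots
    disk[510:512] = b"\x55\xaa"    # MBR boot signature
    return disk


def has_valid_mbr(disk):
    """True if the signature is present and the partition table is non-empty."""
    return bytes(disk[510:512]) == b"\x55\xaa" and any(disk[446:510])


def write_partition_sector(disk, part_start, sector, data):
    """Add the partition start offset to a partition-relative sector, then write.

    Returns the absolute sector that was written.
    """
    absolute = part_start + sector
    payload = data[:SECTOR].ljust(SECTOR, b"\x00")
    disk[absolute * SECTOR:(absolute + 1) * SECTOR] = payload
    return absolute


if __name__ == "__main__":
    superblock = b"SUPERBLOCK"  # stand-in for the synced superblock

    # Normal case: the partition start (63 here, an assumed value) is known.
    disk = make_disk(128, 63)
    print(write_partition_sector(disk, 63, 0, superblock), has_valid_mbr(disk))

    # After the device was removed: the start offset has been zeroed.
    disk = make_disk(128, 63)
    print(write_partition_sector(disk, 0, 0, superblock), has_valid_mbr(disk))
```

 In the first case the write goes to sector 63 and the MBR stays intact; in the
second case the same partition-relative write goes to sector 0 and the
partition table and signature are gone.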
 Some interesting points: 
We are able to delete a block device even though it is still mounted. 
Even though the device has been deleted, the write submitted to it does not 
 Note that this is still not 100% reproducible and in certain cases the 
superblock write during unmount does fail with block I/O errors, meaning that 
the queue is properly deleted. As per dilinger's comments on IRC, the VFS has 
lots of refcounts and there is a timing issue/race condition that we're 
hitting. As per #1339, we may be able to add an OLPC-specific hack to wait 
500ms or so upon resume to get around this. I will try this but I don't think 
this is acceptable given our suspend/resume requirements. 
 Something I don't quite understand at the moment is how/when our userland env 
(journal specifically I think?) unmounts the device as I've been testing via 
command-line suspend, mount, and unmount while running in console mode. 
 Next steps: 
Get an understanding of what is happening with our userland and brainstorm 
with cjb about the possibility of simply unmounting the SD device upon 
suspend. There are issues around this as we may have files open and that will 
keep us from suspending. 
Test adding a timeout to the resume path to see if it solves our problem to 
validate that it is indeed something related to our HW. 
Dig into the unmount/write to non-existing bdev some more and discuss this 
upstream if needed. 
 (Adding dilinger to cc:) 
