It did not go well. I tested the new drive with SeaTools and it was fine. Then I made a Clonezilla live CD and booted from it. It stopped on the first read error with a message saying to restart using the rescue option. I did that. After 5 hours it finished without mentioning any errors.
I tried to boot from the old disk (since it was still wired that way). I got dropped into a maintenance shell with fs errors in /dev/sda4, which is the physical volume for all my LVM logical volumes -- /usr, /var, /home and /tmp. It says to run fsck manually.

I decided to try the new drive, so I changed the cables and rebooted. Maintenance shell, again:

/ mounted clean
LVM started
/home fs has errors, run fsck (at this point, I'm afraid to try it)
/var, /usr, and /tmp all say that the superblock cannot be read, or is invalid. Try running
e2fsck -b 8193 <device>
or
e2fsck -b 32768 <device>

Which do I use? How did trying to clone the disk make such a mess of BOTH disks?

Any help getting a working system again will be greatly appreciated.

Marc

On Feb 6, 2017 2:37 PM, "David Christensen" <dpchr...@holgerdanske.com> wrote:

On 02/06/17 13:15, Marc Shapiro wrote:
> I am pasting the result of smartctl -x /dev/sda below as I have no real
> clue what to do with the information, but I have a few questions first.
>
> 1) I have purchased a new, very similar, Seagate 1TB drive and I plan to
> install it and copy the whole system to the new drive.

It sounds like you don't have a backup of the failing 1 TB drive (?). Do you have a file server with ~1 TB of free space? RAID?

Run memtest86+ for 24+ hours to verify that you don't have a memory problem.

Use SeaTools to wipe the new 1 TB drive and run the short and long tests. Stop if anything fails.

> What is the best
> way to do this copy since I don't want to copy bad sectors?

I've done it with 'dd' in the past, but will use 'ddrescue' in the future.

> 2) Once I have verified that the new drive boots

I'd do a fresh install on a 16+ GB SSD (USB flash drives also work). A recovered system disk image is too uncertain.

> and everything is running properly

As I understand it, the drive microcontroller calculates and stores a checksum with every sector (block). That's one way it knows that a block is bad upon reading.
So, when you copy out whatever blocks you can get, you probably won't have errors in those blocks. But files and directories are stored on one or more sectors. Depending upon your file system, fsck may or may not find the missing blocks. When you're done, the destination disk is likely to be missing files and/or directories.

> I am hoping to reformat the old drive. This should
> reallocate the bad sectors IIRC. I then would like to set up a raid
> with both drives (keeping a close eye on the old drive). The
> feasibility of this, I would guess, depends on what the posted smartctl
> information tells someone who knows what to look for.
>
> 3) As I understand it, the above mentioned raid should be safe since,
> even if the old drive deteriorates further, the system can run on just
> the new drive. Is that correct?

Once you've copied out whatever blocks you can get, use SeaTools to wipe the old 1 TB drive and run short and long tests. If all three pass, I might be tempted to re-use the drive. If it fails to wipe and has plaintext, destroy it with a sledge hammer. (Wear safety glasses!) If it wipes but fails the short or long tests, recycle it.

> Here is the smartctl output:
> ...
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED

Interesting, given that the drive failed SeaTools (short test?).

> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
>                                        was completed without error.
>                                        Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 121) The previous self-test completed having
>                                    the read element of the test failed.

Matches SeaTools result.

> Total time to complete Offline
> data collection: ( 600) seconds.
> ...
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR--   117   095   006    -    165391146
>   3 Spin_Up_Time            PO----   095   093   000    -    0
>   4 Start_Stop_Count        -O--CK   100   100   020    -    406
>   5 Reallocated_Sector_Ct   PO--CK   072   072   036    -    1181
>   7 Seek_Error_Rate         POSR--   087   060   030    -    656506200
>   9 Power_On_Hours          -O--CK   048   048   000    -    46195
>  10 Spin_Retry_Count        PO--C-   100   100   097    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   020    -    203
> 183 Runtime_Bad_Block       -O--CK   092   092   000    -    8
> 184 End-to-End_Error        -O--CK   100   100   099    -    0
> 187 Reported_Uncorrect      -O--CK   011   011   000    -    89
> 188 Command_Timeout         -O--CK   100   097   000    -    51540394008
> 189 High_Fly_Writes         -O-RCK   100   100   000    -    0
> 190 Airflow_Temperature_Cel -O---K   070   049   045    -    30 (Min/Max 27/32)
> 194 Temperature_Celsius     -O---K   030   051   000    -    30 (0 20 0 0 0)
> 195 Hardware_ECC_Recovered  -O-RC-   034   003   000    -    165391146
> 197 Current_Pending_Sector  -O--C-   093   083   000    -    310
> 198 Offline_Uncorrectable   ----C-   093   083   000    -    310
> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    26
> 240 Head_Flying_Hours       ------   100   253   000    -    46718 (49 76 0)
> 241 Total_LBAs_Written      ------   100   253   000    -    1725386978
> 242 Total_LBAs_Read         ------   100   253   000    -    265479204
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning

I have yet to find a good explanation for reading smartctl reports.
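
As a rough aid for reading an attribute table like the one quoted above, here is a minimal Python sketch (it is not part of smartmontools). It applies the rule from the smartctl documentation that an attribute has failed when its normalized VALUE is less than or equal to THRESH, and it separately reports nonzero raw counts on a few sector-health attributes. The function name `triage`, the set `COUNT_ATTRS`, and the choice of which IDs belong in it are assumptions made for illustration; the `SAMPLE` rows are copied from the report above. Raw values of rate-type attributes such as Raw_Read_Error_Rate are deliberately ignored, on the assumption that they are not plain error counts.

```python
import re

# IDs whose raw value is treated as a plain count of bad or suspect sectors.
# Which IDs to include is an assumption for this sketch.
COUNT_ATTRS = {5, 183, 187, 197, 198}

# ID, name, flags, VALUE, WORST, THRESH, FAIL, leading integer of RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)

SAMPLE = """\
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR--   117   095   006    -    165391146
>   5 Reallocated_Sector_Ct   PO--CK   072   072   036    -    1181
> 183 Runtime_Bad_Block       -O--CK   092   092   000    -    8
> 187 Reported_Uncorrect      -O--CK   011   011   000    -    89
> 197 Current_Pending_Sector  -O--C-   093   083   000    -    310
> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    26
"""


def triage(report):
    """Return (attribute_name, reason) pairs worth a closer look."""
    findings = []
    for line in report.splitlines():
        # Drop email quote markers and leading spaces before matching.
        match = ROW.match(line.lstrip("> "))
        if match is None:
            continue  # header, legend, or unrelated text
        attr_id = int(match.group(1))
        name = match.group(2)
        value = int(match.group(4))
        thresh = int(match.group(6))
        raw = int(match.group(8))
        if thresh > 0 and value <= thresh:
            findings.append((name, "normalized value at or below threshold"))
        elif attr_id in COUNT_ATTRS and raw > 0:
            findings.append((name, "raw count %d" % raw))
    return findings


if __name__ == "__main__":
    for name, reason in triage(SAMPLE):
        print(name, "-", reason)
```

None of the sample rows is at or below its threshold, which is consistent with the overall "PASSED" result, yet four of them carry nonzero sector counts. The sketch only highlights rows; it does not decide whether a drive is safe to keep.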
This post gives some clues:

https://ubuntuforums.org/showthread.php?t=2192335

Here are the statistics for my ST3000DM001, for comparison:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   115   099   006    -    90256224
  3 Spin_Up_Time            PO----   094   094   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    577
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   063   060   030    -    1955231
  9 Power_On_Hours          -O--CK   096   096   000    -    3552
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    576
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   000    -    0
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   070   059   045    -    30 (Min/Max 19/30)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    35
193 Load_Cycle_Count        -O--CK   100   100   000    -    1323
194 Temperature_Celsius     -O---K   030   041   000    -    30 (0 17 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    269092585999820
241 Total_LBAs_Written      ------   100   253   000    -    2338230420
242 Total_LBAs_Read         ------   100   253   000    -    19882466886

These statistics for your drive look suspicious:

Reallocated_Sector_Ct
Reported_Uncorrect
Runtime_Bad_Block

> ...
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> Device Error Count: 89 (device log contains only the most recent 20 errors)

That's not good.
Mine says:

No Errors Logged

> SMART Extended Self-test Log Version: 1 (1 sectors)
> Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline     Completed: read failure  90%        46194

This could be SeaTools (?).

Let us know how it turns out.

David
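
On the question raised at the top of the thread (8193 or 32768 for `e2fsck -b`): according to the e2fsck man page, the location of the first backup superblock depends on the filesystem block size, namely 8193 for 1 KiB blocks, 16384 for 2 KiB blocks, and 32768 for 4 KiB blocks. The Python sketch below reproduces those numbers under the assumption that the filesystem was created with mke2fs defaults (8 * block_size blocks per group, and the sparse_super feature, which keeps backups in group 1 and in groups that are powers of 3, 5 and 7). The function `backup_superblocks` and its parameters are names made up for illustration and are not part of e2fsprogs; the filesystem sizes in the example calls are invented.

```python
def backup_superblocks(block_size, total_blocks):
    """List backup superblock locations (in filesystem blocks).

    Assumes mke2fs defaults: 8 * block_size blocks per group and the
    sparse_super layout. A filesystem created with other options will
    have its backups elsewhere.
    """
    blocks_per_group = 8 * block_size
    # With 1 KiB blocks the first data block is 1; otherwise it is 0.
    first_data_block = 1 if block_size == 1024 else 0
    usable = total_blocks - first_data_block
    group_count = -(-usable // blocks_per_group)  # ceiling division

    # sparse_super: backups live in group 1 and powers of 3, 5 and 7.
    groups = {1}
    for base in (3, 5, 7):
        group = base
        while group < group_count:
            groups.add(group)
            group *= base

    return [
        first_data_block + group * blocks_per_group
        for group in sorted(groups)
        if group < group_count
    ]


if __name__ == "__main__":
    # Invented sizes, only to show the pattern for each block size.
    print(backup_superblocks(1024, 100000))
    print(backup_superblocks(4096, 1000000))
```

Under these assumptions, which number to try comes down to the block size of the logical volume in question: 8193 only applies to a filesystem with 1 KiB blocks, and 32768 to one with 4 KiB blocks.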