It may be worth taking a ZFS snapshot of the OST before making any mass 
changes on it, both to investigate the original issue and as a safeguard in 
case things get worse if the underlying ZFS metadata are broken.
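For example (pool and dataset names below are placeholders; the script only 
prints the snapshot command so you can review it before running anything):

```shell
#!/bin/sh
# Hedged sketch with hypothetical names; adjust to your system.
POOL=pool-01            # placeholder pool name
OST=lustre-0001         # placeholder OST dataset name
LABEL=20161213          # e.g. today's date
SNAP="${POOL}/${OST}@pre-change-${LABEL}"
# Print the command for review; remove 'echo' to actually take the snapshot.
echo zfs snapshot "$SNAP"
```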


Did you scrub the pool (or a snapshot/clone of it) before migrating files 
out? A scrub will not fix the data, but it may repair metadata and will 
point out any corruption.
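For example, with a hypothetical pool name (the commands are only echoed 
here for review):

```shell
#!/bin/sh
# Hedged sketch; 'pool-01' is a placeholder. On a pool backed by hardware
# RAID with no ZFS-level redundancy, a scrub detects but cannot repair
# damaged data blocks; ZFS keeps extra copies of metadata, so some metadata
# damage may still be repaired.
POOL=pool-01
# Remove 'echo' to actually run the commands.
echo "zpool scrub $POOL"        # start the scrub (runs in the background)
echo "zpool status -v $POOL"    # check progress and list damaged files
```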


On LU-5155, Chris M. published a link to his script 
zfsobj2fid<https://raw.githubusercontent.com/chaos/lustre-tools-llnl/1.8/scripts/zfsobj2fid>, 
which dumps ZFS objects and converts them to FIDs. You may want to look at 
the dump the script generates, and at the script itself. It may be worth 
starting with a known-good file.
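As a rough sketch of the conversion the script performs (this is my reading 
of it, and the example bytes below are made up, not from a real OST): zdb 
prints the trusted.fid xattr as the little-endian bytes of an lu_fid, i.e. 
an 8-byte sequence, a 4-byte object id and a 4-byte version, which just 
need byte-swapping:

```shell
#!/bin/sh
# Hedged sketch: convert a little-endian trusted.fid hex dump (as printed
# by 'zdb -dddd pool/dataset <object_id>') to [seq:oid:ver] form.
# The bytes below are a hypothetical example.
fid_hex="00 04 00 00 02 00 00 00 8a 33 00 00 00 00 00 00"
set -- $fid_hex
seq=0x$8$7$6$5$4$3$2$1              # reverse the 8 sequence bytes
oid=0x${12}${11}${10}$9             # reverse the 4 object-id bytes
ver=0x${16}${15}${14}${13}          # reverse the 4 version bytes
printf '[0x%x:0x%x:0x%x]\n' "$seq" "$oid" "$ver"
# prints [0x200000400:0x338a:0x0]
```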


Alex.


________________________________
From: lustre-discuss <[email protected]> on behalf of 
Jesse Stroik <[email protected]>
Sent: Tuesday, December 13, 2016 1:15:28 PM
To: Crowe, Tom
Cc: [email protected]
Subject: Re: [lustre-discuss] LustreError on ZFS volumes

We discussed a course of action this morning and decided that we'd start
by migrating the files off the OST. Testing suggests that files which
cannot be completely read will be left on OST0002.

Due to the nature of the corruption (a faulty hardware RAID controller),
it seems unlikely we'll be able to meaningfully recover any files that were
corrupted. This is something we may evaluate more closely once the
lfs_migrate is complete and we have our file list.
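Roughly, the migration looks like this (filesystem name and mount point are 
hypothetical, and the pipeline is only echoed here rather than run):

```shell
#!/bin/sh
# Hedged sketch, not a tested procedure. 'lfs find --obd' lists files with
# objects on the given OST; lfs_migrate reads that list from stdin and
# copies each file to other OSTs. Run on a Lustre client when ready.
MNT=/mnt/lustre                 # placeholder client mount point
OST_UUID=lustre-OST0002_UUID    # placeholder OST UUID (see 'lctl dl')
CMD="lfs find --obd ${OST_UUID} ${MNT} | lfs_migrate -y"
# Print the pipeline for review instead of executing it.
echo "$CMD"
```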

We'll then share the list of corrupted files with our users and find out
the cost of the lost data. If it's reasonably reproducible, we'll
reinitialize the RAID array and reformat the vdev.

Thanks for your help, Tom!

Best,
Jesse Stroik



On 12/12/2016 03:51 PM, Crowe, Tom wrote:
> Hi Jesse,
>
> Regarding seeing 370 objects with errors from ‘zpool status’ while having 
> over 400 files with “access issues”: I would suggest running ‘zpool scrub’ 
> to identify all the ZFS objects in the pool that are reporting permanent 
> errors.
>
> It would be very important to have a complete list of files w/issues, before 
> replicating the VDEV(s) in question.
>
> You may also want to dump the zdb information for the source VDEV(s) with the 
> following:
>
> zdb -dddddd source_pool/source_vdev > /some/where/with/room
>
> For example, if the zpool was named pool-01, and the VDEV was named 
> lustre-0001 and you had free space in a filesystem named /home:
>
> zdb -dddddd pool-01/lustre-0001 > /home/zdb_pool-01_0001_20161212.out
>
> There is a great wealth of data zdb can share about your files. Having the 
> output may prove helpful down the road.
>
> Thanks,
> Tom
>
>> On Dec 12, 2016, at 4:39 PM, Jesse Stroik <[email protected]> wrote:
>>
>> Thanks for taking the time to respond, Tom,
>>
>>
>>> For clarification, it sounds like you are using hardware based RAID-6, and 
>>> not ZFS raid? Is this correct? Or was the faulty card simply an HBA?
>>
>>
>> You are correct. This particular file system is still using hardware RAID6.
>>
>>
>>> At the bottom of the ‘zpool status -v pool_name’ output, you may see paths 
>>> and/or zfs object ID’s of the damaged/impacted files. This would be good to 
>>> take note of.
>>
>>
>> Yes, I output this to files at a few different times, and we've had no 
>> change since replacing the RAID controller, which makes me feel reasonably 
>> comfortable leaving the file system in production.
>>
>> There are 370 objects listed by zpool status -v but I am unable to access at 
>> least 400 files. Almost all of our files are single stripe.
>>
>>
>>> Running a ‘zpool scrub’ is a good idea. If the zpool is protected with "ZFS 
>>> raid", the scrub may be able to repair some of the damage. If the zpool is 
>>> not protected with "ZFS raid", the scrub will identify any other errors, 
>>> but likely NOT repair any of the damage.
>>
>>
>> We're not protected with ZFS RAID, just hardware raid6. I could run a patrol 
>> on the hardware controller and then a ZFS scrub if that makes the most sense 
>> at this point. This file system is scheduled to run a scrub the third week 
>> of every month so it would run one this weekend otherwise.
>>
>>
>>
>>> If you have enough disk space on hardware that is behaving properly (and 
>>> free space in the source zpool), you may want to replicate the VDEV’s (OST) 
>>> that are reporting errors. Having a replicated VDEV can afford you the 
>>> ability to examine the data without fear of further damage. You may also 
>>> want to extract certain files from the replicated VDEV(s) which are 
>>> producing IO errors on the source VDEV.
>>>
>>> Something like this for replication should work:
>>>
>>> zfs snap source_pool/source_ost@timestamp_label
>>> zfs send -Rv source_pool/source_ost@timestamp_label | zfs receive 
>>> destination_pool/source_ost_replicated
>>>
>>> You will need to set zfs_send_corrupt_data to 1 in 
>>> /sys/module/zfs/parameters or the ‘zfs send’ will error and fail when 
>>> sending a VDEV with read and/or checksum errors.
>>> Enabling zfs_send_corrupt_data allows the zfs send operation to complete. 
>>> Any blocks that are damaged on the source side will have the pattern 
>>> “0x2f5baddb10c” written in their place on the destination side. This can 
>>> be helpful in determining whether an entire file is corrupt, or only 
>>> parts of it.
>>>
>>> After the replication, you should set the replicated VDEV to read only with 
>>> ‘zfs set readonly=on destination_pool/source_ost_replicated’
>>>
>>
>> Thank you for this suggestion. We'll most likely do that.
>>
>> Best,
>> Jesse Stroik
>>
>

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
