On 30 September 2014 at 22:39:00 CEST, Bob Friesenhahn
<[email protected]> wrote:
>On Tue, 30 Sep 2014, Schweiss, Chip wrote:
>
>> 
>> 
>> On Tue, Sep 30, 2014 at 11:52 AM, Bob Friesenhahn
><[email protected]> wrote:
>>
>>             Presumably because the checksum is wrong.
>> If by turning off 'sync' it is meant that the zil is disabled, then
>that has nothing to do with zfs checksums being
>> wrong.  If drive cache flush is disabled for async transaction
>groups, then nothing but problems can result (e.g.
>> failure to import the pool at all).
>> 
>> 
>> I doubt the pool would ever not be importable.  Data loss sure.  ZFS
>will be rolled back to the last completed TXG.  Like I
>> said before, on this pool data loss is not an issue as long as we
>know it's lost.   Losing the entire pool because a power
>> failure is not an issue.   All the processing pipelines using the
>pool at the time would have lost power too and would be
>
>Obviously it does happen since there have been reports to the zfs 
>mailing list over the years about pools which were completely lost 
>because the drive firmware did not honor cache flush requests.
>
>There are only so many (20?) TXG records which are discoverable.
>
>This has nothing to do with zfs tunables though.  Zfs always issues 
>drive cache flush requests for each TXG since otherwise the pool would 
>be insane.
>
>Bob

> There are only so many (20?) TXG records which are discoverable.

There is a 128KB ring buffer of ZFS structure 'roots' (uberblocks): 32 entries 
on 4KB-sectored drives, 256 on 512B-sectored drives. However, there is no 
guarantee that all of an older tree is still consistent (blocks may have been 
released and reused by newer transactions that are now ignored due to the 
rollback). Statistically, the older a TXG is, the more likely such corruption 
is after a rollback.
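
For illustration, here is a quick back-of-the-envelope sketch (Python, taking 
the 128KB ring size and the one-slot-per-sector assumption above at face 
value) of where those counts come from:

    # Back-of-the-envelope sketch: the 128KB ring of ZFS 'roots' (uberblocks)
    # divided by the assumed slot width (one drive sector) gives the slot count.
    RING_BYTES = 128 * 1024

    for sector_bytes in (4096, 512):      # 4KB- vs 512B-sectored drives
        slots = RING_BYTES // sector_bytes
        print(f"{sector_bytes:>4}B sectors -> {slots} discoverable roots")
    # prints 32 for 4KB sectors and 256 for 512B sectors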

Note that metadata is usually written with 2 or 3 copies (on top of the 
raidzN/mirror redundancy), so it is more likely to remain intact (at least one 
consistent ditto copy), while userdata is typically single-copy and is thus 
more likely to suffer.
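
As a very rough illustration of why ditto copies help (plain probability 
arithmetic assuming independent per-copy loss, which is a simplification and 
not anything ZFS actually computes):

    # Rough sketch: chance that at least one of N ditto copies of a block
    # survives, assuming each copy is independently lost with probability p_loss.
    def survival_chance(p_loss, copies):
        return 1.0 - p_loss ** copies

    for copies in (1, 2, 3):              # userdata vs typical metadata ditto counts
        print(copies, "copies ->", survival_chance(0.01, copies))
    # 1 copy: 0.99, 2 copies: 0.9999, 3 copies: 0.999999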

Regarding other fatal faults: I've had a raidz2 system built from consumer 
hardware where who knows what voodoo was going on, but over time previously 
fine blocks kept turning into unrecoverable garbage. That is, either 'all' or 
'sufficiently many' of the 6 disks (a 4+2 set) were corrupted at the same 
offsets. Maybe some signal noise was misinterpreted by all 6 firmwares as 
write commands. Maybe partitioning the disks so that the zfs slices all 
started at different sector offsets would have helped here, if that particular 
guess (one of many) about the cause is correct...

In the same manner, that pool had errors not only in named files but also in 
'metadata:<0x...>' items, which may or may not have been problematic. In my 
experience I've had unimportable pools, but mostly on virtual systems that 
lied about cache flushing and/or issued I/Os out of order during a kernel or 
power failure.

More often there were time-consuming operations (like large deletions on, or 
of, a deduped dataset) that were 'backgrounded' while the pool was 'live' but 
became foreground prerequisites for the pool import after a reboot. These 
could take days to complete, and many reboots, if the operation required more 
RAM than the system had...

HTH,
Jim
--
Typos courtesy of K-9 Mail on my Samsung Android
