Quoting Vali Dragnuta <[email protected]>:

> I think what he means is that by having checksums (and, I think, also
> some error correction code) for every unit of data, on every standard
> read (as in a read during NORMAL use) you can tell relatively easily
> whether the data you read is OK or not (and, if you also have some
> redundancy, even fix the situation).

Exactly. The verification happens automatically during normal use,
without you having to do anything special.

> Doing this with raid1 can be somewhat harder, because when reading a
> block of data most implementations do not actually read from both
> disks and compare whether the two copies are identical or not.

Exactly that, too.

> Not that it couldn't be done, but it isn't always done.
> That zfs scrub probably forces this verification over the whole
> filesystem, but you don't depend on it for error detection.

Exactly.

>
> It is also true that for every low-level block on a hard disk there
> is another checksum, validated by the disk's internal controller and,
> where appropriate, reported/reallocated. But that only protects you
> against errors that appeared on the magnetic surface after the moment
> of writing.
>
> There are enough worrying things about zfs, such as how easy it is,
> once your disks really have been corrupted beyond a certain level, to
> repair the filesystem with fsck (and perhaps even end up with some
> recovered data).

It has NO fsck. Everything written to disk is written in transactions:
only after the data has reached the platters are the metadata added. It
is copy-on-write. I tested this on an HDD attached over USB: wrote,
modified and renamed files, cut the power to the USB disk, powered it
back on, checked. A few hours of this in total, and the data was exactly
as it should be. Whatever does not get written completely simply does
not appear in the FS.
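
Roughly, the trick is: never overwrite data in place, and only publish
the new data by atomically switching a pointer after the data itself has
been synced. A minimal sketch of that idea in Python (my own
illustration, not ZFS code; the file names are invented for the
example):

    import json, os, tempfile

    STATE = "pointers.json"   # stands in for the root pointer / uberblock

    def commit(data: bytes):
        # 1. write the new data to a NEW place, never over the old copy
        fd, newblock = tempfile.mkstemp(dir=".")
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())           # data reaches the platters first
        # 2. only then publish it, by atomically replacing the pointer file
        with open(STATE + ".tmp", "w") as f:
            json.dump({"current": newblock}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(STATE + ".tmp", STATE)  # atomic: old tree or new tree, never a mix

If the power dies anywhere before the final os.replace(), the old
pointer still references the old, complete data, so a half-written
update simply never becomes visible - which is what the power-cut test
showed.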

> Ext3/4 is relatively robust from this point of view; you have a good
> chance of restoring a significant amount of data from a volume full of
> logical errors. I don't know how easy that is on a more complicated
> zfs setup when the "shit hits the fan". But on the usefulness of
> having checksums for what it stores and validating the checksum on
> every read of units from the disk, versus not having them at all, it
> is clearly better to have them. What's more, even the applications
> above come along in their own layer and compute yet another checksum
> for their own blocks (databases being one example).
>
> The argument along the lines of "we don't need extra safeguards
> because we only use good cables" strikes me as a bit childish; as long
> as validating a checksum on read is an operation with minimal cost,
> you cannot argue that it isn't good to ALSO have such a thing just
> because you have good cables and enterprise disks. Even those fail,
> only with lower probability.
>
Correct.

> The irony is that I am a zfs sceptic myself, both for the reason
> above regarding the chances of recovery when things really go down the
> drain, and because I cannot use it "normally" or "officially" on linux
> kernels (building it by hand does not strike me as convenient).

  I have been using zfs-fuse for 2 years, on 7-8 machines (samba, mail
servers, squid, gluster, openvz - all on top of zfs, either with a zfs
file system or with zfs exporting a block device formatted as e.g.
ext4). I have had no problems and have seen no failures. For the last
half year I have been using the kernel module on about 6 machines and a
laptop, on pretty much all the distributions (centos/redhat,
debian/ubuntu); you get rpms/debs that are recompiled automatically via
dkms. Again, no problems here either - not on debian, not on centos and
not on ubuntu.

> But I have to admit that at least on this point of validating the
> data that is read it does have an advantage, and it seems silly to me
> to argue that an extra safety mechanism is bad because you care about
> your data and have good cables and top-of-the-line hardware.
>

That is exactly what I was thinking, too.

Wikipedia:

"ZFS is a combined file system and logical volume manager designed by  
Sun Microsystems. The features of ZFS include protection against data  
corruption, support for high storage capacities, efficient data  
compression, integration of the concepts of filesystem and volume  
management, snapshots and copy-on-write clones, continuous integrity  
checking and automatic repair, RAID-Z and native NFSv4 ACLs.

Data integrity
One major feature that distinguishes ZFS from other file systems is  
that ZFS is designed with a focus on data integrity. That is, it is  
designed to protect the user's data on disk against silent data  
corruption caused by bit rot, current spikes, bugs in disk firmware,  
phantom writes (the write is dropped on the floor), misdirected  
reads/writes (the disk accesses the wrong block), DMA parity errors  
between the array and server memory or from the driver (since the  
checksum validates data inside the array), driver errors (data winds  
up in the wrong buffer inside the kernel), accidental overwrites (such  
as swapping to a live file system), etc.

Error rates in hard disks
A modern hard disk devotes a significant portion of its capacity to  
the error detection data. For example, a typical 1 TB hard disk with  
512-byte sectors also provides additional capacity of about 93 GB for
the ECC data.[55] Many errors occur during normal usage, but are  
corrected by the disk's firmware, and thus are not visible to the host  
software. Only a tiny fraction of the detected errors ends up as not  
correctable.

For example, specification for an enterprise SAS disk (a model from  
2013) estimates this fraction to be one uncorrected error in every  
10^16 bits,[56] and another SAS enterprise disk from 2013 specifies
similar error rates.[57] Another modern (as of 2013) enterprise SATA  
disk specifies an error rate of less than 10 non-recoverable read  
errors in every 10^16 bits.[58] An enterprise disk with a Fibre
Channel interface, which uses 520 byte sectors to support the Data  
Integrity Field standard to combat data corruption, specifies similar  
error rates in 2005.[59]

ZFS data integrity

For ZFS, data integrity is achieved by using a (Fletcher-based)  
checksum or a (SHA-256) hash throughout the file system tree.[70] Each  
block of data is checksummed and the checksum value is then saved in  
the pointer to that block—rather than at the actual block itself.  
Next, the block pointer is checksummed, with the value being saved at  
its pointer. This checksumming continues all the way up the file  
system's data hierarchy to the root node, which is also checksummed,  
thus creating a Merkle tree.[70] In-flight data corruption or Phantom  
reads/writes (the data written/read checksums correctly but is  
actually wrong) are undetectable by most filesystems as they store the  
checksum with the data. ZFS stores the checksum of each block in its  
parent block pointer so the entire pool self-validates.[71]

When a block is accessed, regardless of whether it is data or  
meta-data, its checksum is calculated and compared with the stored  
checksum value of what it "should" be. If the checksums match, the  
data are passed up the programming stack to the process that asked for  
it. If the values do not match, then ZFS can heal the data if the  
storage pool has redundancy via ZFS mirroring or RAID.[72] If the  
storage pool consists of a single disk, it is possible to provide such  
redundancy by specifying "copies=2" (or "copies=3"), which means that  
data will be stored twice (thrice) on the disk, effectively halving  
(or, for "copies=3", reducing to one third) the storage capacity of  
the disk.[73] If redundancy exists, ZFS will fetch a copy of the data  
(or recreate it via a RAID recovery mechanism), and recalculate the  
checksum—ideally resulting in the reproduction of the originally  
expected value. If the data passes this integrity check, the system  
can then update the faulty copy with known-good data so that  
redundancy can be restored.

Resilvering and scrub
ZFS has no fsck repair tool equivalent, common on Unix filesystems,  
which does file system validation and file system repair.[75] Instead,  
ZFS has a repair tool called "scrub" which examines and repairs Silent  
Corruption and other problems. Some differences are:

- fsck must be run on an offline filesystem, which means the filesystem
  must be unmounted and is not usable while being repaired.
- scrub does not need the ZFS filesystem to be taken offline. scrub is
  designed to be used on a working, mounted, live filesystem.
- fsck usually only checks metadata (such as the journal log) but never
  checks the data itself. This means, after an fsck, the data might
  still be corrupt.
- scrub checks everything, including metadata and the data. The effect
  can be observed by comparing fsck to scrub times: sometimes an fsck on
  a large RAID completes in a few minutes, which means only the metadata
  was checked. Traversing all metadata and data on a large RAID takes
  many hours, which is exactly what scrub does.
The official recommendation from Sun/Oracle is to scrub once every  
month with Enterprise disks, because they have much higher reliability  
than cheap commodity disks. If using cheap commodity disks, scrub  
every week.[76][77]"

https://en.wikipedia.org/wiki/ZFS
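
To make the "checksum stored in the parent block pointer" and the
self-healing read from the excerpt above more concrete, here is a
minimal toy sketch in Python (my own illustration under simplified
assumptions: two in-memory dicts stand in for the two halves of the
mirror and SHA-256 is the checksum; none of this is ZFS code):

    import hashlib

    def checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    class ToyMirror:
        def __init__(self):
            self.disks = [{}, {}]    # block_id -> bytes, one dict per "disk"
            self.pointers = {}       # block_id -> checksum, kept OUTSIDE the data block

        def write(self, block_id, data: bytes):
            for disk in self.disks:
                disk[block_id] = data
            self.pointers[block_id] = checksum(data)   # the "parent" remembers the checksum

        def read(self, block_id) -> bytes:
            expected = self.pointers[block_id]
            for disk in self.disks:
                data = disk.get(block_id)
                if data is not None and checksum(data) == expected:
                    for other in self.disks:            # self-heal: fix any stale copy
                        if other.get(block_id) != data:
                            other[block_id] = data
                    return data
            raise IOError("both copies of block %r fail the checksum" % block_id)

    # the "A1B2" turning into "A1B0" scenario from the RAID1 example
    # quoted further down in this thread:
    m = ToyMirror()
    m.write("blk0", b"A1B2")
    m.disks[0]["blk0"] = b"A1B0"            # silent corruption on HDD1
    assert m.read("blk0") == b"A1B2"        # detected, served from HDD2
    assert m.disks[0]["blk0"] == b"A1B2"    # and HDD1 has been repaired

A plain RAID1 read has nothing to compare the returned bytes against
beyond the drive's own sector ECC, so in the "A1B0" case the bad data is
handed to the application as if it were fine. With copies=2 on a single
disk the principle above stays the same, only both copies live on the
same device.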
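
A scrub, in the same toy model, is just this verification forced over
every allocated block instead of waiting for a normal read to hit it
(again only a sketch; a real scrub also walks all the metadata and runs
on a live, mounted pool):

    def scrub(mirror: ToyMirror):
        repaired, unrecoverable = 0, 0
        for block_id in list(mirror.pointers):       # only blocks that actually hold data
            before = [d.get(block_id) for d in mirror.disks]
            try:
                mirror.read(block_id)                 # read() already detects and self-heals
            except IOError:
                unrecoverable += 1                    # both copies bad: report, nothing to heal from
                continue
            if before != [d.get(block_id) for d in mirror.disks]:
                repaired += 1
        return repaired, unrecoverable

Because the loop only walks mirror.pointers, i.e. allocated data, a
scrub or a resilver touches only the data that exists (the 10 GB or
500 GB from the examples quoted below), whereas a full md RAID1 resync
copies the whole 2 TB device, since md keeps no record of which blocks
hold live filesystem data.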


> On Fri, 2013-11-15 at 13:16 +0200, Andrei Pascal wrote:
>> Oh my, dear me... You can let that scrub run, it's true - but it
>> will still thrash your disks. And, I ask, WHEN do you run the scrub?
>> If you run it at write time and that's that, there is no difference
>> between ZFS and RAID 5, for example. Besides, it's common sense that
>> in a mirror you don't put disks from the same production batch.
>>
>> The argument about rebuilding the ZFS mirror is valid, but, as I said
>> before, ZFS is both a volume manager and a filesystem, and it only
>> has commercial support on Solaris. For home use, though, it also runs
>> on *bsd, so no harm done. Or if you don't care about support.
>>
>>
>> 2013/11/15 Iulian Murgulet <[email protected]>
>>
>> >
>> >
>> > ... let me try once more.
>> >
>> > 1. MD RAID1 (2xHDD)
>> >
>> >   I write a block of data that, say, contains "A1B2" (as an idea) to
>> > a /dev/mdX; behind the scenes, md will write that identical block to
>> > HDD1 and to HDD2. HDD1 and 2 say done, it's OK, we're finished.
>> >
>> >
>> >   I read that block of data 3 months later (hypothetically); md will
>> > read it from HDD1 or from HDD2 (round-robin), and if it can read it
>> > without errors, it says it is OK.
>> >
>> > - during those 3 months something happened (whatever you care to
>> > imagine), and when I read, it SUCCESSFULLY reads "A1B0" - is that
>> > good? It most certainly is not.
>> >
>> > 2. ZFS mirror (2xHDD)
>> >
>> > Before writing, I compute a checksum for "A1B2", and I write both
>> > the data block AND the checksum to HDD1 and HDD2, on each disk.
>> >
>> >   I read that block of data 3 months later (hypothetically); zfs
>> > will read it from HDD1 or from HDD2 (round-robin), I compute the
>> > checksum of what I read and compare it with the checksum stored on
>> > disk at write time; if the check passes, it is OK.
>> >   If the check does not pass, then I read the same mirrored block
>> > again, the copy that sits on HDD2. If the check is OK here, then I
>> > write that same block back to HDD1 as well, and I return the CORRECT
>> > data for that block to the application.
>> >
>> >    This is what ZFS scrub does: it reads every block of data (and
>> > note, if I have a 2 TB pool but only 10 GB of data, only those 10 GB
>> > of data are checked) and verifies that it matches the checksum
>> > stored at write time.
>> >
>> >
>> > 1. MD RAID1 (2xHDD)
>> >
>> > - for whatever reason HDD2 was dropped from the raid (let's not
>> > discuss the cause)
>> > - we put it back into the raid after, say, 3 hours (the same cheap
>> > disk ... or a brand-new enterprise-class one); OK, the resync
>> > starts, from scratch, over 2 TB even though I only have 500 GB of
>> > data;
>> >
>> > 2. ZFS mirror (2xHDD)
>> >
>> > - for whatever reason HDD2 was dropped from the raid (let's not
>> > discuss the cause)
>> > - we put it back into the raid after, say, 3 hours (the same cheap
>> > disk ... or a brand-new enterprise-class one); OK, the resync
>> > starts, but it will resync only the 500 GB of data and not 2 TB.
>> >
>> > .... that's about it.


