Ralf,

> Jim, first of all: I never said that AVS is a bad product. And I
> never will. I wonder why you act as if you were attacked personally.
> To be honest, if I were a customer with the original question, such
> a reaction wouldn't make me feel safer.

I am sorry that my response came across that way; it was not
intentional.

>
>>> - ZFS is not aware of AVS. On the secondary node, you'll always
>>> have to force the `zpool import` due to the unnoticed changes of
>>> metadata (zpool in use).
>> This is not true. If one invokes "zpool export" on the primary node
>> while replication is still active, then a forced "zpool import" is
>> not required. This behavior is the same as with a zpool on
>> dual-ported or SAN storage, and is NOT specific to AVS.
>
> Jim. A graceful shutdown of the primary node may be a valid disaster
> scenario in the laboratory, but it never will be in real life.

I agree with your assessment that a 'zpool export' will never be done
in a real disaster, but unconditionally doing a forced 'zpool import'
is problematic. Prior to performing the forced import, one needs to
ensure that the primary node is actually down and is not in the
process of booting up, or that replication is stopped and will not
automatically resume.

Failure to make these checks prior to a forced 'zpool import' could
lead to scenarios where two or more instances of ZFS are accessing the
same ZFS storage pool, each attempting to write its own metadata, and
thus its own CRCs. In time this will result in CRC checksum failures
on reads, followed by a ZFS-induced panic.
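
As a minimal sketch of those checks, run on the secondary node before
the forced import (the host name "primary-host", the pool name "tank"
and the unqualified sndradm invocations are placeholders for
illustration, not part of any particular configuration):

        # 1. Confirm the primary node is actually down, and staying down.
        ping primary-host

        # 2. Confirm the replica sets are in logging mode and will not
        #    automatically resume; if in doubt, place them into logging
        #    mode explicitly.
        sndradm -P
        sndradm -l

        # 3. Only then perform the forced import on the secondary node.
        zpool import -f tank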


>>> No mechanism to prevent data loss exists, e.g. zpools can be
>>> imported when the replicator is *not* in logging mode.
>> This behavior is the same as with a zpool on dual-ported or SAN  
>> storage, and is NOT specific to AVS.
>
> And what makes you think that I said that AVS is the problem here?
>
> And by the way, the customer doesn't care *why* there's a problem.  
> He only wants to know *if* there's a problem.

There is a mechanism to prevent data loss here: it's AVS! That is why
I questioned the association made above of replication being part of
the problem, when in fact the way replication is implemented with AVS
makes it part of the solution.

If one does not follow the guidance suggested above before invoking a
forced 'zpool import', the action will likely result in on-disk CRC
checksum inconsistencies within the ZFS storage pool, and thus in data
loss on the secondary node, the initial point above. Since AVS
replication is unidirectional, there is no data loss on the primary
node, and when replication is resumed, AVS will undo the faulty
secondary node writes, correcting the data loss and in time restoring
100% synchronization of the ZFS storage pool between the primary and
secondary nodes.
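
As a hedged illustration of that recovery path (pool name and set
selection again placeholders): once the secondary node stops using the
pool and the sets are back under AVS control, an update
synchronization from the primary re-copies the bitmap-tracked blocks
and rolls the secondary back to a consistent replica:

        # On the secondary node: stop using the storage pool.
        zpool export tank

        # On the primary node: resume replication with an update sync;
        # blocks written on the secondary while it was in logging mode
        # are overwritten from the primary copy.
        sndradm -u

        # Confirm the sets return to the replicating state.
        sndradm -P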

>
>>> - AVS is not ZFS aware.
>> AVS is not UFS, QFS, Oracle, or Sybase aware either. This makes
>> AVS, and other host-based and controller-based replication
>> services, multi-functional. If you desire ZFS aware functionality,
>> use ZFS send and recv.
>
> Yes, exactly. And that's the problem, since `zfs send` and `zfs
> receive` are not a working solution in a fail-safe two node
> environment. Again: the customer doesn't care *why* there's a
> problem. He only wants to know *if* there's a problem.

My takeaway from this is that both AVS and ZFS are data path services,
but on their own they do not constitute a complete disaster recovery
solution. Since AVS is not aware of ZFS, and vice versa, additional
software in the form of Solaris Cluster, GeoCluster or other
purpose-built software needs to provide that awareness, so that viable
disaster recovery solutions are possible, and supportable.


>>> For instance, if ZFS resilvers a mirrored disk, e.g. after
>>> replacing a drive, the complete disk is sent over the network to
>>> the secondary node, even though the replicated data on the
>>> secondary is intact.

The problem with this statement is that one cannot guarantee that the
replicated data on the secondary is intact, specifically that the data
is 100% identical to the non-failing side of the mirror on the primary
node. Of course, if this guarantee could be assured, then an
"sndradm -E ..." (equal enable) could be done, and the full disk copy
could be avoided. But all is not lost...

A failure in writing to a mirrored volume almost assures that the data
will be different by at least one I/O: the one that triggered the
initial failure of the mirror. The momentary upside is that AVS is
interposed above the failing volume, so the I/O will get replicated
even if it failed to make it to the disk. The downside is that with
ZFS (or any other mirroring software), once a failure is detected, the
mirroring software will stop writing to the side of the mirror
containing the failed disk (and thus to the configured AVS replica),
but will still continue to write to the non-failing side of the
mirror. This assures that the good side of the mirror and the replica
will be out of sync.

>>>
>> The complete disk IS NOT sent over the network to the secondary
>> node, only those disk blocks that are re-written by ZFS.
>
> Yes, you're right. But sadly, in the mentioned scenario of having
> replaced an entire drive, the entire disk is rewritten by ZFS.

I have to believe that the issue being referred to is an
order-of-enabling issue: one needs to enable AVS before ZFS. Let me
explain.

If I have a replacement volume that has yet to be given to ZFS, it
contains unknown data. Likewise, its replacement volume on the
secondary node also contains unknown data (even if this volume is the
one above, as it is known to not be 100% intact). If one were to
enable these two volumes with "sndradm -E ...", where 'E' means equal
enable, this tells the replication software that unknown data equals
unknown data, therefore no replication is needed to bring the two
volumes into synchronization. Now when one gives the primary node
volume to ZFS as a replacement, ZFS, and thus AVS, only needs to
rewrite those metadata and data blocks that are in use by ZFS on the
remaining good side of the mirror. This means a full copy is avoided,
unless of course the volume is full.

Conversely, if one gives the replacement volume to ZFS prior to
enabling the volume in AVS for replication, then "sndradm -E ..."
cannot be used, as the volumes are not starting out equal, and AVS was
not running to scoreboard the differences. Therefore "sndradm -e ..."
must be used, and in this case the entire disk will be replicated.
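
To make the ordering concrete, here is a sketch of the first case (the
host names, pool name, and device/bitmap paths are purely
illustrative):

        # 1. Equal-enable the replacement volume and its bitmap on both
        #    nodes BEFORE handing the volume to ZFS; unknown data equals
        #    unknown data, so no initial synchronization is performed.
        sndradm -E primary-host /dev/rdsk/c1t4d0s0 /dev/rdsk/c1t5d0s0 \
                   secondary-host /dev/rdsk/c1t4d0s0 /dev/rdsk/c1t5d0s0 \
                   ip async

        # 2. Now give the volume to ZFS as the replacement; only the
        #    metadata and data blocks ZFS rewrites during the resilver
        #    are replicated.
        zpool replace tank c1t4d0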

> Again: And what makes you think that I said that AVS is the problem  
> here?
>
>>> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not
>>> be imported on the secondary node during the replication.
>> This behavior is the same as with a zpool on dual-ported or SAN  
>> storage, and is NOT specific to AVS.
>
> Again: And what makes you think that I said that AVS is the problem  
> here? We are not on avs-discuss, Jim.

Regarding your association of "ZFS & AVS & X4500": this is purely a
ZFS issue.

The problem at hand is that a ZFS storage pool cannot be concurrently
accessed by two or more instances of ZFS. This is true for both shared
storage and replicated storage, and it remains true even if one
instance of ZFS would be operating in read-only mode.


>> I don't understand the relevance to AVS in the prior three  
>> paragraphs?
>
> We are not on avs-discuss, Jim. The customer wanted to know what  
> drawbacks exist in his *scenario*. Not AVS.
>
>>> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft
>>> partitions). They weren't enough; the replication was still very
>>> slow, probably because of an insane amount of head movement, and it
>>> scales badly. Putting the bitmap of a drive on the drive itself (if
>>> I remember correctly, this is recommended in one of the most
>>> referenced howto blog articles) is a bad idea. Always use ZFS on
>>> whole disks, if performance and caching matter to you.
>> When you have the time, can you replace the "probably because  
>> of ... " with some real performance numbers?
>
> No problem. If you please organize a Try&Buy of two X4500 servers
> being sent to my address, thank you.

Done:

        http://blogs.sun.com/AVS/entry/sun_storagetek_availability_suite_4
        http://www.sun.com/tryandbuy/specialoffers.jsp


>>> - AVS seems to require additional shared storage when building
>>> failover clusters with 48 TB of internal storage. That may be hard
>>> to explain to the customer. But I'm not 100% sure about this,
>>> because I just didn't find a way; I didn't ask on a mailing list
>>> for help.
>> When you have the time, can you replace the "AVS seems to ... "
>> with some specific references to what you are referring to?
>
> The installation and configuration process and the location where
> AVS wants to store the shared database. I can tell you details about
> it the next time I give it a try. Until then, please read the last
> sentence you quoted once more, thank you.

The design of AVS in a failover Sun Cluster requires shared access to
AVS's cluster-wide configuration data. This data is fixed at ~16.5 MB,
and must be contained on a single volume that can be concurrently
accessed by all nodes in the Sun Cluster. At the time AVS was enhanced
to support Sun Cluster, various options were taken under
consideration; this was the design selected, such as it may be.
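
For what it's worth, a sketch of how that shared location gets
specified in practice (the did device path below is just a
placeholder, and the exact dscfg invocation may vary by AVS release):

        # Point AVS at the ~16.5 MB cluster-wide configuration database,
        # hosted on a volume every node in the Sun Cluster can access.
        dscfg -s /dev/did/rdsk/d5s7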

FWIW: Across all of Solaris, there are various methods of maintaining
persistent configuration data. Sun Cluster uses its CCR database, SVM
uses its metadb database, Solaris is starting to use its SCF database
(part of SMF); the list goes on and on. The AVS developers approached
the Sun Cluster developers asking to use their CCR database mechanism,
but at the time the answer was no. At this time it would be hard to
reconsider this position.


>>> If you want a fail-over solution for important data, use the
>>> external JBODs. Use AVS only to mirror complete clusters, don't use
>>> it to replicate single boxes with local drives. And, in case
>>> OpenSolaris is not an option for you due to your company policies
>>> or support contracts, building a real cluster is also A LOT
>>> cheaper.
>> You are offering up these position statements based on what?
>
> My outline agreements, my support contracts, partner web desk and  
> finally my experience with projects in high availability scenarios  
> with tens of thousands of servers.
>
> Jim, it's okay. I know that you're a project leader at Sun
> Microsystems and that AVS is your main concern. But if there's one
> thing I cannot stand, it's getting stroppy replies from someone who
> should know better and should have realized that he's acting publicly
> and in front of the people who finance his income, instead of trying
> to start a flame war. From now on, I leave the rest to you, because I
> earn my living with products of Sun Microsystems, too, and I don't
> want to damage either Sun or this mailing list.

My reasoning for posting not only the original reply but also this
subsequent one is that AVS is constantly bombarded with "war wounds",
when in fact many of these stories exist due in part to the fact that
developing and deploying disaster recovery or high availability
solutions is not easy. ZFS is the new "battlefront", providing
opportunities to learn about ZFS, AVS and other replication
technologies. In their day, similar "war wounds" and successful
"battles" were had regarding AVS in use with UFS, QFS, VxFS, SVM,
VxVM, Oracle, Sybase and others.

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.


>
> -- 
>
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA
>
> Tel. +49-721-91374-3963
> [EMAIL PROTECTED] - http://web.de/
>
> 1&1 Internet AG
> Brauerstraße 48
> 76135 Karlsruhe
>
> Amtsgericht Montabaur HRB 6484
>
> Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas  
> Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver  
> Mauss, Achim Weiss
> Aufsichtsratsvorsitzender: Michael Scheeren


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
