I have set up a replicated, four-node Gluster configuration for a web farm. The idea is that each web node is also a Gluster server and holds its own local copy of the entire web root. Each node then mounts the replicated volume from itself. We're running it over dual bonded GigE NICs.

The problem I am having is that when we switch live traffic to nodes in the cluster, they almost immediately get out of sync. The issue seems to be with cache files that are read and written frequently. Here is an excerpt pointing to issues with our OpenX banner cache:

[2012-02-25 18:53:04.198326] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal failed on /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php
[2012-02-25 18:53:04.199191] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid differs on subvolume 0 (53fa373a-3830-4c5e-aa22-6ed35c947d97, c12e0cdd-9b6c-4988-b793-819db0472780)
[2012-02-25 18:53:04.199210] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid differs on subvolume 0 (53fa373a-3830-4c5e-aa22-6ed35c947d97, c12e0cdd-9b6c-4988-b793-819db0472780)
[2012-02-25 18:53:04.199219] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid different on subvolume
[2012-02-25 18:53:04.199236] I [afr-common.c:1038:afr_launch_self_heal] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal triggered. path: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php
[2012-02-25 18:53:04.200752] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid differs on subvolume 0 (53fa373a-3830-4c5e-aa22-6ed35c947d97, c12e0cdd-9b6c-4988-b793-819db0472780)
[2012-02-25 18:53:04.200971] I [afr-self-heal-common.c:963:afr_sh_missing_entries_done] 0-web-pub-replicate-0: split brain found, aborting selfheal of /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php
[2012-02-25 18:53:04.200986] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal failed on /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php
[2012-02-25 18:53:04.202159] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 1 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.202178] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 1 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.202188] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid different on subvolume
[2012-02-25 18:53:04.202204] I [afr-common.c:1038:afr_launch_self_heal] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal triggered. path: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.203463] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.203678] I [afr-self-heal-common.c:963:afr_sh_missing_entries_done] 0-web-pub-replicate-0: split brain found, aborting selfheal of /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.203693] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal failed on /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.204759] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.204781] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.204800] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid different on subvolume
[2012-02-25 18:53:04.204818] I [afr-common.c:1038:afr_launch_self_heal] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal triggered. path: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.206150] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.206384] I [afr-self-heal-common.c:963:afr_sh_missing_entries_done] 0-web-pub-replicate-0: split brain found, aborting selfheal of /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.206400] E [afr-self-heal-common.c:2074:afr_self_heal_completion_cbk] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal failed on /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.207725] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.207746] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
[2012-02-25 18:53:04.207756] W [afr-common.c:882:afr_detect_self_heal_by_iatt] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid different on subvolume
[2012-02-25 18:53:04.207772] I [afr-common.c:1038:afr_launch_self_heal] 0-web-pub-replicate-0: background meta-data data missing-entry self-heal triggered. path: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php
[2012-02-25 18:53:04.209217] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 0 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
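
Because the same warnings repeat many times per file, it may help to tally them per path. The following is a rough Python sketch, assuming the client log keeps the `afr_conflicting_iattrs` message format shown in the excerpt above; the function name `summarize_gfid_conflicts` and the `SAMPLE` text are illustrative only and not part of any Gluster tooling.

```python
import re
from collections import defaultdict

# Matches the "gfid differs" warnings in the format seen in the excerpt.
# Assumption: other Gluster versions may word this message differently.
GFID_RE = re.compile(
    r"afr_conflicting_iattrs\] \S+: (?P<path>/\S+): gfid differs on subvolume "
    r"(?P<subvol>\d+) \((?P<a>[0-9a-f-]{36}), (?P<b>[0-9a-f-]{36})\)"
)


def summarize_gfid_conflicts(log_text):
    """Return {path: {"gfids": set of gfids seen, "warnings": count}}."""
    summary = defaultdict(lambda: {"gfids": set(), "warnings": 0})
    # finditer works whether entries are one per line or run together.
    for match in GFID_RE.finditer(log_text):
        entry = summary[match.group("path")]
        entry["gfids"].update((match.group("a"), match.group("b")))
        entry["warnings"] += 1
    return dict(summary)


# Three warnings copied from the excerpt, used only as demo input.
SAMPLE = """\
[2012-02-25 18:53:04.199191] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid differs on subvolume 0 (53fa373a-3830-4c5e-aa22-6ed35c947d97, c12e0cdd-9b6c-4988-b793-819db0472780)
[2012-02-25 18:53:04.199210] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f8e7a8862cb80b4933c58acdf65aaef5.php: gfid differs on subvolume 0 (53fa373a-3830-4c5e-aa22-6ed35c947d97, c12e0cdd-9b6c-4988-b793-819db0472780)
[2012-02-25 18:53:04.202159] W [afr-common.c:1121:afr_conflicting_iattrs] 0-web-pub-replicate-0: /cust/site1/www/openx/var/cache/deliverycache_f901ff39b456df599289c590ed89b19d.php: gfid differs on subvolume 1 (375e1754-0420-4e26-9176-bb2128c6596b, 3e9eca35-3351-450e-b8ab-c62785968953)
"""

if __name__ == "__main__":
    for path, info in sorted(summarize_gfid_conflicts(SAMPLE).items()):
        print(info["warnings"], "warnings,", len(info["gfids"]), "distinct gfids:", path)
```

Feeding the full client log to the function instead of `SAMPLE` would list every file whose copies carry conflicting gfids, which is the set of paths where self-heal is aborting.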

Nodes and network are fine. I have tried mounting the volumes with both the Gluster native client and the Gluster NFS client, but I get the same results. It's killing performance.

Here is the config:

volume web-pub-client-0
    type protocol/client
    option remote-host web-web1
    option remote-subvolume /glusterfs/pub
    option transport-type tcp
end-volume

volume web-pub-client-1
    type protocol/client
    option remote-host web-web2
    option remote-subvolume /glusterfs/pub
    option transport-type tcp
end-volume

volume web-pub-client-2
    type protocol/client
    option remote-host web-web3
    option remote-subvolume /glusterfs/pub
    option transport-type tcp
end-volume

volume web-pub-client-3
    type protocol/client
    option remote-host web-web4
    option remote-subvolume /glusterfs/pub
    option transport-type tcp
end-volume

volume web-pub-replicate-0
    type cluster/replicate
    subvolumes web-pub-client-0 web-pub-client-1 web-pub-client-2 web-pub-client-3
end-volume

volume web-pub-write-behind
    type performance/write-behind
    subvolumes web-pub-replicate-0
end-volume

volume web-pub-read-ahead
    type performance/read-ahead
    subvolumes web-pub-write-behind
end-volume

volume web-pub-io-cache
    type performance/io-cache
    option cache-size 256MB
    subvolumes web-pub-read-ahead
end-volume

volume web-pub-quick-read
    type performance/quick-read
    option cache-size 256MB
    subvolumes web-pub-io-cache
end-volume

volume web-pub
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes web-pub-quick-read
end-volume

volume nfs-server
    type nfs/server
    option nfs.dynamic-volumes on
    option rpc-auth.addr.web-pub.allow *
    option nfs3.web-pub.volume-id ac556d2e-e8a9-4857-bd17-cab603820fcb
    subvolumes web-pub
end-volume
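
To make the layout in this config easier to check at a glance, here is a small Python sketch that reads a volfile in the `volume ... end-volume` form above and reports how many subvolumes sit under each `cluster/replicate` translator. It is only an illustration: `parse_volfile`, `replica_counts` and the shortened `SAMPLE` fragment are my own names and data, and the parser handles just the `type` and `subvolumes` keywords. It also tolerates the `NN:` line-number prefixes that Gluster adds when it dumps the volfile into a log.

```python
import re


def parse_volfile(text):
    """Parse volume blocks into {name: {"type": str, "subvolumes": [str]}}."""
    volumes = {}
    current = None
    for raw in text.splitlines():
        # Drop an optional "NN:" line-number prefix, then surrounding space.
        line = re.sub(r"^\s*\d+:\s*", "", raw).strip()
        if not line:
            continue
        words = line.split()
        if words[0] == "volume" and len(words) == 2:
            current = {"type": None, "subvolumes": []}
            volumes[words[1]] = current
        elif words[0] == "end-volume":
            current = None
        elif current is not None and words[0] == "type" and len(words) == 2:
            current["type"] = words[1]
        elif current is not None and words[0] == "subvolumes":
            current["subvolumes"] = words[1:]
    return volumes


def replica_counts(volumes):
    """Return {replicate volume name: number of subvolumes it mirrors across}."""
    return {
        name: len(vol["subvolumes"])
        for name, vol in volumes.items()
        if vol["type"] == "cluster/replicate"
    }


# Shortened fragment of the config above, used only as demo input.
SAMPLE = """\
volume web-pub-client-0
    type protocol/client
    option remote-host web-web1
end-volume

volume web-pub-replicate-0
    type cluster/replicate
    subvolumes web-pub-client-0 web-pub-client-1 web-pub-client-2 web-pub-client-3
end-volume
"""

if __name__ == "__main__":
    print(replica_counts(parse_volfile(SAMPLE)))
```

For the config above this reports a single replicate translator with four subvolumes, i.e. every file is mirrored to all four nodes.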


Any ideas or help would be greatly appreciated.

sean

--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203


_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
