Hi,

In addition to all the split-brains reported, is it normal to see thousands 
upon thousands (several tens, even hundreds, of thousands) of broken symlinks 
when browsing the .glusterfs directory on each brick? 
For the moment, I have only synchronized one remote directory (around 30TB and 
a few million files) into my new volume. No other file operations have been 
performed on this volume yet.
How can I fix it? Can I delete these dead symlinks? How can I fix all my 
split-brains? 
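
To put a number on "thousands and thousands", a rough sketch like the one below could be run against each brick. It assumes Python 3 is available on the storage node; the function name is hypothetical and nothing here is a Gluster tool. It only reports symlinks whose target does not resolve and deletes nothing, since whether these links are safe to remove is exactly the open question.

```python
import os


def find_broken_symlinks(root):
    """Yield (link_path, target) for each symlink under root that does not resolve.

    Read-only: nothing is modified or deleted.
    """
    # followlinks=False keeps the walk from descending into symlinked directories.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Symlinks pointing at directories are listed in dirnames,
        # dangling ones in filenames, so both lists are checked.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it
            # and is False when the target is missing.
            if os.path.islink(path) and not os.path.exists(path):
                yield path, os.readlink(path)
```

For example, `sum(1 for _ in find_broken_symlinks("/export/brick_home/brick1/data/.glusterfs"))` would give the count for the brick shown in the listing below.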

Here is an example of an ls listing:
[root@cl-storage3 ~]# cd /export/brick_home/brick1/data/.glusterfs/7b/d2/
[root@cl-storage3 d2]# ll
total 8,7M
     13706 drwx------   2 root      root            8,0K 26 juil. 17:22 .
2147483784 drwx------ 258 root      root            8,0K 20 juil. 23:07 ..
2148444137 -rwxrwxrwx   2 baaden    baaden_team     173K 22 mai    2008 7bd200dd-1774-4395-9065-605ae30ec18b
   1559384 -rw-rw-r--   2 tarus     amyloid_team    4,3K 19 juin   2013 7bd2155c-7a05-4edc-ae77-35ed7e16afbc
    287295 lrwxrwxrwx   1 root      root              58 20 juil. 23:38 7bd2370a-100b-411e-89a4-d184da9f0f88 -> ../../a7/59/a759de6f-cdf5-43dd-809a-baf81d103bf7/prop-base
2149090201 -rw-rw-r--   2 tarus     amyloid_team     76K  8 mars   2014 7bd2497f-d24b-4b19-a1c5-80a4956e56a1
2148561174 -rw-r--r--   2 tran      derreumaux_team  575 14 févr. 07:54 7bd25db0-67f5-43e5-a56a-52cf8c4c60dd
   1303943 -rw-r--r--   2 tran      derreumaux_team  576 10 févr. 06:06 7bd25e97-18be-4faf-b122-5868582b4fd8
   1308607 -rw-r--r--   2 tran      derreumaux_team 414K 16 juin  11:05 7bd2618f-950a-4365-a753-723597ef29f5
     45745 -rw-r--r--   2 letessier admin_team       585  5 janv.  2012 7bd265c7-e204-4ee8-8717-e4a0c393fb0f
2148144918 -rw-rw-r--   2 tarus     amyloid_team    107K 28 févr.  2014 7bd26c5b-d48a-481a-9ca6-2dc27768b5ad
     13705 -rw-rw-r--   2 tarus     amyloid_team     25K  4 juin   2014 7bd27e4c-46ba-4f21-a766-389bfa52fd78
   1633627 -rw-rw-r--   2 tarus     amyloid_team     75K 12 mars   2014 7bd28631-90af-4c16-8ff0-c3d46d5026c6
   1329165 -rw-r--r--   2 tran      derreumaux_team  175 15 juin  23:40 7bd2957e-a239-4110-b3d8-b4926c7f060b
    797803 lrwxrwxrwx   2 baaden    baaden_team       26  2 avril  2007 7bd29933-1c80-4c6b-ae48-e64e4da874cb -> ../divided/a7/2a7o.pdb1.gz
   1532463 -rw-rw-rw-   2 baaden    baaden_team     1,8M  2 nov.   2009 7bd29d70-aeb4-4eca-ac55-fae2d46ba911
   1411112 -rw-r--r--   2 sterpone  sterpone_team   3,1K  2 mai    2012 7bd2a5eb-62a4-47fc-b149-31e10bd3c33d
2148865896 -rw-r--r--   2 tran      derreumaux_team 2,1M 15 juin  23:46 7bd2ae9c-18ca-471f-a54a-6e4aec5aea89
2148762578 -rw-rw-r--   2 tarus     amyloid_team    154K 11 mars   2014 7bd2b7d7-7745-4842-b7b4-400791c1d149
    149216 -rw-r--r--   2 vamparys  sacquin_team    241K 17 mai    2013 7bd2ba98-6a42-40ea-87ea-acb607d73cb5
2148977923 -rwxr-xr-x   2 murail    baaden_team      23K 18 juin   2012 7bd2cf57-19e7-451c-885d-fd02fd988d43
   1176623 -rw-rw-r--   2 tarus     amyloid_team    227K  8 mars   2014 7bd2d92c-7ec8-4af8-9043-49d1908a99dc
   1172122 lrwxrwxrwx   2 sterpone  sterpone_team     61 17 avril 12:49 7bd2d96e-e925-45f0-a26a-56b95c084122 -> ../../../../../src/libs/ck-libs/ParFUM-Tops-Dev/ParFUM_TOPS.h
   1385933 -rw-r--r--   2 tran      derreumaux_team 2,9M 16 juin  05:29 7bd2df54-17d2-4644-96b7-f8925a67ec1e
    745899 lrwxrwxrwx   1 root      root              58 22 juil. 09:50 7bd2df83-ce58-4a17-aca8-a32b71e953d4 -> ../../5c/39/5c39010f-fa77-49df-8df6-8d72cf74fd64/model_009
2149100186 -rw-rw-r--   2 tarus     amyloid_team    494K 17 mars   2014 7bd2e865-a2f4-4d90-ab29-dccebe2e3440
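
Every entry in the listing above sits under `.glusterfs/7b/d2/` and has a name starting with `7bd2`, which suggests the layout `.glusterfs/<first two hex digits>/<next two hex digits>/<gfid>`. Assuming that layout holds for every brick, a small helper (hypothetical name, not part of Gluster) could turn a GFID taken from the mount log into the path to inspect on a brick:

```python
import os
import uuid


def gfid_to_brick_path(brick_root, gfid):
    """Return the assumed .glusterfs path of a GFID on a given brick.

    Assumes the .glusterfs/<xx>/<yy>/<gfid> layout inferred from the listing.
    """
    # uuid.UUID rejects malformed input and normalizes to lowercase with hyphens.
    canonical = str(uuid.UUID(gfid))
    return os.path.join(brick_root, ".glusterfs",
                        canonical[:2], canonical[2:4], canonical)
```

Called with `/export/brick_home/brick1/data` and `7bd200dd-1774-4395-9065-605ae30ec18b`, it returns the path of the first regular file in the listing.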



Best.
Geoffrey
------------------------------------------------------
Geoffrey Letessier
IT manager & systems engineer
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: [email protected]

On 27 Jul 2015, at 22:57, Geoffrey Letessier <[email protected]> 
wrote:

> Dear all,
> 
> For several weeks (more than one month), our computing production has been 
> stopped due to several surprising problems with GlusterFS. 
> 
> After noticing a major problem with incorrect quota sizes accounted for 
> many files, I decided, under the guidance of the Gluster support team, to 
> upgrade my storage cluster from version 3.5.3 to the latest (3.7.2-3), because 
> these bugs are theoretically fixed in this branch. Since this upgrade, 
> everything is a mess and I cannot restart production.
> Specifically:
>       1 - The RDMA protocol is not working and hangs my system / shell 
> commands; only the TCP protocol (over InfiniBand) is more or less 
> operational. This is not a blocking point, but… 
>       2 - Read/write performance is relatively low.
>       3 - Thousands of split-brains have appeared.
> 
> So, for the moment, I believe GlusterFS 3.7 is not really production-ready. 
> 
> Concerning the third point: after destroying all my volumes (RAID 
> re-init, new partitions, GlusterFS volumes, etc.) and recreating the main 
> one, I tried to transfer my data back from the archive/backup server into 
> this new volume, and I noticed a lot of errors in my mount log file, as you 
> can see in this extract:
> [2015-07-26 22:35:16.962815] I 
> [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: 
> performing entry selfheal on 865083fa-984e-44bd-aacf-b8195789d9e0
> [2015-07-26 22:35:16.965896] E 
> [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 
> 0-vol_home-replicate-0: Gfid mismatch detected for 
> <865083fa-984e-44bd-aacf-b8195789d9e0/job.pbs>, 
> e944d444-66c5-40a4-9603-7c190ad86013 on vol_home-client-1 and 
> 820f9bcc-a0f6-40e0-bcec-28a76b4195ea on vol_home-client-0. Skipping 
> conservative merge on the file.
> [2015-07-26 22:35:16.975206] I 
> [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: 
> performing entry selfheal on 29382d8d-c507-4d2e-b74d-dbdcb791ca65
> [2015-07-26 22:35:28.719935] E 
> [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 
> 0-vol_home-replicate-0: Gfid mismatch detected for 
> <29382d8d-c507-4d2e-b74d-dbdcb791ca65/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt>,
>  951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and 
> 5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0. Skipping 
> conservative merge on the file.
> [2015-07-26 22:35:29.764891] I 
> [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: 
> performing entry selfheal on 865083fa-984e-44bd-aacf-b8195789d9e0
> [2015-07-26 22:35:29.768339] E 
> [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 
> 0-vol_home-replicate-0: Gfid mismatch detected for 
> <865083fa-984e-44bd-aacf-b8195789d9e0/job.pbs>, 
> e944d444-66c5-40a4-9603-7c190ad86013 on vol_home-client-1 and 
> 820f9bcc-a0f6-40e0-bcec-28a76b4195ea on vol_home-client-0. Skipping 
> conservative merge on the file.
> [2015-07-26 22:35:29.775037] I 
> [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: 
> performing entry selfheal on 29382d8d-c507-4d2e-b74d-dbdcb791ca65
> [2015-07-26 22:35:29.776857] E 
> [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 
> 0-vol_home-replicate-0: Gfid mismatch detected for 
> <29382d8d-c507-4d2e-b74d-dbdcb791ca65/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt>,
>  951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and 
> 5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0. Skipping 
> conservative merge on the file.
> [2015-07-26 22:35:29.800535] W [MSGID: 108008] 
> [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check] 
> 0-vol_home-replicate-0: GFID mismatch for 
> <gfid:29382d8d-c507-4d2e-b74d-dbdcb791ca65>/res_1BVK_r_u_1IBR_l_u_Cond.1IBR_l_u.1BVK_r_u.UB.global.dat.txt
>  951c5ffb-ca38-4630-93f3-8e4119ab0bd8 on vol_home-client-1 and 
> 5ae663ca-e896-4b92-8ec5-5b15422ab861 on vol_home-client-0
> 
> And when I try to browse some folders (still from the mount log file):
> [2015-07-27 09:00:19.005763] I 
> [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: 
> performing entry selfheal on 2ac27442-8be0-4985-b48f-3328a86a6686
> [2015-07-27 09:00:22.322316] E 
> [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 
> 0-vol_home-replicate-0: Gfid mismatch detected for 
> <2ac27442-8be0-4985-b48f-3328a86a6686/md0012588.gro>, 
> 9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and 
> 1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0. Skipping 
> conservative merge on the file.
> [2015-07-27 09:00:23.008771] I 
> [afr-self-heal-entry.c:565:afr_selfheal_entry_do] 0-vol_home-replicate-0: 
> performing entry selfheal on 2ac27442-8be0-4985-b48f-3328a86a6686
> [2015-07-27 08:59:50.359187] W [MSGID: 108008] 
> [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check] 
> 0-vol_home-replicate-0: GFID mismatch for 
> <gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0012588.gro 
> 9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and 
> 1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0
> [2015-07-27 09:00:02.500419] W [MSGID: 108008] 
> [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check] 
> 0-vol_home-replicate-0: GFID mismatch for 
> <gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0012590.gro 
> b22aec09-2be3-41ea-a976-7b8d0e6f61f0 on vol_home-client-1 and 
> ec100f9e-ec48-4b29-b75e-a50ec6245de6 on vol_home-client-0
> [2015-07-27 09:00:02.506925] W [MSGID: 108008] 
> [afr-self-heal-name.c:353:afr_selfheal_name_gfid_mismatch_check] 
> 0-vol_home-replicate-0: GFID mismatch for 
> <gfid:2ac27442-8be0-4985-b48f-3328a86a6686>/md0009059.gro 
> 0485c093-11ca-4829-b705-e259668ebd8c on vol_home-client-1 and 
> e83a492b-7f8c-4b32-a76e-343f984142fe on vol_home-client-0
> [2015-07-27 09:00:23.001121] W [MSGID: 108008] 
> [afr-read-txn.c:241:afr_read_txn] 0-vol_home-replicate-0: Unreadable 
> subvolume -1 found with event generation 2. (Possible split-brain)
> [2015-07-27 09:00:26.231262] E 
> [afr-self-heal-entry.c:249:afr_selfheal_detect_gfid_and_type_mismatch] 
> 0-vol_home-replicate-0: Gfid mismatch detected for 
> <2ac27442-8be0-4985-b48f-3328a86a6686/md0012588.gro>, 
> 9c635868-054b-4a13-b974-0ba562991586 on vol_home-client-1 and 
> 1943175c-b336-4b33-aa1c-74a1c51f17b9 on vol_home-client-0. Skipping 
> conservative merge on the file.
> 
> And, above all, when browsing folders I get a lot of input/output errors.
> 
> Currently I have 6.2M inodes and roughly 30TB in my "new" volume.
> 
> For the moment, quota is disabled to improve I/O performance during the 
> transfer back… 
> 
> You can also find attached:
>       - an "ls" result
>       - a split-brain search result
>       - the volume information and status
>       - a complete volume heal info
> 
> I hope this helps you help me fix all my problems and restart computing 
> production.
> 
> Thanks in advance,
> Geoffrey
> 
> PS: « Erreur d’Entrée/Sortie » = « Input / Output Error » 
> 
> <ls_example.txt>
> <split_brain__20150725.txt>
> <vol_home_healinfo.txt>
> <vol_home_info.txt>
> <vol_home_status.txt>

_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
