Hi,

I've got a two-node MMR setup using syncrepl running 2.4.36 with back-bdb.

Last night, between 1/3 and 1/2 of the entries in our directory disappeared.  
Fortunately I have olcLogLeve=sync set, so I think I have a line on what 
happened...

I see this at 4AM long of the morning before the fun started:

Jul  7 04:19:30 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of 
available lock entries
Jul  7 04:19:30 zoot slapd[15011]: => bdb_idl_insert_key: c_get failed: Cannot 
allocate memory (12)

Then many hours later:
Jul  7 15:04:30 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of 
available lock entries
Jul  7 15:04:30 zoot slapd[15011]: => bdb_idl_delete_key: c_get failed: Cannot 
allocate memory (12)
Jul  7 15:04:30 zoot slapd[15011]: null_callback : error code 0x50
Jul  7 15:04:30 zoot slapd[15011]: slap_graduate_commit_csn: removing 
0x7fa33c248a80 20140707190424.188580Z#000000#001#000000
Jul  7 15:04:30 zoot slapd[15011]: syncrepl_del_nonpresent: rid=001 be_delete 
cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)

This repeats itself several times:
Jul  7 16:41:09 zoot slapd[15011]: syncprov_matchops: skipping original sid 001
Jul  7 16:41:09 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of 
available lock entries
Jul  7 16:41:09 zoot slapd[15011]: => bdb_idl_delete_key: c_get failed: Cannot 
allocate memory (12)
Jul  7 16:41:09 zoot slapd[15011]: null_callback : error code 0x50
Jul  7 16:41:09 zoot slapd[15011]: slap_graduate_commit_csn: removing 
0x7fa37e4b4260 20140707204059.442906Z#000000#001#000000
Jul  7 16:41:09 zoot slapd[15011]: syncrepl_del_nonpresent: rid=001 be_delete 
cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)

Jul  7 16:52:44 zoot slapd[2179]: bdb(dc=clarku,dc=edu): Lock table is out of 
available lock entries
Jul  7 16:52:44 zoot slapd[2179]: => bdb_idl_delete_key: c_get failed: Cannot 
allocate memory (12)
Jul  7 16:52:44 zoot slapd[2179]: null_callback : error code 0x50
Jul  7 16:52:44 zoot slapd[2179]: slap_graduate_commit_csn: removing 
0x7f74581dae70 20140707205225.029175Z#000000#001#000000
Jul  7 16:52:44 zoot slapd[2179]: syncrepl_del_nonpresent: rid=001 be_delete 
cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
Jul  7 16:52:44 zoot slapd[2179]: slap_queue_csn: queing 0x7f7458be8470 
20140707205225.029175Z#000000#001#000000
Jul  7 16:52:44 zoot slapd[2179]: slap_graduate_commit_csn: removing 
0x7f74581ad2e0 20140707205225.029175Z#000000#001#000000

Then we get (possibly unrelated) a lengthy list of nonpresent_callbacks (tons 
more then I list here):
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 
5111f346-871c-1033-8d4b-f5dbd10b8ca0, dn 
uid=rvenkatachalam,ou=Users,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 
1e3d94e4-8ce8-1033-8d6a-f5dbd10b8ca0, dn 
cn=BIOL399-09-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 
45843aa6-91b8-1033-81a9-ff6ec337b146, dn 
cn=MSPC3090-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 
4dd09f16-921b-1033-8d8e-f5dbd10b8ca0, dn 
cn=msit3710-01-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 
0b3e8d76-9607-1033-81cf-ff6ec337b146, dn 
cn=acct5411-01-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 
a698a278-9677-1033-81ee-ff6ec337b146, dn 
cn=ESL0051-E1-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 
ccf94c1c-979c-1033-8dc9-f5dbd10b8ca0, dn 
uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu

Followed by another lock table / del_nonpresent:
Jul  7 17:06:24 zoot slapd[2179]: bdb(dc=clarku,dc=edu): Lock table is out of 
available lock entries
Jul  7 17:06:24 zoot slapd[2179]: => bdb_idl_delete_key: c_del id failed: 
Cannot allocate memory (12)
Jul  7 17:06:24 zoot slapd[2179]: null_callback : error code 0x50
Jul  7 17:06:24 zoot slapd[2179]: slap_graduate_commit_csn: removing 
0x7f73e016cb40 20140707210614.062782Z#000000#001#000000
Jul  7 17:06:24 zoot slapd[2179]: syncrepl_del_nonpresent: rid=001 be_delete 
cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
Jul  7 17:06:24 zoot slapd[2179]: slap_queue_csn: queing 0x7f73e0f0eaa0 
20140707210614.062782Z#000000#001#000000
Jul  7 17:06:24 zoot slapd[2179]: slap_graduate_commit_csn: removing 
0x7f73e0f180c0 20140707210614.062782Z#000000#001#000000

That repeats itself @ 18:07, 18:47, 18:55.  Then the train departs for crazy 
town... notice this is the other node:

ul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent 
UUID 0e6fea9e-9279-1033-8d97-f5dbd10b8ca0, dn 
cn=MSPC3070-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent 
UUID f710e364-958c-1033-8da2-f5dbd10b8ca0, dn 
cn=MKT5900-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent 
UUID 815ece60-965e-1033-8dab-f5dbd10b8ca0, dn 
cn=ESL0067-M1-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent 
UUID f9d8a408-96cf-1033-8db0-f5dbd10b8ca0, dn 
cn=eng2950-01-su14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent 
UUID 480d8cbc-96d3-1033-8dbf-f5dbd10b8ca0, dn 
cn=esl0011-e1-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent 
UUID ccf94c1c-979c-1033-8dc9-f5dbd10b8ca0, dn 
uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: slap_graduate_commit_csn: removing 
0x7f59c8eddc00 20140707225554.051198Z#000000#002#000000
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
cn=MPA3950-03-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
cn=MSIT3950-0R-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
uid=zbamberger,ou=Users,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
uid=hmunawar,ou=Users,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
cn=scrn2940-01-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
cn=eng2940-02-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
cn=id2940-02-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
cn=msit3950-0p-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 
002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete 
cn=mspc3060-0p-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)

That goes on, deleting things, until my pager exploded when some critical role 
accounts got nuked and our radius servers died.  I pulled an LDIF when I 
realized what was happening and between 18:55 and 19:35, our LDIF went from 
75MB to 45MB - so about 30MB of stuff disappeared.

What happened?  Did I get hit by an oddball BDB/Syncrepl behavior?  Do I have a 
bad misconfiguration in my replication setup?

Thanks for your time,

Aaron
----
Manager of Systems Administration
Clark University ITS

Reply via email to