Rein Tollevik wrote: > On Tue, 25 Mar 2008, [EMAIL PROTECTED] wrote: > It now looks to me as if the entire rs->sr_entry was released an reused, > and that the bug probably is in back-bdb. It just always happened when > syncprov was running as this is a master server with mostly writes > except from the reads syncprov does. Which also means that the title on > the bug report is probably misleading :-( > > The reason for my shift of focus is some highly suspicious variables > found in the bdb_search() frame.
Look at the value of idflag in bdb_search. (see the comment around 672) Could be a locking bug with ID_NOCACHE in cache.c... > First, ei->bei_e seem to be NULL every time this happen, not the same as > the "e" variable as I would except from looking at the code. Second, > e->e_id is either 0 or ei->bei_id+1, the content of *e is the (more or > less completely initialized) entry following the *ei entry in the > database, while I would expect that ei->bei_id==e->e_id and that > ei->bei_e==e. I don't think the consecutive ID numbers or entries has > anything with the bug to do though, they are probably a result of the > searches reading the entire database. > > I looks to me as if the EntryInfo referred to by ei was released somehow > when slap_send_search_entry() was running, and that the ei->bei_e passed > as rs->rs_entry was first released and then reused for the next entry. > Alternatively that the structures wasn't properly initialized and > thereby appeared free to the second thread. > > I have found a core file that adds to my suspicions, this time with two > active threads. Their stack frames are shown at the end. In this core > file the id variable in the bdb_search() frame is 9318, the second > thread is loading id==9319. The value of the "e" variable in > bdb_search() for the first thread equals the "x" variable of the > entry_decode() frame in the second thread. Since entry_decode() has > just allocated its "x" value it shouldn't have found the same value as > the other thread has unless someone released its "e" entry while it was > still using it. Why this happened I haven't figured out yet. > > The cachsesize directive has its default value, which is way below the > optimal value. The result is that the cache is always full. I don't > know if that can have any influence on the bug? I tried to lower its > value to see if that should help reproducing the problem, but not so > far. > >>> I'm currently trying to gather more information related to this bug, any >>> pointers as to what I should look for is appreciated. I'm posting this bug >>> report now in the hope that the stack frame should enlighten someone with >>> better >>> knowledge of the code than what I have. >> Check for stack overruns, compile without optimization and make sure it's not >> a compiler optimization bug, etc. > > I'll try to run it with a version compiled with only -g and see if the > same thing should happen then. > -- -- Howard Chu Chief Architect, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/
