On 20/06/2014 18:16, Howard Chu wrote:
> Emmanuel Lécharny wrote:
>> On 20/06/2014 14:50, Howard Chu wrote:
>>> Emmanuel Lécharny wrote:
>>>> Hi guys,
>>>>
>>>> many thanks Kiran for the OOM fix!
>>>>
>>>> That's one step toward a fast load of big databases.
>>>>
>>>> The next steps are also critical. We are currently limited by the
>>>> memory size, as we store in memory the DNs we load. In order to go
>>>> one step further, we need to implement a system where we can process
>>>> an LDIF file with no limitation due to the available memory.
>>>>
>>>> That supposes we process the LDIF file by chunks, and once the chunks
>>>> are sorted, we process them as a whole, pulling one element from each
>>>> of the sorted lists of DNs and picking the smallest to inject it into
>>>> the BTree.
>>>
>>> Why do you store the DNs in memory? Why are you sorting them?
>>
>> We need to build the RDN index, which contains ParentIDandRDN data
>> structures, where each element is a tuple with the parentID and the
>> current RDN. That means we must have seen the parent before we can deal
>> with the children. This is why we keep the DNs in memory.
>
> Sure, the OpenLDAP backends require this too. But reading the LDIF
> twice is ugly.

This is why I suggested not to do so.

> In our bulk load we simply look up in the database/index to see if the
> parent DN exists yet, and if not, we (recursively) generate the
> parentID(s) and add them to the index.

But then you can't bulk load the ParentID index.

> We also keep an in-memory list of such missing DNs for display at the
> end of the bulk load.
>
> Later if the parent entry is actually found in the input, we simply
> store it, using the ID that was previously generated. (Looking it up
> from the RDN index is fast.) Then remove it from the list of missing DNs.
>
> In short - pretend you've already seen the parent, if you haven't
> actually. Don't worry about it until you reach the end of the input,
> then you know for sure it's really missing.

There is one important thing: we do need the RDN index (a ParentIdAndRdn
index, in fact) to be ordered, because we use it when processing one-level
or sub-tree searches. That allows us to fetch a limited number of candidates.
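
To make the ordering point concrete, here is a minimal Java sketch. It is not the actual ApacheDS or Mavibot code: the class name RdnIndexSketch, the use of long entry IDs, and the TreeMap standing in for the BTree are all assumptions made for illustration. It only shows that when keys are ordered by (parentId, RDN), all children of one parent are contiguous, so a one-level search is a single range scan and a sub-tree search is a series of such scans.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: a sorted map stands in for the on-disk BTree.
class RdnIndexSketch {
    // Index key: the parent's entry ID plus the entry's own RDN.
    static final class ParentIdAndRdn {
        final long parentId;
        final String rdn;

        ParentIdAndRdn(long parentId, String rdn) {
            this.parentId = parentId;
            this.rdn = rdn;
        }
    }

    // Order by parentId first, then by RDN, so siblings sit next to each other.
    static final Comparator<ParentIdAndRdn> ORDER =
        Comparator.<ParentIdAndRdn>comparingLong(k -> k.parentId)
                  .thenComparing(k -> k.rdn);

    private final TreeMap<ParentIdAndRdn, Long> index = new TreeMap<>(ORDER);

    void put(long parentId, String rdn, long entryId) {
        index.put(new ParentIdAndRdn(parentId, rdn), entryId);
    }

    // One-level search: a range scan over the keys sharing this parentId.
    // The empty string sorts before every other RDN, so it marks the range start.
    List<Long> children(long parentId) {
        ParentIdAndRdn from = new ParentIdAndRdn(parentId, "");
        ParentIdAndRdn to = new ParentIdAndRdn(parentId + 1, "");
        return new ArrayList<>(index.subMap(from, true, to, false).values());
    }

    // Sub-tree search: repeat the range scan for each descendant found.
    List<Long> subtree(long rootId) {
        List<Long> result = new ArrayList<>();
        Deque<Long> pending = new ArrayDeque<>();
        pending.push(rootId);
        while (!pending.isEmpty()) {
            for (long child : children(pending.pop())) {
                result.add(child);
                pending.push(child);
            }
        }
        return result;
    }
}
```

With an unordered index, children(parentId) would have to visit every key to find the matching ones, which is exactly the cost we want to avoid.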
