Emmanuel Lécharny wrote:
On 20/06/2014 14:50, Howard Chu wrote:
Emmanuel Lécharny wrote:
Hi guys,
Many thanks, Kiran, for the OOM fix!
That's one step toward fast loading of big databases.
The next steps are also critical. We are currently limited by available
memory, because we keep in memory all the DNs we load. To go one step
further, we need a way to process an LDIF file without being bounded by
the available memory.
That means processing the LDIF file in chunks: once each chunk is
sorted, we process them as a whole, pulling one element from each
sorted list of DNs and picking the smallest to inject into the
BTree.
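The scheme described here is essentially an external merge sort. A minimal Python sketch of the two phases (the chunk size, temporary-file handling, and function names are illustrative assumptions, not ApacheDS code):

```python
import heapq
import tempfile

def sort_chunks(dns, chunk_size=3):
    """Sort the input DNs in memory-sized chunks, spilling each
    sorted chunk to its own temporary file."""
    chunks = []
    for i in range(0, len(dns), chunk_size):
        chunk = sorted(dns[i:i + chunk_size])
        f = tempfile.TemporaryFile(mode="w+")
        f.write("\n".join(chunk))
        f.seek(0)
        chunks.append(f)
    return chunks

def merge_chunks(chunks):
    """Pull one element from each sorted chunk and always pick the
    smallest -- a k-way merge, done here via heapq.merge."""
    iterators = [(line.rstrip("\n") for line in f) for f in chunks]
    yield from heapq.merge(*iterators)

dns = ["dc=example,dc=com",
       "ou=people,dc=example,dc=com",
       "cn=alice,ou=people,dc=example,dc=com",
       "cn=bob,ou=people,dc=example,dc=com"]
for dn in merge_chunks(sort_chunks(dns)):
    print(dn)
```

Note that this sorts DN strings lexicographically; the merged stream is what would then be consumed to feed entries into the BTree.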
Why do you store the DNs in memory? Why are you sorting them?
We need to build the RDN index, which holds a ParentIDandRDN data
structure, where each element is a tuple of the parent ID and the
current RDN. That means we must have seen the parent before we can deal
with its children. This is why we keep the DNs in memory.
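To illustrate why the parent must be seen first: each index entry pairs the parent's entry ID with the child's RDN, so resolving a DN walks from the suffix downward, one RDN at a time. A hypothetical in-memory stand-in (the dict-based index, ID counter, and naive comma split are illustrative, not ApacheDS's actual structures):

```python
import itertools

# Hypothetical stand-in for the RDN index:
# maps (parentId, rdn) -> entryId.
rdn_index = {}
next_id = itertools.count(1)
ROOT_ID = 0  # hypothetical ID for the suffix's parent

def add_entry(dn):
    """Split the DN into RDNs, walk down from the suffix, and record
    a (parentId, rdn) tuple for the new entry.  Every ancestor must
    already be in the index -- a KeyError otherwise -- which is why
    loading is order-sensitive."""
    rdns = dn.split(",")[::-1]               # suffix first
    parent_id = ROOT_ID
    for rdn in rdns[:-1]:
        parent_id = rdn_index[(parent_id, rdn)]
    entry_id = next(next_id)
    rdn_index[(parent_id, rdns[-1])] = entry_id
    return entry_id

# Works only because parents come before children:
add_entry("o=example")
add_entry("ou=people,o=example")
add_entry("cn=alice,ou=people,o=example")
```

Feeding "cn=alice,ou=people,o=example" before "ou=people,o=example" would fail here, which is the ordering constraint the in-memory DN set is meant to satisfy.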
Sure, the OpenLDAP backends require this too. But reading the LDIF twice is
ugly.
In our bulk load we simply look up the parent DN in the database/index
to see if it exists yet, and if not, we (recursively) generate the
parentID(s) and add them to the index. We also keep an in-memory list of
such missing DNs to display at the end of the bulk load.
Later if the parent entry is actually found in the input, we simply store it,
using the ID that was previously generated. (Looking it up from the RDN index
is fast.) Then remove it from the list of missing DNs.
In short - pretend you've already seen the parent, if you haven't actually.
Don't worry about it until you reach the end of the input, then you know for
sure it's really missing.
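The single-pass scheme described above can be sketched as follows; this is a hypothetical Python model, not slapadd's actual code (the index layout, ID allocation, and naive comma split are assumptions):

```python
import itertools

rdn_index = {}            # (parentId, rdn) -> entryId
entries = {}              # entryId -> entry contents
missing = set()           # IDs generated for parents not yet seen
next_id = itertools.count(1)
ROOT_ID = 0               # hypothetical ID for the suffix's parent

def resolve(rdns):
    """Return the entry ID for a suffix-first RDN list, generating
    placeholder IDs for any ancestors not yet in the index --
    i.e. pretend we've already seen the parent."""
    parent_id = ROOT_ID
    for rdn in rdns:
        key = (parent_id, rdn)
        if key not in rdn_index:
            new_id = next(next_id)
            rdn_index[key] = new_id
            missing.add(new_id)
        parent_id = rdn_index[key]
    return parent_id

def bulk_add(dn, entry):
    """Store an entry under its resolved ID; if a placeholder ID was
    generated for it earlier, reuse it and drop it from `missing`."""
    eid = resolve(dn.split(",")[::-1])
    entries[eid] = entry
    missing.discard(eid)

# Children may arrive before their parent:
bulk_add("cn=alice,ou=people,o=example", {"cn": "alice"})
bulk_add("ou=people,o=example", {"ou": "people"})
bulk_add("o=example", {"o": "example"})
# `missing` is now empty: every placeholder was eventually filled in.
```

Whatever remains in `missing` after the last input record is the list of genuinely absent parents to report at the end of the load.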
Never do the same work twice. The DB already maintains DNs in sorted order,
there's no need to explicitly sort in the bulk load tool.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/