Emmanuel Lécharny wrote:
Hi guys,

sorry for having been absent the last few weeks (side project ...).

I have spent part of the last two days reviewing the bulkLoader tool.
Currently, on my machine, I'm able to process 30 000 entries in less
than 20 seconds (around 1 600 entries added per second). That's fast
compared to directly injecting entries into a running LDAP server, but
we can do way better.

Here are some thoughts:

- first of all, we need to process the DNs before being able to inject
entries into the master table. This is required because we have to
inject the ParentID attribute into each entry, which requires that we
have the complete hierarchy available. Here, we have two options:

1) read the entries' DNs from the LDIF file first, ignoring the values. We
have a FastLdifReader class that reads the DN and skips the other
elements. We assume that we have enough memory to hold all the DNs we
read (which could be a limitation when we try to bulk load tens of
millions of entries). Once done, we can associate an ID with each RDN,
and get ready to inject those IDs into the entries, which will be done
while creating the MasterTable. What if we don't have enough memory?
Well, see case 2.

2) As we may not have enough memory to hold all the DNs, we have to do
an external sort. That is, we load a limited number of DNs, and once we
have reached the limit, we sort what we have and save the sorted result
on disk. Once we have gone through all the DNs, we end up with N files
that are each individually sorted but still need to be merged. This is
done in a second phase, by repeatedly picking the smallest of the first
DNs of the sorted files, until there is no DN left to read from any
file (see the sketch after the next paragraph).

The second option will obviously be more costly, but it's guaranteed to
work no matter how much memory we have. And if we are lucky enough to
have enough memory to hold all the elements, we simply fall back to the
first option.
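
To make it a bit more concrete, here is a rough sketch of such an
external sort. None of this is existing bulkLoader code: DnSorter,
CHUNK_LIMIT and the plain String ordering are placeholders (a real
implementation would use a hierarchical DN comparator).

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
import java.util.function.Consumer;

public class DnSorter {
    private static final int CHUNK_LIMIT = 100_000;     // max DNs held in memory at once
    private final List<String> chunk = new ArrayList<>();
    private final List<Path> spills = new ArrayList<>();

    public void add(String dn) throws IOException {
        chunk.add(dn);
        if (chunk.size() >= CHUNK_LIMIT) {
            spill();                                     // chunk is full: sort it and write it out
        }
    }

    private void spill() throws IOException {
        Collections.sort(chunk);                         // placeholder: a real DN comparator goes here
        Path file = Files.createTempFile("dn-chunk-", ".txt");
        Files.write(file, chunk);
        spills.add(file);
        chunk.clear();
    }

    /** Case 1: nothing was spilled, sort in memory. Case 2: merge the sorted spill files. */
    public void forEachSorted(Consumer<String> consumer) throws IOException {
        if (spills.isEmpty()) {
            Collections.sort(chunk);
            chunk.forEach(consumer);
            return;
        }
        spill();                                         // flush the last partial chunk
        List<BufferedReader> readers = new ArrayList<>();
        // heap of [head DN, index of the reader it came from]
        PriorityQueue<String[]> heads =
            new PriorityQueue<>(Comparator.comparing((String[] h) -> h[0]));
        for (Path file : spills) {
            BufferedReader reader = Files.newBufferedReader(file);
            readers.add(reader);
            String dn = reader.readLine();
            if (dn != null) {
                heads.add(new String[] { dn, String.valueOf(readers.size() - 1) });
            }
        }
        while (!heads.isEmpty()) {
            String[] head = heads.poll();                // smallest remaining DN across all files
            consumer.accept(head[0]);
            BufferedReader reader = readers.get(Integer.parseInt(head[1]));
            String next = reader.readLine();
            if (next != null) {
                heads.add(new String[] { next, head[1] });
            }
        }
        for (BufferedReader reader : readers) {
            reader.close();
        }
    }
}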

What we keep in memory is not only the DN but also the entryUUID, so that
we can store it into each entry later (this will be the parentId attribute).

At the same time, we need to gather all the entryUUIDs, as we need them
sorted to be able to create the masterTable (we have the same constraint
as for DNs, i.e. if we don't have enough memory, then we need to create
temporary files). We also keep each entry's position in the original
file, to be able to fetch the entries directly later.
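
In other words, something like the following hypothetical record would
be kept per entry during the first pass, instead of the full entry (all
names are illustrative):

// Hypothetical per-entry record kept during the first pass.
class EntryRef {
    final String dn;         // used to rebuild the hierarchy and resolve the parentId
    final String entryUuid;  // sorted later to build the master table
    final long offset;       // byte position of the full entry in the LDIF file

    EntryRef(String dn, String entryUuid, long offset) {
        this.dn = dn;
        this.entryUuid = entryUuid;
        this.offset = offset;
    }
}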

So, bottom line, at the end of this first phase, we will have a sorted
list of UUIDs and a hierarchically sorted set of DNs. If we are lucky,
they are in memory, and if we are very lucky (or if we don't have a lot
of entries to process), even the entries themselves will be in memory.

- once this first phase is done, we can build the MasterTable: we just
have to walk the sorted list of entryUUIDs and create the masterTable
from it. It's just a matter of fetching each entry from disk using its
position, parsing it, and storing it once we have added the parentId and
the other required system attributes (entryCSN, creatorsName, and
creationTime).
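
Roughly, in fragment form (parseEntry(), parentIdOf(), masterTable, the
helper calls and the literal values below are placeholders, not the
actual partition API):

RandomAccessFile ldif = new RandomAccessFile("entries.ldif", "r");

for (String uuid : sortedEntryUuids) {           // sorted during the first phase
    long offset = offsets.get(uuid);             // position recorded during the first phase
    ldif.seek(offset);
    Entry entry = parseEntry(ldif);              // re-read and parse only this entry

    // add the required system attributes before storing the entry
    entry.add("entryUUID", uuid);
    entry.add("entryParentId", parentIdOf(entry.getDn()));
    entry.add("entryCSN", newCsn());                        // placeholder CSN generator
    entry.add("creatorsName", "uid=admin,ou=system");
    entry.add("createTimestamp", nowAsGeneralizedTime());   // placeholder

    masterTable.put(uuid, entry);                // keys arrive already sorted
}
ldif.close();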

- at the very same time, we need to create the other indexes. For the
same reason (memory limitation), we may have to go through intermediate
files and perform an external sort. In any case, we can delegate each
index creation to a dedicated thread, to benefit from the multiple
cores a computer has, speeding up the processing. The algorithm is no
different from the one we used to create the master table: we store in
memory each value associated with the entryUUID it relates to, and once
done, we push them into a new Table.

Side note though: for indexes, we may have multiple entryUUIDs associated
with a single value (the typical case being the ObjectClass index).
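
Something along these lines, for instance (java.util.concurrent based;
indexedAttributes, ValueUuidPair, valuesFor() and writeIndexTable() are
placeholders for however the values are actually collected and flushed):

ExecutorService pool =
    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

for (String attribute : indexedAttributes) {
    pool.submit(() -> {
        // value -> all the entryUUIDs carrying that value; several UUIDs per
        // value is the normal case for something like the objectClass index
        SortedMap<String, List<String>> index = new TreeMap<>();
        for (ValueUuidPair pair : valuesFor(attribute)) {
            index.computeIfAbsent(pair.value, v -> new ArrayList<>()).add(pair.uuid);
        }
        // if the map grows too big, the same spill-and-merge trick as for DNs applies
        writeIndexTable(attribute, index);
    });
}
pool.shutdown();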

I'm not sure we can do any better to improve the performance. Using a
cache for the MasterTable should not be useful if we can't store all
the entries in memory.

Thoughts?

Using two different approaches is gratuitous complexity. Using multiple passes and external sorting is unnecessary.

UUIDs are generally not a sortable value, or not meaningful anyway. That seems like a lot of wasted effort.

I explained OpenLDAP's approach once before but will summarize it again:

It's a single-pass loader, and it works regardless of the entry order of the input LDIF. For each entry, we examine its DN and recursively look up all of its parent DNs in the RDN index, from the top down. For any missing RDN in this lookup, we create the RDN index entry anyway and generate an entryID for it at that time. We also keep an in-memory list of missing DNs.

If we encounter an entry later in the LDIF that corresponds to one of these missing DNs, the search in the RDN index will just return the entryID we already assigned to it. We then remove the DN from the missing DN list. The result is that the DB tables and entryIDs are generated in DN order even if the entries aren't ordered in the LDIF.

At the end of the run, any slots left in the missing DN list are printed out in a warning message. You can still use either the bulk loader or ldapadd to supply these entries later.

E.g., given an LDIF with these entries:

dn: dc=example,dc=com

dn: cn=joe,ou=people,dc=example,dc=com

dn: cn=foo,ou=groups,dc=example,dc=com

dn: ou=groups,dc=example,dc=com


dc=example,dc=com gets entryID 1.

cn=joe,ou=people,dc=example,dc=com causes ou=people,dc=example,dc=com to be indexed with entryID 2, marked missing. cn=joe gets entryID 3.

cn=foo,ou=groups,dc=example,dc=com causes ou=groups,dc=example,dc=com to be indexed with entryID 4, marked missing. cn=foo gets entryID 5.

ou=groups,dc=example,dc=com finds that it's already got entryID 4, gets removed from missing DN list.

ou=people,dc=example,dc=com is printed as missing. Its entryID remains reserved for it and will be used when it gets properly added.

There's no redundant parsing of the LDIF required, no multiple passes of processing. You simply use the RDN index that you're already generating to keep the entry DNs properly sorted. There's no need for external sorts and no need to worry about what fits in memory (aside from the missing DN list, which you could also maintain as an external table if you really wanted to, but in a proper LDIF there should not be very many DNs in this list anyway).

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
