Hi guys, sorry for having been absent the last few weeks (side project ...).
I have spent part of the last two days reviewing the bulkLoader tool. Currently, on my machine, I'm able to process 30 000 entries in less than 20 seconds (around 1600 added entries per second). It's fast compared to a direct injection of entries into a running LDAP server, but we can do way better. Here are some thoughts:

- First of all, we need to process the DNs before we can inject entries into the master table. This is required because we have to inject the ParentID attribute into each entry, which means we need the complete hierarchy available. Here we have two options:

1) Read the entry's DN from the LDIF file first, ignoring the values. We have a FastLdifReader class that reads the DN and skips the other elements. We assume that we have enough memory to hold all the DNs we will read (which could be a limitation when we try to bulk load tens of millions of entries). Once done, we can associate an ID with each RDN and get ready to inject those IDs into the entries, which will be done while creating the MasterTable. What if we don't have enough memory? Well, see case 2.

2) As we may not have enough memory to hold all the DNs, we have to do an external sort. I.e., we load a limited number of DNs, and once we have reached the limit, we sort what we have and save the sorted result on disk. Once we have gone through all the DNs, we end up with N files that are each individually sorted, but still need to be merged. This is done in a second phase, by repeatedly fetching the smallest DN among the heads of the sorted files, until there is no more DN to read from any file (see the merge sketch right after this list).

The second option will obviously be more costly, but it's guaranteed to work no matter how much memory we have. And if we are lucky enough to have all the elements fit in memory, we fall back to the first option.

What we keep in memory is not only the DN, but also the entryUUID, so that we can store it into each entry later (this will become the parentId attribute). At the same time, we need to gather all the entryUUIDs, as we need them sorted to be able to create the masterTable (we have the same constraint as for DNs, i.e., if we don't have enough memory, we need to create temporary files). We also keep a position in the original file so that we can fetch the entries directly later on.

So, bottom line, at the end of this first phase, we will have a sorted list of UUIDs and a hierarchically sorted set of DNs. If we are lucky, they are in memory, and if we are very lucky (or if we don't have a lot of entries to process), even the entries will be in memory.

- Once this first phase is done, we can build the MasterTable: we just have to get each entryUUID from the sorted list of entryUUIDs and create the masterTable from it. It's just a matter of fetching the entry from disk using its position, parsing it and storing it, once we have added the parentId and the other required system attributes (entryCSN, creatorsName, and creationTime).

- At the very same time, we need to create the other indexes. For the same reason (memory limitation), we may have to go through intermediate files and do an external sort. In any case, we can delegate those index creations to dedicated threads, to benefit from the multiple cores a computer has and speed up the processing (a small sketch of that fan-out follows as well). The algorithm is no different from the one we used to create the master table: we store in memory each value associated with the entryUUID it relates to, and once done, we push them into a new Table.
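To make the merge phase of option 2 a bit more concrete, here is a minimal sketch of what I have in mind, assuming each temporary file simply holds its DNs one per line, and using a plain string comparison as a stand-in for the real hierarchical DN comparator. DnRunMerger and Run are just names I picked for the illustration, they don't exist in the bulkLoader:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class DnRunMerger
{
    /** Holds the current head of one sorted run file. */
    private static class Run
    {
        final BufferedReader reader;
        String head;

        Run( BufferedReader reader ) throws IOException
        {
            this.reader = reader;
            this.head = reader.readLine();
        }

        boolean advance() throws IOException
        {
            head = reader.readLine();
            return head != null;
        }
    }

    public static void merge( List<Path> runFiles, Writer out ) throws IOException
    {
        // Order the runs by their current head (plain string comparison here;
        // the real tool would plug in a hierarchical DN comparator)
        PriorityQueue<Run> heap = new PriorityQueue<>( ( a, b ) -> a.head.compareTo( b.head ) );
        List<BufferedReader> readers = new ArrayList<>();

        try
        {
            for ( Path runFile : runFiles )
            {
                BufferedReader reader = Files.newBufferedReader( runFile );
                readers.add( reader );
                Run run = new Run( reader );

                if ( run.head != null )
                {
                    heap.add( run );
                }
            }

            // Pop the smallest head, write it out, and refill from the same run
            while ( !heap.isEmpty() )
            {
                Run smallest = heap.poll();
                out.write( smallest.head );
                out.write( '\n' );

                if ( smallest.advance() )
                {
                    heap.add( smallest );
                }
            }
        }
        finally
        {
            for ( BufferedReader reader : readers )
            {
                reader.close();
            }
        }
    }
}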
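And for the "dedicated threads for the indexes" part, a rough sketch of the fan-out using a plain ExecutorService. IndexBuilder is an imaginary interface standing in for whatever piece of the bulkLoader accumulates the <value, entryUUID> pairs and pushes them into a new Table:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexBuilder
{
    /** Stand-in for whatever builds one index from its <value, entryUUID> pairs. */
    public interface IndexBuilder
    {
        void build() throws Exception;
    }

    public static void buildAll( List<IndexBuilder> builders ) throws Exception
    {
        // One pool sized on the available cores; each index build is an independent task
        ExecutorService pool = Executors.newFixedThreadPool( Runtime.getRuntime().availableProcessors() );

        try
        {
            List<Future<Void>> futures = new ArrayList<>();

            for ( IndexBuilder builder : builders )
            {
                futures.add( pool.submit( ( Callable<Void> ) () ->
                {
                    builder.build();
                    return null;
                } ) );
            }

            // Wait for every index to be fully written before declaring the bulk load done
            for ( Future<Void> future : futures )
            {
                future.get();
            }
        }
        finally
        {
            pool.shutdown();
        }
    }
}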
Side note though: for indexes, we may have multiple entryUUIDs associated with a single value (the typical case being the ObjectClass index). I'm not sure we can do any better than this to improve the performance. A cache for the MasterTable should not be useful if we can't store all the entries in memory anyway. Thoughts?
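PS: for the multi-valued case mentioned in the side note, what I have in mind for the in-memory part is basically a sorted map of value -> sorted set of entryUUIDs, something like the snippet below (again, the names are mine, not the bulkLoader's):

import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.UUID;

public class IndexAccumulator
{
    // Keys kept sorted so the Table can be filled sequentially,
    // UUIDs kept sorted so duplicates collapse naturally
    private final SortedMap<String, Set<UUID>> values = new TreeMap<>();

    public void add( String key, UUID entryUuid )
    {
        values.computeIfAbsent( key, k -> new TreeSet<>() ).add( entryUuid );
    }

    public SortedMap<String, Set<UUID>> getValues()
    {
        return values;
    }
}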
