Hi guys, sorry for having been absent the last few weeks (side project ...).
I have spent part of the last two days reviewing the bulkLoader tool. Currently, on my machine, I'm able to process 30 000 entries in less than 20 seconds (around 1600 added entries per second). It's fast compared to a direct injection of entries into a running LDAP server, but we can do way better. Here are some thoughts:

- First of all, we need to process the DNs before we can inject entries into the master table. This is required because we have to inject the ParentID attribute into each entry, which means we need the complete hierarchy available. Here we have two options:

1) Read the entry's DN from the LDIF file first, ignoring the values. We have a FastLdifReader class that reads the DN and skips the other elements. We assume that we have enough memory to hold all the DNs we will read (which could be a limitation when we try to bulk load tens of millions of entries). Once done, we can associate an ID with each RDN and get ready to inject those IDs into the entries, which will be done while creating the MasterTable. What if we don't have enough memory? Well, see case 2.

2) As we may not have enough memory to hold all the DNs, we have to do an external sort. I.e., we load a limited number of DNs, and once we have reached the limit, we sort what we have and save the sorted result on disk. Once we have gone through all the DNs, we end up with N files that are each individually sorted, but still need to be merged. This is done in a second phase, by repeatedly fetching the smallest DN among the heads of the sorted files, until there is no more DN to read from any file (see the merge sketch right after this list).

The second option will obviously be more costly, but it's guaranteed to work no matter how much memory we have. And if we are lucky enough to have all the elements fit in memory, we fall back to the first option.

What we keep in memory is not only the DN, but also the entryUUID, so that we can store it into each entry later (this will become the parentId attribute). At the same time, we need to gather all the entryUUIDs, as we need them sorted to be able to create the masterTable (we have the same constraint as for DNs, i.e., if we don't have enough memory, we need to create temporary files). We also keep a position in the original file so that we can fetch the entries directly later on.

So, bottom line, at the end of this first phase, we will have a sorted list of UUIDs and a hierarchically sorted set of DNs. If we are lucky, they are in memory, and if we are very lucky (or if we don't have a lot of entries to process), even the entries will be in memory.

- Once this first phase is done, we can build the MasterTable: we just have to get each entryUUID from the sorted list of entryUUIDs and create the masterTable from it. It's just a matter of fetching the entry from disk using its position, parsing it and storing it, once we have added the parentId and the other required system attributes (entryCSN, creatorsName, and creationTime).

- At the very same time, we need to create the other indexes. For the same reason (memory limitation), we may have to go through intermediate files and do an external sort. In any case, we can delegate those index creations to dedicated threads, to benefit from the multiple cores a computer has and speed up the processing (a small sketch of that fan-out follows as well). The algorithm is no different from the one we used to create the master table: we store in memory each value associated with the entryUUID it relates to, and once done, we push them into a new Table.
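To make the merge phase of option 2 a bit more concrete, here is a minimal sketch of what I have in mind, assuming each temporary file simply holds its DNs one per line, and using a plain string comparison as a stand-in for the real hierarchical DN comparator. DnRunMerger and Run are just names I picked for the illustration, they don't exist in the bulkLoader:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class DnRunMerger
{
    /** Holds the current head of one sorted run file. */
    private static class Run
    {
        final BufferedReader reader;
        String head;

        Run( BufferedReader reader ) throws IOException
        {
            this.reader = reader;
            this.head = reader.readLine();
        }

        boolean advance() throws IOException
        {
            head = reader.readLine();
            return head != null;
        }
    }

    public static void merge( List<Path> runFiles, Writer out ) throws IOException
    {
        // Order the runs by their current head (plain string comparison here;
        // the real tool would plug in a hierarchical DN comparator)
        PriorityQueue<Run> heap = new PriorityQueue<>( ( a, b ) -> a.head.compareTo( b.head ) );
        List<BufferedReader> readers = new ArrayList<>();

        try
        {
            for ( Path runFile : runFiles )
            {
                BufferedReader reader = Files.newBufferedReader( runFile );
                readers.add( reader );
                Run run = new Run( reader );

                if ( run.head != null )
                {
                    heap.add( run );
                }
            }

            // Pop the smallest head, write it out, and refill from the same run
            while ( !heap.isEmpty() )
            {
                Run smallest = heap.poll();
                out.write( smallest.head );
                out.write( '\n' );

                if ( smallest.advance() )
                {
                    heap.add( smallest );
                }
            }
        }
        finally
        {
            for ( BufferedReader reader : readers )
            {
                reader.close();
            }
        }
    }
}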
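And for the "dedicated threads for the indexes" part, a rough sketch of the fan-out using a plain ExecutorService. IndexBuilder is an imaginary interface standing in for whatever piece of the bulkLoader accumulates the <value, entryUUID> pairs and pushes them into a new Table:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexBuilder
{
    /** Stand-in for whatever builds one index from its <value, entryUUID> pairs. */
    public interface IndexBuilder
    {
        void build() throws Exception;
    }

    public static void buildAll( List<IndexBuilder> builders ) throws Exception
    {
        // One pool sized on the available cores; each index build is an independent task
        ExecutorService pool = Executors.newFixedThreadPool( Runtime.getRuntime().availableProcessors() );

        try
        {
            List<Future<Void>> futures = new ArrayList<>();

            for ( IndexBuilder builder : builders )
            {
                futures.add( pool.submit( ( Callable<Void> ) () ->
                {
                    builder.build();
                    return null;
                } ) );
            }

            // Wait for every index to be fully written before declaring the bulk load done
            for ( Future<Void> future : futures )
            {
                future.get();
            }
        }
        finally
        {
            pool.shutdown();
        }
    }
}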
Side note though: for indexes, we may have multiple entryUUIDs associated with a single value (the typical case being the ObjectClass index). I'm not sure we can do any better than this to improve the performance. A cache for the MasterTable should not be useful if we can't store all the entries in memory anyway. Thoughts?
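PS: for the multi-valued case mentioned in the side note, what I have in mind for the in-memory part is basically a sorted map of value -> sorted set of entryUUIDs, something like the snippet below (again, the names are mine, not the bulkLoader's):

import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.UUID;

public class IndexAccumulator
{
    // Keys kept sorted so the Table can be filled sequentially,
    // UUIDs kept sorted so duplicates collapse naturally
    private final SortedMap<String, Set<UUID>> values = new TreeMap<>();

    public void add( String key, UUID entryUuid )
    {
        values.computeIfAbsent( key, k -> new TreeSet<>() ).add( entryUuid );
    }

    public SortedMap<String, Set<UUID>> getValues()
    {
        return values;
    }
}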
