Hi guys, I ran a small test on a 500K-entry LDIF file. Currently, I'm just writing the MasterTable (i.e., the entries).
Reading the 500,000 entries from the LDIF file takes 209 seconds, which is around 2,400 entries per second. Writing the MasterTable takes 54 seconds, around 9,300 entries written per second. I still have to write the indexes, which will add another 50 seconds, I think. At this point, we can consider that we are able to write 1,600 entries per second.

Note that I don't yet support a merge sort for the entries I read, so you need to allocate enough memory to hold all the entries (in my case, I used 16 GB). What I'm working on is a merge-sort system, which allows the BulkLoader to work on smaller chunks of data (typically 50,000 entries); a rough sketch of the idea is at the bottom of this mail. This is still a bit slow, but for a smaller number of entries I get a better result (around 3,000 entries injected per second, with 30,000 entries processed).

What takes long is the initialization of the entries we read, assuming we want them to be schema aware, which leads to numerous checks and normalizations. There is little we can do here, except disabling most of the checks, which is quite complicated and would require some huge refactoring of the LDAP API. Loading an LDIF file with 10 million entries would require around 2 hours of processing on my machine.

Overall, reading the entries takes 35% of the total time, processing the read entries before we can write them accounts for 17%, and writing the entries for 24%. The remainder is initialization (but this is a constant cost). Serialization and LDIF parsing are the most expensive operations.

Last but not least, I'm basing my work on the existing BulkLoader written by Kiran. I'm just trying to make it possible to load bigger files, and to do it faster. The logic in charge of writing data in the Mavibot file hasn't been modified so far.

I'll keep you informed this week (I'm working on this after my day-job hours...)

Thanks !
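
PS: to make the merge-sort part a bit more concrete, here is a minimal, self-contained sketch of the idea (a classic external merge sort: sort fixed-size chunks in memory, spill each sorted chunk to a temporary file, then do a k-way merge with a priority queue). This is not the actual BulkLoader code: the real thing works on parsed entries sorted by their normalized DN and feeds them to the Mavibot MasterTable, whereas this toy version just sorts one-record-per-line text, and the names (ChunkedSorter, Run, CHUNK_SIZE) are invented for the example. The point is simply that memory stays bounded by the chunk size, whatever the size of the LDIF file.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Toy sketch of the chunked merge-sort idea (names are made up for the example).
public class ChunkedSorter {
    private static final int CHUNK_SIZE = 50_000; // "entries" held in memory at once

    // One open sorted run plus its current record, used during the final merge.
    private static class Run {
        final BufferedReader reader;
        String current;

        Run(BufferedReader reader, String current) {
            this.reader = reader;
            this.current = current;
        }
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);

        // Phase 1: read the input in chunks of CHUNK_SIZE records (here, one
        // record per line), sort each chunk in memory and spill it to a temp file.
        List<Path> runFiles = new ArrayList<>();

        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;

            while ((line = reader.readLine()) != null) {
                chunk.add(line);

                if (chunk.size() == CHUNK_SIZE) {
                    runFiles.add(writeSortedRun(chunk));
                    chunk.clear();
                }
            }

            if (!chunk.isEmpty()) {
                runFiles.add(writeSortedRun(chunk));
            }
        }

        // Phase 2: k-way merge of the sorted runs with a priority queue,
        // streaming the globally sorted result to the output.
        mergeRuns(runFiles, output);
    }

    private static Path writeSortedRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("bulkload-run-", ".tmp");
        Files.write(run, chunk);

        return run;
    }

    private static void mergeRuns(List<Path> runFiles, Path output) throws IOException {
        PriorityQueue<Run> heap =
            new PriorityQueue<>(Comparator.comparing((Run run) -> run.current));
        List<Run> runs = new ArrayList<>();

        // Open every run and seed the heap with its first record.
        for (Path file : runFiles) {
            BufferedReader reader = Files.newBufferedReader(file);
            Run run = new Run(reader, reader.readLine());
            runs.add(run);

            if (run.current != null) {
                heap.add(run);
            }
        }

        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            // Repeatedly emit the smallest current record and advance that run.
            while (!heap.isEmpty()) {
                Run smallest = heap.poll();
                writer.write(smallest.current);
                writer.newLine();

                smallest.current = smallest.reader.readLine();

                if (smallest.current != null) {
                    heap.add(smallest);
                }
            }
        } finally {
            for (Run run : runs) {
                run.reader.close();
            }

            for (Path file : runFiles) {
                Files.deleteIfExists(file);
            }
        }
    }
}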
