Hi guys, I ran a small test on a 500K-entry LDIF file. Currently, I'm just writing the MasterTable (i.e., the entries).
Reading the 500,000 entries from the LDIF file takes 209 seconds, which is around 2,400 entries per second. Writing the MasterTable takes 54 seconds, around 9,300 entries written per second. I still have to write the indexes, which will add another 50 seconds, I think. At this point, we can consider that we are able to write 1,600 entries per second.

Note that I don't yet support a merge sort for the entries I read, so you need to allocate enough memory to hold all the entries (in my case, I used 16 GB). What I'm working on is a merge-sort system, which allows the BulkLoader to work on smaller chunks of data (typically 50,000 entries); a rough sketch of the idea is at the bottom of this mail. This is still a bit slow, but for a smaller number of entries I get a better result (around 3,000 entries injected per second, with 30,000 entries processed).

What takes long is the initialization of the entries we read, assuming we want them to be schema aware, which leads to numerous checks and normalizations. There is little we can do here, except disabling most of the checks, which is quite complicated and would require some huge refactoring of the LDAP API. Loading an LDIF file with 10 million entries would require around 2 hours of processing on my machine.

Overall, reading the entries takes 35% of the total time, processing the read entries before we can write them accounts for 17%, and writing the entries for 24%. The remainder is initialization (but this is a constant cost). Serialization and LDIF parsing are the most expensive operations.

Last but not least, I'm basing my work on the existing BulkLoader written by Kiran. I'm just trying to make it possible to load bigger files, and to do it faster. The logic in charge of writing data in the Mavibot file hasn't been modified so far.

I'll keep you informed this week (I'm working on this after my day-job hours...)

Thanks !
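
PS: to make the merge-sort part a bit more concrete, here is a minimal, self-contained sketch of the idea (a classic external merge sort: sort fixed-size chunks in memory, spill each sorted chunk to a temporary file, then do a k-way merge with a priority queue). This is not the actual BulkLoader code: the real thing works on parsed entries sorted by their normalized DN and feeds them to the Mavibot MasterTable, whereas this toy version just sorts one-record-per-line text, and the names (ChunkedSorter, Run, CHUNK_SIZE) are invented for the example. The point is simply that memory stays bounded by the chunk size, whatever the size of the LDIF file.

import java.io.*;
import java.nio.file.*;
import java.util.*;

// Toy sketch of the chunked merge-sort idea (names are made up for the example).
public class ChunkedSorter {
    private static final int CHUNK_SIZE = 50_000; // "entries" held in memory at once

    // One open sorted run plus its current record, used during the final merge.
    private static class Run {
        final BufferedReader reader;
        String current;

        Run(BufferedReader reader, String current) {
            this.reader = reader;
            this.current = current;
        }
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);

        // Phase 1: read the input in chunks of CHUNK_SIZE records (here, one
        // record per line), sort each chunk in memory and spill it to a temp file.
        List<Path> runFiles = new ArrayList<>();

        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;

            while ((line = reader.readLine()) != null) {
                chunk.add(line);

                if (chunk.size() == CHUNK_SIZE) {
                    runFiles.add(writeSortedRun(chunk));
                    chunk.clear();
                }
            }

            if (!chunk.isEmpty()) {
                runFiles.add(writeSortedRun(chunk));
            }
        }

        // Phase 2: k-way merge of the sorted runs with a priority queue,
        // streaming the globally sorted result to the output.
        mergeRuns(runFiles, output);
    }

    private static Path writeSortedRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("bulkload-run-", ".tmp");
        Files.write(run, chunk);

        return run;
    }

    private static void mergeRuns(List<Path> runFiles, Path output) throws IOException {
        PriorityQueue<Run> heap =
            new PriorityQueue<>(Comparator.comparing((Run run) -> run.current));
        List<Run> runs = new ArrayList<>();

        // Open every run and seed the heap with its first record.
        for (Path file : runFiles) {
            BufferedReader reader = Files.newBufferedReader(file);
            Run run = new Run(reader, reader.readLine());
            runs.add(run);

            if (run.current != null) {
                heap.add(run);
            }
        }

        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            // Repeatedly emit the smallest current record and advance that run.
            while (!heap.isEmpty()) {
                Run smallest = heap.poll();
                writer.write(smallest.current);
                writer.newLine();

                smallest.current = smallest.reader.readLine();

                if (smallest.current != null) {
                    heap.add(smallest);
                }
            }
        } finally {
            for (Run run : runs) {
                run.reader.close();
            }

            for (Path file : runFiles) {
                Files.deleteIfExists(file);
            }
        }
    }
}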
