Hi Kiran,

I did a bit of profiling today and was able to improve performance by 7%. The method I sped up is PrepareString: I created a specific method which does not create a new char[] when we are dealing with ASCII chars only. The gain is huge.
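To give an idea of what I mean, here is a simplified sketch of that kind of ASCII fast path (not the actual PrepareString code; slowPreparation() is just a placeholder standing in for the full preparation):

    public final class AsciiFastPath
    {
        public static String prepare( String value )
        {
            for ( int i = 0; i < value.length(); i++ )
            {
                if ( value.charAt( i ) > 0x7F )
                {
                    // Non-ASCII char found: fall back to the allocating path
                    return slowPreparation( value );
                }
            }

            // Pure ASCII: return the value as-is, no new char[] is created
            return value;
        }

        // Placeholder for the full preparation, which copies into a new char[]
        private static String slowPreparation( String value )
        {
            char[] chars = value.toCharArray();
            // ... the real mapping/normalization would happen here ...
            return new String( chars );
        }
    }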
Otherwise, most of the time is spent, as expected, in the deserialization of entries read from the MasterTable. At this point, I think we should consider what we can do to avoid that cost. Most of the time, we will have enough memory to load all the elements that will be stored in an index. I'm wondering if it would not be better to parse the LDIF once, gather what we can in memory (without keeping the whole entry in memory), build the index directly, and then process the master table.

It's not easy, because we can't know how many elements we can keep in memory, and when we reach the memory limit, we have to do something completely different. If we decide to deal with the memory limitation from the beginning, we will pay the price, and it will be expensive. OTOH, most of the time we won't have to care about memory, for two reasons:
- either we have to deal with a limited number of entries in the LDIF file,
- or we have enough memory to handle the whole file (on my computer, I can give 14 GB to the JVM, enough to process 5M entries if each of them is 1 KB).

So I'm now thinking it would be better to have two possible algorithms:
- an in-memory one, which does not care about what could happen when we run out of memory,
- a 'smarter' one which takes control when we get an OOM.

This can be done the same way we do with the DN parser: we have a fast parser, which throws an exception when it sees a special case, and a full parser. Same here, but we catch the OOM instead (see the sketch below). Of course, we can probably try to 'predict' which one to use when we start the bulk load, to avoid wasting time in the in-memory process. Or we can let the user decide. Wdyt ?
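To make the idea a bit more concrete, here is a rough sketch of the two-algorithm approach. buildIndexesInMemory(), buildIndexesWithSpilling() and releaseInMemoryStructures() are hypothetical placeholders, and the likelyFitsInMemory() heuristic (with its 3x factor) is only a guess at how the 'prediction' could work:

    import java.io.File;

    public class BulkLoader
    {
        public void buildIndexes( File ldifFile )
        {
            if ( likelyFitsInMemory( ldifFile ) )
            {
                try
                {
                    // Fast path: assume everything fits in the heap
                    buildIndexesInMemory( ldifFile );
                    return;
                }
                catch ( OutOfMemoryError oom )
                {
                    // Drop the partially built structures so the GC can reclaim them
                    releaseInMemoryStructures();
                }
            }

            // Memory-aware path, used when the prediction says 'too big'
            // or when the fast path blew up
            buildIndexesWithSpilling( ldifFile );
        }

        // Heuristic to 'predict' which algorithm to use: compare the LDIF size
        // to the heap we can use. The 3x factor is only a guess.
        private boolean likelyFitsInMemory( File ldifFile )
        {
            long maxHeap = Runtime.getRuntime().maxMemory();
            return ldifFile.length() * 3 < maxHeap;
        }

        // Hypothetical placeholders for the two algorithms
        private void buildIndexesInMemory( File ldifFile ) { /* ... */ }
        private void buildIndexesWithSpilling( File ldifFile ) { /* ... */ }
        private void releaseInMemoryStructures() { /* ... */ }
    }

The one constraint I see is that catching the OOM only works if the in-memory path can release everything it allocated before we retry, otherwise the second attempt will fail the same way.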
