On 08/12/11 20:04, Nathan Kurz wrote:
> I'm mostly listening in on this conversation because I haven't thought much about indexing, but the magnitude of improvement here surprises me: I wouldn't have thought that there would be that much time to shave off! My presumption was that everything would be dominated by disk IO, and that the actual tokenizing time would be tiny. Are these numbers both working within memory with a pre-warmed cache so no disk reads are involved? Also, have you controlled for whether the data is synced to disk after the indexing?
These numbers are with a pre-warmed cache, and the data isn't synced AFAIU. But I think the analysis chain is CPU-bound in the general case: all that tokenizing, normalizing, and stemming uses a lot of CPU cycles.
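For a rough sense of why an analysis chain can be CPU-bound even with everything in memory, here's a toy sketch in Python. This is not the indexer's actual code; the regex tokenizer and the one-line suffix-stripping "stemmer" are stand-ins for real analyzers, just to show that the whole chain touches every character with no disk IO involved:

```python
# Toy analysis chain: tokenize -> case-fold -> naive stem, run over
# in-memory text, so any cost measured here is pure CPU.
import re
import time

TOKEN_RE = re.compile(r"\w+", re.UNICODE)

def analyze(text):
    tokens = TOKEN_RE.findall(text)          # tokenizing
    tokens = [t.casefold() for t in tokens]  # normalizing (case folding)
    # Naive suffix-stripping "stemmer", standing in for a real one:
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]
    return tokens

doc = "The Indexers index Documents quickly " * 10000
start = time.perf_counter()
tokens = analyze(doc)
elapsed = time.perf_counter() - start
print(len(tokens), "tokens in", f"{elapsed:.4f}s")
```

Each stage makes a full pass over the token stream, which is why speeding up the per-token algorithms (rather than the IO path) pays off.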
> I'm not in a position to do it, but it might be insightful to do a quick profile of where these two are spending their time. Are we gaining because the algorithm is faster, or because we have less function call overhead, or because of something confounding?
It's mainly that the algorithms are faster. The CaseFolder seems to be especially slow, but I have no idea why.
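One possible contributor to case-folding cost, shown here as a general Python illustration (not a claim about the CaseFolder in question): full Unicode case folding is more work than an ASCII tolower() loop, since it can map one code point to several and differs from plain lowercasing.

```python
# Full Unicode case folding vs. simple lowercasing:
print("Straße".lower())     # lowercasing keeps the sharp s
print("Straße".casefold())  # case folding expands it to "ss"
# Folding can change the string length, so a correct case folder
# cannot just rewrite bytes in place:
print(len("ß"), len("ß".casefold()))
```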
> Oprofile on Linux is very easy to use once you have it set up. In case you aren't familiar with it, this is a good intro: http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.
I have used it once and found it hard to set up on a virtual machine. But it's very useful if you want to profile long-running processes.
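For reference, a minimal sketch of the classic opcontrol-based oprofile workflow (the 0.9.x series; it needs root and a kernel with profiling support, which is part of why it's awkward inside a VM). The `./indexer` binary name is hypothetical:

```shell
sudo opcontrol --init            # load the oprofile kernel module
sudo opcontrol --no-vmlinux      # skip kernel symbols if no vmlinux is at hand
sudo opcontrol --start           # begin sampling system-wide
./indexer corpus/                # run the CPU-bound workload under sampling
sudo opcontrol --dump            # flush collected samples to disk
opreport --symbols ./indexer     # per-symbol breakdown of where time went
sudo opcontrol --shutdown        # stop the profiler
```

The `opreport --symbols` output is what would show whether the time goes to the tokenizer, the CaseFolder, or function call overhead.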
Nick
