On 08/12/11 20:04, Nathan Kurz wrote:
I'm mostly listening in on this conversation because I haven't thought
much about indexing, but the magnitude of improvement here surprises
me: I wouldn't have thought that there would be that much time to
shave off! My presumption was that everything would be dominated by
disk I/O, and that the actual tokenizing time would be tiny. Are
these numbers both working within memory with a pre-warmed cache so no
disk reads are involved? Also, have you controlled for whether the
data is synced to disk after the indexing?

These numbers are with a pre-warmed cache. Also, the data isn't synced, AFAIU. But I think the analysis chain is CPU-bound in the general case: all that tokenizing, normalizing and stemming uses a lot of CPU cycles.
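To make the CPU-bound claim concrete, here is a toy sketch of the kind of analysis chain being discussed (tokenize, then case-fold, then stem). The function and class names are illustrative, not Lucy's actual API, and the stemmer is a crude suffix-stripper standing in for a real one such as Porter; the point is only that every stage touches every character of the input, so once the text is in memory the work is pure CPU.

```python
import re

TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    # Split text into word tokens.
    return TOKEN_RE.findall(text)

def case_fold(tokens):
    # Real case folding is Unicode-aware and costlier than lower(),
    # which may be part of why a CaseFolder step shows up in profiles.
    return [t.lower() for t in tokens]

def stem(tokens):
    # Crude suffix stripping as a stand-in for a real stemmer.
    out = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[:-len(suffix)]
                break
        out.append(t)
    return out

def analyze(text):
    # The full chain: every stage iterates over every token.
    return stem(case_fold(tokenize(text)))

print(analyze("Tokenizing uses CPU"))  # → ['tokeniz', 'use', 'cpu']
```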

I'm not in a position to do it, but it might be insightful to do a
quick profile of where these two are spending their time.  Are we
gaining because the algorithm is faster, or because we have less
function call overhead, or because of something confounding?

It's mainly that the algorithms are faster. The CaseFolder seems to be especially slow, but I have no idea why.

Oprofile
on Linux is very easy to use once you have it set up.  In case you
aren't familiar with it, this is a good intro:
http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.

I have used it once and found it hard to set up on a virtual machine. But it's very useful if you want to profile long-running processes.

Nick
