When is all this nifty code going to land in trunk? Don't wait for anyone to give you permission, Nick; that decision is all yours.
----- Original Message -----
> From: Nick Wellnhofer <[email protected]>
> To: [email protected]
> Cc:
> Sent: Thursday, December 8, 2011 2:43 PM
> Subject: Re: [lucy-dev] Some quick benchmarks
>
> On 08/12/11 20:04, Nathan Kurz wrote:
>> I'm mostly listening in on this conversation because I haven't thought
>> much about indexing, but the magnitude of improvement here surprises
>> me: I wouldn't have thought that there would be that much time to
>> shave off! My presumption was that everything would be dominated by
>> Disk IO, and that the actual tokenizing time would be tiny. Are
>> these numbers both working within memory with a pre-warmed cache so no
>> disk reads are involved? Also, have you controlled for whether the
>> data is sync'ed to disk after the indexing?
>
> These numbers are with pre-warmed cache. Also, the data isn't synced AFAIU.
> But I think the analysis chain is CPU bound in the general case. All that
> tokenizing, normalizing and stemming uses a lot of CPU cycles.
>
>> I'm not in a position to do it, but it might be insightful to do a
>> quick profile of where these two are spending their time. Are we
>> gaining because the algorithm is faster, or because we have less
>> function call overhead, or because of something confounding?
>
> It's mainly that the algorithms are faster. The CaseFolder seems to be
> especially slow but I have no idea why.
>
>> Oprofile on Linux is very easy to use once you have it set up. In case
>> you aren't familiar with it, this is a good intro:
>> http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.
>
> I have used it once and found it hard to setup on a virtual machine. But
> it's very useful if you want to profile long running processes.
>
> Nick
