On Thu, Dec 8, 2011 at 10:02 AM, Nick Wellnhofer <[email protected]> wrote:
> On 08/12/2011 01:41, Marvin Humphrey wrote:
>
> Here is more data from a real world indexing run:
>
> RT+CF: 139 secs
> ST+N: 112 secs
Hi Nick --

I'm mostly listening in on this conversation because I haven't thought much about indexing, but the magnitude of the improvement here surprises me: I wouldn't have thought there was that much time to shave off. My assumption was that everything would be dominated by disk I/O and that the actual tokenizing time would be tiny. Are these numbers both from runs working in memory with a pre-warmed cache, so that no disk reads are involved? Also, have you controlled for whether the data is synced to disk after indexing?

I'm not in a position to do it myself, but it might be insightful to do a quick profile of where these two runs spend their time. Are we gaining because the algorithm is faster, because we have less function call overhead, or because of something confounding?

Oprofile on Linux is very easy to use once you have it set up. In case you aren't familiar with it, this is a good intro:

http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/

Thanks!

--nate
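
P.S. In case a concrete starting point helps, a profiling run with the classic opcontrol interface looks roughly like the sketch below. The indexer invocation is just a placeholder for whatever script produced the numbers above, and the cache-dropping step only matters if you want a cold-cache comparison.

    # Drop the page cache for a cold-cache run
    # (skip this when comparing warm-cache numbers).
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

    # Start system-wide profiling; no kernel symbols needed here.
    sudo opcontrol --no-vmlinux
    sudo opcontrol --reset
    sudo opcontrol --start

    # Run the indexer under test (placeholder name).
    perl indexer.pl

    # Flush samples and stop the profiling daemon.
    sudo opcontrol --dump
    sudo opcontrol --shutdown

    # Per-symbol breakdown of where the time went.
    opreport --symbols | head -40

Something along those lines should be enough to tell whether the win is in the tokenizer itself or in the call overhead around it.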
