Hi, I ran some benchmarks a while ago, and my comments on performance (as well as on the quality of the highlighting) are:
- Small documents (a few KB) show very little difference; RegEx may even be faster! But hold on...
- Medium to large documents, above 50 KB of pure text:
-- Simple query (for instance 2 or 3 terms in a BooleanQuery) with the terms found at the beginning of the document: no real improvement, because (in my implementation) I stop reading the file after I have found a few terms. I process 4 KB of text at a time, so this is the same as the first case: effectively a small document, as we don't read the full content.
-- Simple query, but with the terms near the end of the document: now VERY HUGE differences are seen. Also, when RegExp does the job and the whole document is treated as one string, I have run out of memory after 20 sec of processing!! That's why I decided to move to a more engineered implementation that uses a 4 KB buffer (it is also a DLL now, not Java code anymore, with performance 1200% better than the same Java code). That solution helped me a lot with performance, but from a code design perspective it is really, really hard to keep up to date and to extend with new features: I want to implement the 2 phrase search with slop... a nightmare!!
- Another case of big improvement from having the start/end offsets of terms is the PhraseQuery with slop; the RegExp dies on this case.
- Then, for people who want to provide the "Google-like" cached version, having term positions as offsets will make the code much faster (I did tests, as I provide such a feature).

That said, my highlight implementation gives good quality highlighting (still some little bugs, well...).

All the tests I did showed that the bottleneck is the RegExp computation: high CPU usage (close to 100%), RAM (not a big concern), time (very important). On average, 95% of the highlight time is lost there. I am not saying that with the term position offsets the code will be 95% faster, but it will certainly recover a big portion of that; we still need to read data from disk!
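To make the 4 KB buffering and early stop described above concrete, here is a minimal Java sketch. It is not the DLL implementation mentioned in this message. The class name `ChunkedTermScanner`, the method `findOffsets`, the `maxHits` parameter, and the plain case-insensitive substring matching (no tokenizing, no stemming) are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class ChunkedTermScanner {
    static final int CHUNK_SIZE = 4096; // 4 KB of text per read

    /**
     * Returns {start, end} character offsets of term hits, stopping as soon
     * as maxHits hits are found. Terms must be non-empty and lower case.
     * Hits are grouped per chunk and per term, not globally sorted.
     */
    static List<int[]> findOffsets(Reader in, List<String> lowerCaseTerms, int maxHits)
            throws IOException {
        int maxLen = 0;
        for (String t : lowerCaseTerms) maxLen = Math.max(maxLen, t.length());

        List<int[]> hits = new ArrayList<>();
        StringBuilder window = new StringBuilder();
        char[] buf = new char[CHUNK_SIZE];
        int windowStart = 0; // absolute offset of window.charAt(0)
        int carry = 0;       // chars kept from the previous chunk
        int n;
        // Early stop: no further reads once enough hits are collected.
        while (hits.size() < maxHits && (n = in.read(buf)) != -1) {
            // Lower-case char by char so offsets stay aligned with the input.
            for (int i = 0; i < n; i++) window.append(Character.toLowerCase(buf[i]));
            for (String term : lowerCaseTerms) {
                // Matches ending inside the carried tail were already reported.
                int from = Math.max(0, carry - term.length() + 1);
                int idx;
                while (hits.size() < maxHits && (idx = window.indexOf(term, from)) != -1) {
                    hits.add(new int[] {windowStart + idx, windowStart + idx + term.length()});
                    from = idx + 1;
                }
            }
            // Keep a short tail so a term straddling two chunks is still found.
            int keep = Math.min(Math.max(maxLen - 1, 0), window.length());
            windowStart += window.length() - keep;
            window.delete(0, window.length() - keep);
            carry = keep;
        }
        return hits;
    }
}
```

With a query whose terms appear near the start of the document, the loop returns after the first chunk, which is why a large document behaves like a small one in that case. Memory use stays near one chunk plus the short tail, instead of the whole document held as one string.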
I ran tests with documents of different sizes, from 1 KB up to 5 MB of pure text, and 100 different queries, varying in type and terms. All were well studied to hit different cases (e.g. a PhraseQuery with the terms in the middle of the document, etc.). Many cases are possible in this context, so the highlighter may not be very accurate in retrieving all of them.

In my opinion, TermPositions offsets will be a very good thing to have, as long as they do not slow down the overall system so much that the end result (search + highlight) shows no improvement.

Thank you.
Paolo Spadafora.

----- Original Message -----
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: "Lucene Developers List" <[EMAIL PROTECTED]>
Sent: Wednesday, February 25, 2004 1:04 PM
Subject: Re: Dmitry's Term Vector stuff, plus some

> [EMAIL PROTECTED] wrote:
> > I'm not sure what applications people have in mind for Term Vector support but I would prefer to have the original text positions (not term sequence positions) stored so I can offer this:
> >
> > 1) Significant terms/phrases identification
> > Like "Gigabits" on gigablast.com - used to offer choices of (unstemmed) "significant" terms and phrases for query expansion to the end user.
>
> I would think that this could be done more easily with sequence
> positions than with character positions: if you're searching for phrases
> you're trying to find are terms which are adjacent. And most web search
> engines index unstemmed words. Even if you only indexed stemmed forms,
> you'd still need to lowercase and otherwise normalize the text before
> extracting words for comparison.
>
> > 2) Optimised Highlighting
> > No more re-tokenizing of text to find unstemmed forms.
>
> Is this really a performance bottleneck? Have you benchmarked it?
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]