Re: retrieve tokens

Martijn Wed, 22 Dec 2004 12:45:12 -0800

Erik Hatcher wrote:

On Dec 22, 2004, at 12:43 PM, M. Smit wrote:
Consider that you're only highlighting 20 or so entries at one time. Getting the text from a Lucene index you're already navigating will be quite quick. But it shouldn't be too bad to pull 20 records from a database either.

There is one other consideration, and that is to use the new (CVS only) feature of capturing term vectors with position information. The author of the Highlighter, Mark Harwood, has posted in the not too distant past, an update to the Highlighter that can use this position information for highlighting rather than re-analyzing the original text. The re-analysis of the text may be the bottleneck, not the database access.

So indeed the fabrication of the snippet would have to traverse the 200 pages of plain text per hit. I don't think that pulling the records from the database would be the issue.. But I'm somewhat intriqued by the options of leaving out the database altogether (for the presentation of the hits that is) and store this plain text representation of the pdf in Lucene. My worries: how does size (the extra size I would need to store the plain text) effect Lucene operations.. Is the order of N.N or like N.ln(N)?


--
31.69 nHz = once a year


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: retrieve tokens

Reply via email to