Doug Cutting wrote:
>> Doug, do you believe that storing token offset information (as an option, of course) would be something you'd accept as a contribution to the core of Lucene? Does anyone else think this would be beneficial information to have?


> I have mixed feelings about this. Aesthetically I don't like it much, as it is asymmetric: indexes store sequential positions, while term vectors would store character offsets. On the other hand, it could be useful for summarizing long documents.

I'm sorry, I wasn't clear in my description. I was thinking of storing the token offset information *in addition* to the sequential positions that were (temporarily?) removed from the term vector code just before it was committed, not instead of them.

> Another approach that someone mentioned for solving this problem is to create a fragment index for long documents. For example, if a document is over, say, 32k, you could create a separate index for it that chops its text into 1000-character overlapping chunks. The first chunk would be characters 0-1000, the next 500-1500, and so on. Then, to summarize, you open this index and search it to figure out which chunks have the best hits. Based on the chunk document id, you can then seek into the full text and retokenize only the selected chunks. Such indexes should be fast to open, since they'd be small. I'd recommend calling IndexWriter#setUseCompoundFile(true) on these and optimizing them, so there'd only be a couple of files to open.
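For concreteness, here's a rough sketch of how I read that suggestion (the chunk sizes are the ones from your example; the "start" and "contents" field names are just placeholders I made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

// Sketch only: chop a long document into overlapping 1000-character
// chunks and index each chunk as its own small document.
public class FragmentIndexer {
    public static RAMDirectory buildChunkIndex(String fullText) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        writer.setUseCompoundFile(true);
        final int chunkSize = 1000;
        final int step = 500; // 50% overlap, so no hit straddles a boundary unseen
        for (int start = 0; start < fullText.length(); start += step) {
            int end = Math.min(start + chunkSize, fullText.length());
            Document doc = new Document();
            // store the character offset so we can seek into the full text later
            doc.add(Field.Keyword("start", Integer.toString(start)));
            doc.add(Field.UnStored("contents", fullText.substring(start, end)));
            writer.addDocument(doc);
            if (end == fullText.length()) break;
        }
        writer.optimize();
        writer.close();
        return dir;
    }
}

Searching that little index with the original query would then tell me which start offsets are worth retokenizing.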

In some respects that still doesn't solve my core issue - it just mitigates it for large documents. Retokenization seems to me to be a task that can be done away with entirely, given the right design. Reducing the time it takes to display search results by a minimum of 75ms (5ms per document x the default of 15 documents in my application), and more likely 105-150ms (7-10ms per document), seems a worthwhile endeavour. Of course, on a multiprocessor machine I could make the highlighting code multithreaded, which would reduce that time somewhat.
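By multithreaded I mean nothing fancier than fanning the per-document work out to worker threads and joining before rendering, roughly like this (highlight() here is a stand-in for whatever summarizer is in use, not a real Lucene call):

// Sketch only: summarize each hit on its own thread, then wait for all
// of them before displaying the result page.
public class ParallelHighlighter {
    public static String[] summarize(final String[] docTexts, final String query)
            throws InterruptedException {
        final String[] summaries = new String[docTexts.length];
        Thread[] threads = new Thread[docTexts.length];
        for (int i = 0; i < docTexts.length; i++) {
            final int idx = i;
            threads[i] = new Thread() {
                public void run() {
                    summaries[idx] = highlight(docTexts[idx], query);
                }
            };
            threads[i].start();
        }
        for (int i = 0; i < threads.length; i++) {
            threads[i].join(); // wait for every summary before rendering
        }
        return summaries;
    }

    private static String highlight(String text, String query) {
        return text; // placeholder: real code would retokenize and mark up hits
    }
}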

I was also thinking about another approach where I store the token offset information in separate unindexed fields - one new field holding the offset information for each original field. I could generate this information in a separate analyzer pass when the document is added to the index. This should satisfy my goal of having the offset information easily accessible at search time, and coming up with a decent encoding to store all the term offsets in a single field shouldn't be too difficult. Do you believe this would be a worthwhile approach?
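Concretely, something like this sketch is what I have in mind (the "_offsets" suffix and the term:start-end encoding are just placeholders I made up):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only: index the field normally, then run the analyzer a second
// time to record each token's character offsets in a parallel stored field.
public class OffsetFieldWriter {
    public static void addWithOffsets(Document doc, Analyzer analyzer,
                                      String name, String text) throws Exception {
        doc.add(Field.Text(name, text));
        StringBuffer encoded = new StringBuffer();
        TokenStream stream = analyzer.tokenStream(name, new StringReader(text));
        for (Token t = stream.next(); t != null; t = stream.next()) {
            encoded.append(t.termText()).append(':')
                   .append(t.startOffset()).append('-')
                   .append(t.endOffset()).append(' ');
        }
        stream.close();
        // stored but not indexed, so it's retrievable at search time
        // without polluting the inverted index
        doc.add(Field.UnIndexed(name + "_offsets", encoded.toString()));
    }
}

At search time the stored field comes back with the hit and can simply be decoded, so no retokenization is needed at all.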



Regards,

Bruce Ritchie
http://www.jivesoftware.com/
