Thanks, Mark, for the explanation. I think your solution would definitely change the tf-idf scoring for documents, since the field is now split up over multiple docs. One option to get around the changed scoring would be to run a completely separate index just for highlighting (with the overlapping docs you described). Still, storing the offsets seems like the most efficient solution, since I wouldn't need a new service to do the highlighting.
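To illustrate why stored offsets make highlighting cheap: if character offsets are kept at index time (e.g. via Lucene term vectors with offsets), a highlighter can wrap each hit without re-analyzing the document. A minimal language-agnostic sketch, where the spans are hypothetical values a real index would supply per matched term:

```python
def highlight(text, offsets, pre="<b>", post="</b>"):
    """Wrap each (start, end) character span in highlight tags.

    offsets: list of (start, end) spans, assumed non-overlapping,
    as stored term offsets would report them.
    """
    out = []
    last = 0
    for start, end in sorted(offsets):
        out.append(text[last:start])          # untouched text before the hit
        out.append(pre + text[start:end] + post)  # the highlighted term
        last = end
    out.append(text[last:])                   # remainder after the last hit
    return "".join(out)

doc = "outgoing president George Bush stated that ..."
# Hypothetical stored spans for the terms "George" and "Bush"
spans = [(19, 25), (26, 30)]
print(highlight(doc, spans))
# -> outgoing president <b>George</b> <b>Bush</b> stated that ...
```

Because the offsets come straight from the index, no tokenization happens at query time, which is the efficiency win over re-analyzing large stored fields.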
On Tue, Feb 3, 2009 at 12:52 PM, markharw00d <markharw...@yahoo.co.uk> wrote:

>> Can you describe this in a little more detail; I'm not exactly sure what
>> you mean.
>
> Break your large text documents into multiple Lucene documents. Rather than
> dividing them up into entirely discrete chunks of text, consider
> storing/indexing *overlapping* sections of text, with an overlap as big as
> the largest "slop" factor you use on Phrase/Span queries, so that you don't
> cut any potential phrases in half and fail to match. E.g.:
>
> This non-overlapping indexing scheme will not match a search for "George
> Bush":
>
> Doc 1 = ".... outgoing president George "
> Doc 2 = "Bush stated that ..."
>
> While this overlapping scheme will match:
>
> Doc 1 = ".... outgoing president George "
> Doc 2 = "president George Bush stated that ..."
>
> This fragmenting approach helps avoid the performance cost of highlighting
> very large documents.
>
> The remaining issue is to remove duplicates in your search results when you
> match multiple chunks, e.g. Lucene Docs #1 and #2 both refer to Input Doc #1
> and will match a search for "president". You will need to store a field for
> the "original document number" and remove any duplicates (or merge them in
> the display, if that is what is required).
>
> Cheers,
> Mark
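Mark's scheme above can be sketched in a few lines. This is an illustrative outline, not a Lucene implementation: the chunk size and overlap (in tokens) are made-up parameters, and in practice the overlap should be at least the largest slop you use on Phrase/Span queries. The second function shows the dedup step he mentions, collapsing chunk hits by the stored "original document number":

```python
def overlapping_chunks(text, chunk_size=200, overlap=10):
    """Split text into token chunks where consecutive chunks share `overlap` tokens,
    so a phrase straddling a boundary still appears whole in some chunk."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last chunk already reaches the end of the text
    return chunks

def dedupe(hits):
    """hits: list of (original_doc_id, score) from chunk matches;
    keep only the best-scoring hit per original document."""
    best = {}
    for doc_id, score in hits:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda kv: -kv[1])

# With an overlap of 2 tokens, the phrase "George Bush" survives intact
# in a chunk even though a boundary falls between its words:
print(overlapping_chunks("outgoing president George Bush stated that",
                         chunk_size=4, overlap=2))
```

Each chunk would then be indexed as its own Lucene document, tagged with a field holding the original document's id so `dedupe` can run over the raw search results.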