Gregor Heinrich wrote:
Yes, copying a summary from one field to an untokenized field was the plan.

I identified DocumentWriter.invertDocument() to be a possible place for an
addition of this document-level analysis. But I admit this appears way too
low-level and inflexible for the overall design.

So I'll make it "two-pass" indexing.

The way I did it: I'm indexing HTML documents, so before Lucene can do anything I need to run an HTML parser. While scanning the tags, this parser builds two text strings at the same time: one that holds the document content for indexing and one that holds it for summarizing.

There are relevant differences between those two strings, for example in the handling of headlines and punctuation. Lucene then indexes the first string, and the second is given to a summarizer. The summary it returns is added to the document as a Lucene field.

This way I can do summarizing and indexing in one pass.
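
As a rough sketch of the two-string idea in Java (not the actual parser): the class name TwoStreamExtractor, the regex-based tag scanner, the rule that headline text goes only into the index string, and the first-sentence summarizer are all assumptions made for illustration. A real setup would use a proper HTML parser and a real summarizer, and would then hand both results to Lucene.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans HTML once and fills two buffers in parallel.
class TwoStreamExtractor {
    // Either a tag (group 1 = optional "/", group 2 = tag name) or a run of text (group 3).
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)");

    final StringBuilder indexText = new StringBuilder();   // content to be indexed
    final StringBuilder summaryText = new StringBuilder(); // content for the summarizer

    void scan(String html) {
        boolean inHeadline = false;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(2) != null) {
                // A tag: track whether we are inside <h1>..<h6>.
                if (m.group(2).matches("(?i)h[1-6]")) {
                    inHeadline = m.group(1).isEmpty(); // opening tag => true, closing => false
                }
                continue;
            }
            String text = m.group(3).trim();
            if (text.isEmpty()) {
                continue;
            }
            // Assumption: everything is indexed, headlines included.
            indexText.append(text).append(' ');
            // Assumption: headlines are left out of the text given to the summarizer.
            if (!inHeadline) {
                summaryText.append(text).append(' ');
            }
        }
    }

    // Stand-in summarizer: returns the first sentence only.
    static String summarize(String text) {
        int end = text.indexOf(". ");
        return end < 0 ? text.trim() : text.substring(0, end + 1);
    }
}
```

After scan() has run, indexText would go into the indexed content field, and the result of summarize(summaryText.toString()) would go into a separate stored, untokenized field of the same Lucene document, so the HTML is read only once.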

Ulrich


