Hi, where is the best place to plug customised document-level and sentence-level analysis features into Lucene's Analyzer and filter architecture? Is there any precedent for this?
My technical problem: I'd like to add a summarization feature to my system which should (1) make the best use of the architecture already there in Lucene, namely the Analyzer, and (2) be able to trigger summarization on a per-document basis while requiring sentence-level information such as full stops and commas. To preserve this "punctuation", a special Tokenizer can be used that emits such landmarks as tokens instead of filtering them out. The actual SummaryFilter then removes the punctuation again for its successors in the Analyzer's filter chain.

The other, more complex issue is the document-level information: since Lucene's architecture uses a filter concept that knows nothing about the document the tokens are generated from (which is good abstraction), a document-specific operation like summarization is a bit awkward. On the other hand, I'd like to keep the existing filter structure in place for preprocessing the input, because my raw texts are generated by PDFBox and the like, and a lot of unwanted characters need to be filtered out prior to summarization, e.g. with a filter that accepts or rejects tokens according to pass and reject patterns.

One thing I came up with was to keep a Summary object in the SummaryFilter that can be accessed from the top level of the indexing process and read out after the Document has been processed, using a summarization trigger command (rough sketches below my signature). However, as inversion of the term lists is done only after calling IndexWriter.addDocument(Document), the summarization trigger would need to be called by addDocument() itself, which would require extending Lucene at a deeper level.

Any idea how to solve this second problem?

Thanks and regards,
Gregor
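
P.S. To make point (1) a bit more concrete, here is roughly the shape of the filter I have in mind. This is an untested sketch against the classic TokenStream.next() API; SummaryFilter and Summary are placeholder classes of my own (they are not part of Lucene), and the punctuation handling is deliberately naive.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Consumes the punctuation tokens emitted by the special Tokenizer,
 * feeds everything into a per-document Summary, and passes only the
 * "real" tokens on to the rest of the filter chain.
 */
public class SummaryFilter extends TokenFilter {

    private final Summary summary = new Summary();

    public SummaryFilter(TokenStream input) {
        super(input);
    }

    /** Read out by the indexing code once the document has been processed. */
    public Summary getSummary() {
        return summary;
    }

    public Token next() throws IOException {
        Token token;
        while ((token = input.next()) != null) {
            String text = token.termText();
            if (".".equals(text)) {
                // Sentence landmark: record it, but do not pass it downstream.
                summary.endSentence();
                continue;
            }
            if (",".equals(text)) {
                // Clause landmark: swallow it as well.
                continue;
            }
            // Ordinary token: feed the summarizer and hand it on unchanged.
            summary.addWord(text);
            return token;
        }
        summary.endSentence();  // flush the last sentence
        return null;
    }
}

/** Minimal placeholder for the per-document summary state (separate file in practice). */
class Summary {

    private final List sentences = new ArrayList();
    private StringBuffer current = new StringBuffer();

    public void addWord(String word) {
        if (current.length() > 0) {
            current.append(' ');
        }
        current.append(word);
    }

    public void endSentence() {
        if (current.length() > 0) {
            sentences.add(current.toString());
            current = new StringBuffer();
        }
    }

    /** The raw material the summarizer would later rank and trim. */
    public List getSentences() {
        return sentences;
    }
}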

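And this is how I imagined reading the summary out at indexing time, which is exactly where the timing problem shows up: the analysis (and therefore the SummaryFilter) only runs inside addDocument(), so the Summary is complete only after that call returns. Again just a sketch against the pre-3.0 API; SummarizingAnalyzer and PunctuationPreservingTokenizer are placeholder names of my own, and keeping a reference to the last filter like this is obviously only safe for single-threaded indexing of one analyzed field.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class SummarizingAnalyzer extends Analyzer {

    private SummaryFilter lastFilter;

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Hypothetical tokenizer that emits '.' and ',' as tokens instead of discarding them.
        TokenStream stream = new PunctuationPreservingTokenizer(reader);
        lastFilter = new SummaryFilter(stream);
        return new LowerCaseFilter(lastFilter);
    }

    public Summary getLastSummary() {
        return lastFilter == null ? null : lastFilter.getSummary();
    }
}

// At indexing time (roughly):
//
//   SummarizingAnalyzer analyzer = new SummarizingAnalyzer();
//   IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
//
//   Document doc = new Document();
//   doc.add(new Field("contents", cleanedPdfText, Field.Store.NO, Field.Index.TOKENIZED));
//   writer.addDocument(doc);                      // analysis happens in here
//
//   Summary summary = analyzer.getLastSummary();  // only valid after addDocument() returns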