Hi, is there any way to do sentence-level or document-level analysis with the current Analysis/TokenStream architecture? If not, where is the best place to plug in customised document-level and sentence-level analysis features? Is there any precedent for this?
My technical problem: I'd like to add a summarization feature to my system, which should (1) make the best use of the architecture already present in Lucene, and (2) be able to trigger summarization on a per-document basis while requiring sentence-level information, such as full stops and commas.

To preserve this punctuation, a special Tokenizer can be used that emits such landmarks as tokens instead of filtering them out. The actual SummaryFilter then strips the punctuation again for its successors in the Analyzer's filter chain.

The other, more complex issue is document-level information: since Lucene's filter concept knows nothing about the document the tokens are generated from (which is good abstraction), a document-specific operation like summarization sits awkwardly in this design (and was probably never intended). On the other hand, I'd like to keep the existing filter structure in place for preprocessing the input, because my raw texts are generated by converters from other formats that emit unwanted characters (from figures, page numbers, etc.), which my custom Analyzer filters out anyway.

Any idea how to solve this second problem? Is support for such document/sentence structure analysis planned?

Thanks and regards,
Gregor
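P.S. For illustration, here is a minimal, Lucene-free sketch of the sentence-level segmentation I have in mind, using only the JDK's java.text.BreakIterator (the class name SentenceSplitter is my own invention, not anything from Lucene). A summarizer could run something like this over the stored document text, independently of the token filter chain:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits raw document text into sentences using the
// JDK's locale-aware sentence BreakIterator. This runs on the whole document,
// outside Lucene's per-token filter chain.
public class SentenceSplitter {

    public static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            // Each boundary pair delimits one sentence (including trailing space).
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
            end = it.next();
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> out =
            split("Lucene filters tokens. Summaries need sentences. Both can coexist.");
        System.out.println(out.size()); // 3 sentences
    }
}
```

This sidesteps the punctuation-as-token trick entirely, at the cost of analyzing the text twice; whether that is acceptable depends on how large the documents are.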
