Hi, where is the best place to plug customised document-level and sentence-level analysis features into Lucene's Analyzer and filter architecture? Is there any precedent for this?
My technical problem: I'd like to add a summarization feature to my system which should (1) make the best use of the architecture already there in Lucene, namely the Analyzer, and (2) be able to trigger summarization on a per-document basis while requiring sentence-level information such as full stops and commas. To preserve this "punctuation", a special Tokenizer can be used that emits such landmarks as tokens instead of filtering them out. The actual SummaryFilter then removes the punctuation again for its successors in the Analyzer's filter chain.

The other, more complex issue is the document-level information: since Lucene's architecture uses a filter concept that knows nothing about the document the tokens are generated from (which is good abstraction), a document-specific operation like summarization is a bit awkward. On the other hand, I'd like to keep the existing filter structure in place for preprocessing the input, because my raw texts are generated by PDFBox and the like, and a lot of unwanted characters need to be filtered out prior to summarization, e.g. with a filter that accepts or rejects tokens according to pass and reject patterns.

One thing I came up with was to keep a Summary object in the SummaryFilter that can be accessed from the top level of the indexing process and read out after the Document has been processed, using a summarization trigger command (rough sketches below my signature). However, as inversion of the term lists is done only after calling IndexWriter.addDocument(Document), the summarization trigger would need to be called by addDocument() itself, which would require extending Lucene at a deeper level.

Any idea how to solve this second problem?

Thanks and regards,
Gregor
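
P.S. To make point (1) a bit more concrete, here is roughly the shape of the filter I have in mind. This is an untested sketch against the classic TokenStream.next() API; SummaryFilter and Summary are placeholder classes of my own (they are not part of Lucene), and the punctuation handling is deliberately naive.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Consumes the punctuation tokens emitted by the special Tokenizer,
 * feeds everything into a per-document Summary, and passes only the
 * "real" tokens on to the rest of the filter chain.
 */
public class SummaryFilter extends TokenFilter {

    private final Summary summary = new Summary();

    public SummaryFilter(TokenStream input) {
        super(input);
    }

    /** Read out by the indexing code once the document has been processed. */
    public Summary getSummary() {
        return summary;
    }

    public Token next() throws IOException {
        Token token;
        while ((token = input.next()) != null) {
            String text = token.termText();
            if (".".equals(text)) {
                // Sentence landmark: record it, but do not pass it downstream.
                summary.endSentence();
                continue;
            }
            if (",".equals(text)) {
                // Clause landmark: swallow it as well.
                continue;
            }
            // Ordinary token: feed the summarizer and hand it on unchanged.
            summary.addWord(text);
            return token;
        }
        summary.endSentence();  // flush the last sentence
        return null;
    }
}

/** Minimal placeholder for the per-document summary state (separate file in practice). */
class Summary {

    private final List sentences = new ArrayList();
    private StringBuffer current = new StringBuffer();

    public void addWord(String word) {
        if (current.length() > 0) {
            current.append(' ');
        }
        current.append(word);
    }

    public void endSentence() {
        if (current.length() > 0) {
            sentences.add(current.toString());
            current = new StringBuffer();
        }
    }

    /** The raw material the summarizer would later rank and trim. */
    public List getSentences() {
        return sentences;
    }
}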

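And this is how I imagined reading the summary out at indexing time, which is exactly where the timing problem shows up: the analysis (and therefore the SummaryFilter) only runs inside addDocument(), so the Summary is complete only after that call returns. Again just a sketch against the pre-3.0 API; SummarizingAnalyzer and PunctuationPreservingTokenizer are placeholder names of my own, and keeping a reference to the last filter like this is obviously only safe for single-threaded indexing of one analyzed field.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class SummarizingAnalyzer extends Analyzer {

    private SummaryFilter lastFilter;

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Hypothetical tokenizer that emits '.' and ',' as tokens instead of discarding them.
        TokenStream stream = new PunctuationPreservingTokenizer(reader);
        lastFilter = new SummaryFilter(stream);
        return new LowerCaseFilter(lastFilter);
    }

    public Summary getLastSummary() {
        return lastFilter == null ? null : lastFilter.getSummary();
    }
}

// At indexing time (roughly):
//
//   SummarizingAnalyzer analyzer = new SummarizingAnalyzer();
//   IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
//
//   Document doc = new Document();
//   doc.add(new Field("contents", cleanedPdfText, Field.Store.NO, Field.Index.TOKENIZED));
//   writer.addDocument(doc);                      // analysis happens in here
//
//   Summary summary = analyzer.getLastSummary();  // only valid after addDocument() returns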