Gregor, I don't have any benchmarks for summarization. Sorry! I have two test versions of commercial summarizers, and their performance is better than Classifier4J's, but those are written in C++, so you can't compare them properly.
regards, Maurits

----- Original Message -----
From: "Gregor Heinrich" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, December 16, 2003 9:35 PM
Subject: RE: Summarization; sentence-level and document-level filters.

> Maurits: thanks for the hint to Classifier4J -- I have had a look at this
> package and tried the SimpleSummarizer, and it seems to work fine.
> (However, as I don't know the benchmarks for summarization, I'm not the
> one to judge.)
>
> Do you have experience with it?
>
> Gregor
>
> -----Original Message-----
> From: maurits van wijland [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 16, 2003 1:09 AM
> To: Lucene Users List; [EMAIL PROTECTED]
> Subject: Re: Summarization; sentence-level and document-level filters.
>
> Hi Gregor,
>
> So far as I know, there is no summarizer in the plans. But maybe I can
> help you along the way. Have a look at the Classifier4J project on
> SourceForge:
>
> http://classifier4j.sourceforge.net/
>
> It has a small document summarizer besides a Bayes classifier. It might
> speed up your coding.
>
> On the level of Lucene, I have no idea. My gut feeling says that a
> summary should be built before the text is tokenized! The tokenizer can
> of course be used when analysing a document, but hooking into the Lucene
> indexing is a bad idea, I think.
>
> Does anyone else have any ideas?
>
> regards,
>
> Maurits
>
> ----- Original Message -----
> From: "Gregor Heinrich" <[EMAIL PROTECTED]>
> To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> Sent: Monday, December 15, 2003 7:41 PM
> Subject: Summarization; sentence-level and document-level filters.
>
> > Hi,
> >
> > is there any possibility to do sentence-level or document-level
> > analysis with the current Analysis/TokenStream architecture? Or where
> > else is the best place to plug in customised document-level and
> > sentence-level analysis features? Is there any "precedence case"?
> >
> > My technical problem:
> >
> > I'd like to include a summarization feature in my system, which should
> > (1) make the best use of the architecture already there in Lucene, and
> > (2) be able to trigger summarization on a per-document basis while
> > requiring sentence-level information, such as full stops and commas.
> > To preserve this "punctuation", a special Tokenizer can be used that
> > outputs such landmarks as tokens instead of filtering them out. The
> > actual SummaryFilter then filters out the punctuation for its
> > successors in the Analyzer's filter chain.
> >
> > The other, more complex thing is the document-level information: as
> > Lucene's architecture uses a filter concept that does not know about
> > the document the tokens are generated from (which is good
> > abstraction), a document-specific operation like summarization is a
> > bit of an awkward thing with this (and originally not intended, I
> > guess). On the other hand, I'd like to have the existing filter
> > structure in place for preprocessing of the input, because my raw
> > texts are generated by converters from other formats that output
> > unwanted chars (from figures, page numbers, etc.), which are filtered
> > out anyway by my custom Analyzer.
> >
> > Any idea how to solve this second problem? Is there any support for
> > such document/sentence structure analysis planned?
> >
> > Thanks and regards,
> >
> > Gregor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
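[Editor's note: the extractive, word-frequency approach that summarizers in the style of Classifier4J's SimpleSummarizer typically take can be sketched in a few lines of plain Java. The class and method names below are illustrative only; this is not Classifier4J's actual code or API.]

```java
import java.util.*;

// Sketch of a frequency-based extractive summarizer: score each sentence
// by the document-wide frequency of its words, then keep the top-scoring
// sentences in their original order. Illustrative only.
public class NaiveSummarizer {

    public static String summarize(String text, int maxSentences) {
        // Very naive sentence split on terminal punctuation.
        String[] sentences = text.split("(?<=[.!?])\\s+");

        // Count word frequencies over the whole document.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String w : s.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
            }
        }

        // Score each sentence as the sum of its word frequencies.
        int[] scores = new int[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            for (String w : sentences[i].toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) scores[i] += freq.get(w);
            }
        }

        // Take the indices of the top-scoring sentences, then restore
        // document order with a sorted set.
        Integer[] order = new Integer[sentences.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> scores[b] - scores[a]);
        int n = Math.min(maxSentences, sentences.length);
        Set<Integer> keep = new TreeSet<>(Arrays.asList(order).subList(0, n));

        StringBuilder sb = new StringBuilder();
        for (int i : keep) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(sentences[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String doc = "Lucene is a search library. Lucene indexes text quickly. The weather is nice.";
        System.out.println(summarize(doc, 1));
    }
}
```

Note how this also explains Maurits's point: the summarizer needs whole sentences, so it has to run on the raw text before tokenization throws the sentence boundaries away.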
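[Editor's note: the two-stage scheme Gregor describes, a tokenizer that emits punctuation as tokens followed by a filter that strips it again for downstream consumers, can be sketched in plain Java as below. This deliberately does not use Lucene's TokenStream API; the names are hypothetical.]

```java
import java.util.*;
import java.util.regex.*;

// Stage 1 emits sentence punctuation as tokens so a summarizer can see
// sentence boundaries; stage 2 strips those tokens again so the rest of
// the filter chain sees a normal word stream. Names are illustrative.
public class PunctuationPipeline {

    // Stage 1: emit words and sentence-level punctuation as separate tokens.
    public static List<String> tokenizeKeepingPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[.,!?;]").matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    // Stage 2: what the hypothetical SummaryFilter would pass on to its
    // successors -- the same stream with the punctuation tokens removed.
    public static List<String> stripPunctuation(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.matches("\\w+")) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> withPunct = tokenizeKeepingPunctuation("Hello world. Bye!");
        System.out.println(withPunct);
        System.out.println(stripPunctuation(withPunct));
    }
}
```

The document-level problem Gregor raises remains: a filter only ever sees one token at a time, so a summarizer built this way must buffer an entire document's tokens before it can emit anything, which is exactly the awkwardness he describes.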
