Gregor, I don't have any benchmarks for summarization. Sorry! I have two test versions of commercial summarizers, and their performance is better than Classifier4J's, but those are written in C++, so you can't compare them properly.
regards, Maurits

----- Original Message -----
From: "Gregor Heinrich" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, December 16, 2003 9:35 PM
Subject: RE: Summarization; sentence-level and document-level filters.

> Maurits: thanks for the hint to Classifier4J -- I have had a look at this
> package and tried the SimpleSummarizer, and it seems to work fine.
> (However, as I don't know the benchmarks for summarization, I'm not the
> one to judge.)
>
> Do you have experience with it?
>
> Gregor
>
> -----Original Message-----
> From: maurits van wijland [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 16, 2003 1:09 AM
> To: Lucene Users List; [EMAIL PROTECTED]
> Subject: Re: Summarization; sentence-level and document-level filters.
>
> Hi Gregor,
>
> So far as I know, there is no summarizer in the plans. But maybe I can
> help you along the way. Have a look at the Classifier4J project on
> SourceForge:
>
> http://classifier4j.sourceforge.net/
>
> It has a small document summarizer besides a Bayes classifier. It might
> speed up your coding.
>
> On the level of Lucene, I have no idea. My gut feeling says that a
> summary should be built before the text is tokenized! The tokenizer can
> of course be used when analysing a document, but hooking into the Lucene
> indexing is a bad idea, I think.
>
> Does anyone else have any ideas?
>
> regards,
>
> Maurits
>
> ----- Original Message -----
> From: "Gregor Heinrich" <[EMAIL PROTECTED]>
> To: "'Lucene Users List'" <[EMAIL PROTECTED]>
> Sent: Monday, December 15, 2003 7:41 PM
> Subject: Summarization; sentence-level and document-level filters.
>
> > Hi,
> >
> > is there any possibility to do sentence-level or document-level
> > analysis with the current Analysis/TokenStream architecture? Or where
> > else is the best place to plug in customised document-level and
> > sentence-level analysis features? Is there any "precedence case"?
> >
> > My technical problem:
> >
> > I'd like to include a summarization feature in my system, which should
> > (1) make the best use of the architecture already there in Lucene, and
> > (2) be able to trigger summarization on a per-document basis while
> > requiring sentence-level information, such as full stops and commas.
> > To preserve this "punctuation", a special Tokenizer can be used that
> > outputs such landmarks as tokens instead of filtering them out. The
> > actual SummaryFilter then filters out the punctuation for its
> > successors in the Analyzer's filter chain.
> >
> > The other, more complex thing is the document-level information: as
> > Lucene's architecture uses a filter concept that does not know about
> > the document the tokens are generated from (which is good
> > abstraction), a document-specific operation like summarization is a
> > bit of an awkward thing with this (and originally not intended, I
> > guess). On the other hand, I'd like to have the existing filter
> > structure in place for preprocessing of the input, because my raw
> > texts are generated by converters from other formats that output
> > unwanted chars (from figures, page numbers, etc.), which are filtered
> > out anyway by my custom Analyzer.
> >
> > Any idea how to solve this second problem? Is there any support for
> > such document/sentence structure analysis planned?
> >
> > Thanks and regards,
> >
> > Gregor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
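[Editor's note: the extractive, word-frequency approach that summarizers in the style of Classifier4J's SimpleSummarizer typically take can be sketched in a few lines of plain Java. The class and method names below are illustrative only; this is not Classifier4J's actual code or API.]

```java
import java.util.*;

// Sketch of a frequency-based extractive summarizer: score each sentence
// by the document-wide frequency of its words, then keep the top-scoring
// sentences in their original order. Illustrative only.
public class NaiveSummarizer {

    public static String summarize(String text, int maxSentences) {
        // Very naive sentence split on terminal punctuation.
        String[] sentences = text.split("(?<=[.!?])\\s+");

        // Count word frequencies over the whole document.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String w : s.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
            }
        }

        // Score each sentence as the sum of its word frequencies.
        int[] scores = new int[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            for (String w : sentences[i].toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) scores[i] += freq.get(w);
            }
        }

        // Take the indices of the top-scoring sentences, then restore
        // document order with a sorted set.
        Integer[] order = new Integer[sentences.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> scores[b] - scores[a]);
        int n = Math.min(maxSentences, sentences.length);
        Set<Integer> keep = new TreeSet<>(Arrays.asList(order).subList(0, n));

        StringBuilder sb = new StringBuilder();
        for (int i : keep) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(sentences[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String doc = "Lucene is a search library. Lucene indexes text quickly. The weather is nice.";
        System.out.println(summarize(doc, 1));
    }
}
```

Note how this also explains Maurits's point: the summarizer needs whole sentences, so it has to run on the raw text before tokenization throws the sentence boundaries away.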
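[Editor's note: the two-stage scheme Gregor describes, a tokenizer that emits punctuation as tokens followed by a filter that strips it again for downstream consumers, can be sketched in plain Java as below. This deliberately does not use Lucene's TokenStream API; the names are hypothetical.]

```java
import java.util.*;
import java.util.regex.*;

// Stage 1 emits sentence punctuation as tokens so a summarizer can see
// sentence boundaries; stage 2 strips those tokens again so the rest of
// the filter chain sees a normal word stream. Names are illustrative.
public class PunctuationPipeline {

    // Stage 1: emit words and sentence-level punctuation as separate tokens.
    public static List<String> tokenizeKeepingPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[.,!?;]").matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    // Stage 2: what the hypothetical SummaryFilter would pass on to its
    // successors -- the same stream with the punctuation tokens removed.
    public static List<String> stripPunctuation(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.matches("\\w+")) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> withPunct = tokenizeKeepingPunctuation("Hello world. Bye!");
        System.out.println(withPunct);
        System.out.println(stripPunctuation(withPunct));
    }
}
```

The document-level problem Gregor raises remains: a filter only ever sees one token at a time, so a summarizer built this way must buffer an entire document's tokens before it can emit anything, which is exactly the awkwardness he describes.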
