Maurits: thanks for the hint to Classifier4J -- I have had a look at this package and tried the SimpleSummarizer, and it seems to work fine. (However, as I don't know the benchmarks for summarization, I'm not the one to judge.)
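For readers following along: the SimpleSummarizer mentioned above is extractive -- it picks whole sentences from the document rather than generating new text. Below is a minimal, self-contained sketch of that general frequency-based approach (score each sentence by how often its words occur in the whole document, keep the top sentences in original order). The class and method names here are illustrative plain Java, not Classifier4J's actual API.

```java
import java.util.*;

public class NaiveSummarizer {

    public static String summarize(String text, int numSentences) {
        // Split into sentences at ., ! or ? followed by whitespace.
        String[] sentences = text.split("(?<=[.!?])\\s+");

        // Count word frequencies across the whole document
        // (words of length <= 3 act as a crude stop-word filter).
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.length() > 3) {
                freq.merge(word, 1, Integer::sum);
            }
        }

        // Score each sentence by the summed frequency of its words.
        int[] scores = new int[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            for (String word : sentences[i].toLowerCase().split("\\W+")) {
                scores[i] += freq.getOrDefault(word, 0);
            }
        }

        // Take the indices of the top-scoring sentences, then re-sort
        // them into document order so the summary reads naturally.
        Integer[] order = new Integer[sentences.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> scores[b] - scores[a]);
        Set<Integer> keep = new TreeSet<>(Arrays.asList(order)
                .subList(0, Math.min(numSentences, sentences.length)));

        StringBuilder summary = new StringBuilder();
        for (int i : keep) {
            if (summary.length() > 0) summary.append(' ');
            summary.append(sentences[i]);
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        String text = "Lucene indexes documents fast. The weather is nice. "
                + "Lucene documents contain fields. Fields hold indexed text.";
        System.out.println(summarize(text, 2));
    }
}
```

Note the sort comparator is descending by score; ties fall back to document order because Arrays.sort on objects is stable.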
Do you have experience with it?

Gregor

-----Original Message-----
From: maurits van wijland [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 1:09 AM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Summarization; sentence-level and document-level filters.

Hi Gregor,

So far as I know, there is no summarizer in the plans. But maybe I can help you along the way. Have a look at the Classifier4J project on SourceForge:

http://classifier4j.sourceforge.net/

It has a small document summarizer besides a Bayes classifier. It might speed up your coding.

On the level of Lucene, I have no idea. My gut feeling says that a summary should be built before the text is tokenized! The tokenizer can of course be used when analysing a document, but hooking into the Lucene indexing is a bad idea, I think.

Does anyone else have any ideas?

regards,

Maurits

----- Original Message -----
From: "Gregor Heinrich" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, December 15, 2003 7:41 PM
Subject: Summarization; sentence-level and document-level filters.

> Hi,
>
> Is there any possibility to do sentence-level or document-level analysis
> with the current Analysis/TokenStream architecture? Or where else is the
> best place to plug in customised document-level and sentence-level
> analysis features? Is there any "precedence case"?
>
> My technical problem:
>
> I'd like to include a summarization feature in my system, which should
> (1) make the best use of the architecture already there in Lucene, and
> (2) be able to trigger summarization on a per-document basis while
> requiring sentence-level information, such as full stops and commas. To
> preserve this "punctuation", a special Tokenizer can be used that outputs
> such landmarks as tokens instead of filtering them out. The actual
> SummaryFilter then filters out the punctuation for its successors in the
> Analyzer's filter chain.
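The "punctuation as tokens" idea in the quoted paragraph above can be sketched in a few lines of plain Java: a tokenizer that emits sentence landmarks (. , ! ?) as their own tokens instead of discarding them, plus a later filter stage that drops the landmarks once the summary step has seen them. The names here are illustrative; this is not Lucene's actual Tokenizer/TokenFilter API.

```java
import java.util.*;
import java.util.regex.*;

public class PunctuationTokenizer {

    // A token is either a run of word characters or a single landmark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[.,!?]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Stand-in for the "SummaryFilter" stage: consume the landmarks,
    // then drop them before passing tokens on down the chain.
    public static List<String> stripLandmarks(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!t.matches("[.,!?]")) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> toks = tokenize("Hello, world. Bye!");
        System.out.println(toks);                 // [Hello, ,, world, ., Bye, !]
        System.out.println(stripLandmarks(toks)); // [Hello, world, Bye]
    }
}
```

Downstream filters that only expect word tokens never see the punctuation, which is what lets the rest of an existing filter chain stay unchanged.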
>
> The other, more complex thing is the document-level information: as
> Lucene's architecture uses a filter concept that does not know about the
> document the tokens are generated from (which is good abstraction), a
> document-specific operation like summarization is a bit awkward with it
> (and originally not intended, I guess). On the other hand, I'd like to
> keep the existing filter structure in place for preprocessing of the
> input, because my raw texts are generated by converters from other
> formats that output unwanted chars (from figures, page numbers, etc.),
> which are filtered out anyway by my custom Analyzer.
>
> Any idea how to solve this second problem? Is there any support for such
> document/sentence structure analysis planned?
>
> Thanks and regards,
>
> Gregor
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> ---------------------------------------------------------------------
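The pre-tokenization ordering Maurits suggests upthread (clean the converter output, summarize the raw text, and only then hand the fields to the analyzer) can be sketched like this. The Map stands in for a Lucene Document, and the clean() and summarize() helpers are hypothetical stand-ins for the converter clean-up and a real summarizer such as Classifier4J's.

```java
import java.util.*;

public class IndexingPipeline {

    // Stand-in for the converter clean-up step: strip unwanted chars
    // (form feeds, stray NULs, etc.) and collapse whitespace before
    // anything else sees the text.
    static String clean(String raw) {
        return raw.replaceAll("[\\f\\u0000]", " ")
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    // Stand-in for a real summarizer: here we just keep the first
    // sentence so the example stays self-contained.
    static String summarize(String text) {
        int end = text.indexOf('.');
        return end >= 0 ? text.substring(0, end + 1) : text;
    }

    // Build the "document": the document-level step (summarization)
    // runs first, on the whole cleaned text; the body field is then
    // free to go through the normal analyzer chain at index time.
    public static Map<String, String> buildDocument(String raw) {
        String body = clean(raw);
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("summary", summarize(body));
        doc.put("body", body);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc =
                buildDocument("First point.\u0000  Second point follows.");
        System.out.println(doc.get("summary")); // First point.
    }
}
```

Because the summary is computed before tokenization, the token filters never need document-level context, which sidesteps the abstraction problem Gregor describes.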
