Can you share a description of the heuristics you used to clean up the text? I am facing the same problem right now with email. I'm less interested in the rules themselves than in the tools you use to implement them.
Herb....

-----Original Message-----
From: Ulrich Mayring [mailto:[EMAIL PROTECTED]
Sent: Friday, November 28, 2003 4:21 AM
To: [EMAIL PROTECTED]
Subject: Re: New Lucene-powered Website

This "clean-up work" is actually trickier than the summarising itself and it is usually very domain-specific. That's the reason why I haven't proposed to contribute the summariser to Lucene, because the clean-up code is not generic. The summariser itself is just one class with 300 lines, but without prior clean-up the quality of its summaries is insufficient.
