Can you share a description of the heuristics you used to clean up the text? I am facing the same problem right now handling email. I'm less interested in the rules you use than in the tools you use to implement them.
The tools... well, Java ;-)
The search engine is a custom Java application that uses Lucene. The heuristics are not very general at this point; they are tailored to our domain. So what you are hinting at (a generic rule description language that can be customized to the local domain) seems appropriate. Our rules are things like "anything within <h1>...</h1> is an important sentence, and we add a full stop at the end".
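To give an idea of how a rule like the <h1> one could be written in plain Java, here is a minimal sketch. It is an illustration only, not the actual code from the application: the class name HeadingRule and the method apply are made up for this example, and it assumes a simple regex is good enough for the markup at hand (a real HTML parser would be more robust).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not part of Lucene or of the application described above.
final class HeadingRule {
    // Matches <h1 ...>...</h1>, ignoring case and allowing line breaks inside the heading.
    private static final Pattern H1 =
        Pattern.compile("<h1\\b[^>]*>(.*?)</h1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Replaces each <h1> element with its text, terminated by a full stop
    // so that a downstream sentence splitter treats the heading as one sentence.
    static String apply(String html) {
        Matcher m = H1.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String heading = m.group(1).trim();
            if (!heading.isEmpty()) {
                char last = heading.charAt(heading.length() - 1);
                // Only add a full stop if the heading does not already end a sentence.
                if (last != '.' && last != '!' && last != '?') {
                    heading = heading + ".";
                }
            }
            // quoteReplacement keeps '$' and '\' in the heading from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(heading + " "));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling HeadingRule.apply on each document before it is handed to the indexer would be one way to plug in such a rule; several rules of this kind could be chained in the same fashion.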
Ulrich
