We index the content from HTML files and because we only want the "good" text and do not care about the structure, well-formedness, etc we went with regular expressions similar to what Luke Shannon offered.
Only real difference being that we firstly remove entire blocks of (script|style|csimport) and similar since the contents of those are not useful for keyword searching, and afterward just remove every leftover HTML tags. I have been meaning to add an expression to extract things like alt attribute text from <img> though. --Leto > -----Original Message----- > From: Karl Koch [mailto:[EMAIL PROTECTED] > > I have been following this thread and have another question. > > Is there a piece of sourcecode (which is preferably very > short and simple > (KISS)) which allows to remove all HTML tags from HTML > content? HTML 3.2 would be enough...also no frames, CSS, etc. > > I do not need to have the HTML strucutre tree or any other > structure but need a facility to clean up HTML into its > normal underlying content before indexing that content as a whole. > > Karl > > > > > > -----Original Message----- > > > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > > > Sent: Tuesday, February 01, 2005 1:15 AM > > > To: [email protected] > > > Subject: which HTML parser is better? > > > > > > Three HTML parsers(Lucene web application > > > demo,CyberNeko HTML Parser,JTidy) are mentioned in > > > Lucene FAQ > > > 1.3.27.Which is the best?Can it filter tags that are > > > auto-created by MS-word 'Save As HTML files' function? > > > CONFIDENTIALITY NOTICE AND DISCLAIMER Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
