On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:

Hello,

I have been following this thread and have another question.

Is there a piece of source code (preferably very short and simple
(KISS)) that removes all HTML tags from HTML content? HTML 3.2
would be enough; no frames, CSS, etc. are needed.


I do not need the HTML structure tree or any other structure, just
a facility to reduce HTML to its underlying text content before
indexing that content as a whole.



The code in the Lucene Sandbox for parsing HTML with JTidy (under contributions/ant) for the <index> task does what you ask.
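
The Sandbox code itself is not reproduced here. As a rough stand-alone sketch of the same idea, using a different technique (the HTML parser bundled with the JDK in javax.swing.text.html rather than JTidy), something like the following may be enough for simple HTML 3.2 pages. The class name HtmlTextExtractor and the method stripTags are hypothetical names chosen for illustration; they are not part of Lucene or the Sandbox.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene: collects only the text nodes
// reported by the JDK's built-in HTML parser and drops all tags.
public class HtmlTextExtractor {

    public static String stripTags(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of character data
        // it finds between tags; tags themselves are simply ignored here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate adjacent text runs
                }
                text.append(data);
            }
        };

        try {
            // The third argument tells the parser to ignore any charset
            // declaration in the document, since we already have characters.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

The returned string could then be added to a Lucene document as a single text field. This sketch makes no attempt to handle scripts, stylesheets, or badly broken markup, which is where a tidying parser such as JTidy is likely to do better.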


        Erik

