Kauler, Leto S wrote:
Another very cheap but robust solution, if you are on Linux, is to let lynx parse your pages:
lynx -dump page.html > page.txt
This strips out all HTML, including the script, style, and csimport tags, and leaves you with a .txt file ready for indexing.
Best,
Sergiu
We index the content from HTML files, and because we only want the "good" text and do not care about structure, well-formedness, etc., we went with regular expressions similar to what Luke Shannon offered.
The only real difference is that we first remove entire blocks of (script|style|csimport) and similar, since their contents are not useful for keyword searching, and afterward remove all leftover HTML tags. I have been meaning to add an expression to extract things like alt attribute text from <img>, though.
--Leto
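
For illustration, the two-pass approach described above could be sketched roughly as follows. This is not the code used by anyone in the thread: it is written in Python for brevity, the names strip_html, BLOCK_RE, IMG_ALT_RE, TAG_RE and WS_RE are made up here, and the alt-text step, entity decoding, and whitespace collapsing are assumptions layered on top of what was described. Regular expressions are not a real HTML parser, so unusual markup (for example a ">" inside an attribute value) can still slip through.

```python
import html
import re

# Pass 1: whole blocks whose contents are useless for keyword search.
# The tag names come from the thread; extend the alternation as needed.
BLOCK_RE = re.compile(
    r"<(script|style|csimport)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)

# Optional step (assumption): keep the alt text of <img> tags.
IMG_ALT_RE = re.compile(
    r"""<img\b[^>]*?\balt\s*=\s*(["'])(.*?)\1[^>]*>""",
    re.IGNORECASE | re.DOTALL,
)

# Pass 2: any tag that is still left.
TAG_RE = re.compile(r"<[^>]+>")

# Runs of whitespace left behind by the substitutions.
WS_RE = re.compile(r"\s+")


def strip_html(markup: str) -> str:
    """Return the plain text of an HTML string, suitable for indexing."""
    text = BLOCK_RE.sub(" ", markup)                       # drop script/style blocks
    text = IMG_ALT_RE.sub(lambda m: f" {m.group(2)} ", text)  # keep alt text
    text = TAG_RE.sub(" ", text)                           # drop leftover tags
    text = html.unescape(text)                             # &amp; -> &, etc.
    return WS_RE.sub(" ", text).strip()


if __name__ == "__main__":
    page = (
        "<html><head><style>p {color: red}</style></head>"
        "<body><p>Hello <b>world</b> &amp; friends</p></body></html>"
    )
    print(strip_html(page))
```

Tags are replaced by a space rather than deleted outright so that words separated only by markup (adjacent table cells, for instance) do not run together in the index.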
-----Original Message-----
From: Karl Koch [mailto:[EMAIL PROTECTED]
I have been following this thread and have another question.
Is there a piece of source code (preferably very short and simple (KISS)) that removes all HTML tags from HTML content? HTML 3.2 would be enough, with no frames, CSS, etc.
I do not need the HTML structure tree or any other structure, just a facility to clean up HTML into its underlying text content before indexing that content as a whole.
Karl
> -----Original Message-----
> From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 1:15 AM
> To: [email protected]
> Subject: which HTML parser is better?
> Three HTML parsers (Lucene web application
> demo, CyberNeko HTML Parser, JTidy) are mentioned in
> Lucene FAQ
> 1.3.27. Which is the best? Can it filter tags that are
> auto-created by the MS Word 'Save As HTML files' function?
>
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
