Kauler, Leto S wrote:

Another very cheap, but robust solution in the case you use linux is to make lynx to parse your pages.

lynx page.html > page.txt.

This will strip out all html and script, style, csimport tags. And you will have a .txt file ready for indexing.

 Best,

 Sergiu

We index the content from HTML files and because we only want the "good"
text and do not care about the structure, well-formedness, etc we went
with regular expressions similar to what Luke Shannon offered.

Only real difference being that we firstly remove entire blocks of
(script|style|csimport) and similar since the contents of those are not
useful for keyword searching, and afterward just remove every leftover
HTML tags.  I have been meaning to add an expression to extract things
like alt attribute text from <img> though.

--Leto





-----Original Message-----
From: Karl Koch [mailto:[EMAIL PROTECTED]


I have been following this thread and have another question.

Is there a piece of sourcecode (which is preferably very short and simple
(KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2 would be enough...also no frames, CSS, etc.


I do not need to have the HTML strucutre tree or any other structure but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole.

Karl



> -----Original Message-----
> From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 1:15 AM
> To: [email protected]
> Subject: which HTML parser is better?
> > Three HTML parsers(Lucene web application
> demo,CyberNeko HTML Parser,JTidy) are mentioned in
> Lucene FAQ
> 1.3.27.Which is the best?Can it filter tags that are
> auto-created by MS-word 'Save As HTML files' function?
>



CONFIDENTIALITY NOTICE AND DISCLAIMER

Information in this transmission is intended only for the person(s) to whom it 
is addressed and may contain privileged and/or confidential information. If you 
are not the intended recipient, any disclosure, copying or dissemination of the 
information is unauthorised and you should delete/destroy all copies and notify 
the sender. No liability is accepted for any unauthorised use of the 
information contained in this transmission.

This disclaimer has been automatically added.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to