Karl Koch wrote:
if you are using only correctly formatted HTML pages and you are in control of these pages.
I am in control of the HTML, which means it is well-formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. from the web).
Are there any very short solutions for that?
You can use a regular expression to remove the tags.
Something like
replaceAll("<[^>]*>", "");
This is the idea behind the operation. If you search on Google you will find a more robust
regular expression.
Using a simple regular expression is a very cheap solution that can cause you a lot of problems in the future.
It's up to you whether to use it.
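For illustration, the regular-expression idea could be sketched roughly as below. This is only a minimal sketch: the class name TagStripper and the method stripTags are made up for this example, and the pattern <[^>]*> (a '<', then any run of characters other than '>', then a '>') is one simple choice, not the more robust expression mentioned above.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Matches '<', then any run of characters other than '>', then '>'.
    // Assumption: this simple pattern is good enough for well-formed input.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Removes everything that looks like a tag; entities such as &amp; are left untouched. */
    public static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(stripTags(html)); // prints "TitleSome bold text."
    }
}
```

Note that replacing tags with the empty string glues neighbouring words together ("TitleSome"). For indexing it may be preferable to replace each tag with a single space instead.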
Best,
Sergiu
Karl
Karl Koch wrote:
Hi,
yes, but the library you are using is quite big. I was thinking that a 5kB
piece of code could actually do that. That SourceForge project does much more than that, but I do not need it.
You need just htmlparser.jar (200k). You know, functionality is strongly correlated with size.
You can use 3 lines of code with a good regular expression to eliminate the HTML tags,
but this won't give you any guarantee that the text from badly formatted HTML files will be
correctly extracted.
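To make the "no guarantee" point concrete, here is a small sketch of inputs on which a simple tag-stripping pattern goes wrong. The class name RegexStripLimits and the pattern <[^>]*> are assumptions chosen for the illustration; other patterns fail in other ways.

```java
public class RegexStripLimits {
    // Same naive approach: drop anything between '<' and the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the text.
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));

        // Script bodies are not tags, so their source survives as "text".
        System.out.println(stripTags("<script>var x = 1;</script><p>Hi</p>"));

        // A stray '<' in the text swallows everything up to the next '>'.
        System.out.println(stripTags("<p>1 < 2 and 3 > 2</p>"));
    }
}
```

None of the three results is the text a reader would see in a browser, which is the kind of problem a real parser is meant to avoid.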
Best,
Sergiu
Karl
Hi Karl,
I already submitted a piece of code that removes the HTML tags. Search for my previous answer in this thread.
Best,
Sergiu
Karl Koch wrote:
Hello,
I have been following this thread and have another question.
Is there a piece of source code (which is preferably very short and
simple (KISS)) which allows removing all HTML tags from HTML content? HTML 3.2
would be enough... also no frames, CSS, etc.
I do not need to have the HTML structure tree or any other structure,
but need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.
Karl
I think that depends on what you want to do. The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML document
into a full DOM that you can manipulate easily for a wide range of
purposes. I haven't used JTidy at an API level and so don't know it as
well -- based on its UI, it appears to be focused primarily on HTML validation
and error detection/correction.
I use CyberNeko for a range of operations on HTML documents that go beyond
indexing them in Lucene, and really like it. It has been robust for me so far.
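As a rough illustration of the DOM route, the sketch below pulls the text out of a parsed document. It does not use CyberNeko: it swaps in the XML parser that ships with the JDK, so it assumes the input is well-formed XML (as with HTML pages generated from XML), has no undeclared entities such as &nbsp;, and has no DOCTYPE that would need to be fetched. The class name DomTextExtractor is made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomTextExtractor {
    /** Parses well-formed markup into a DOM and returns the concatenated text nodes. */
    public static String extractText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Assumption: the input is well-formed; malformed HTML makes parse() throw.
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extractText(page)); // prints "TitleSome bold text."
    }
}
```

Because the parser understands attributes, an input such as <a title="a > b">link</a> yields just "link". For HTML that is not well-formed, an HTML-aware parser such as CyberNeko would be needed instead.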
Chuck
-----Original Message-----
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 01, 2005 1:15 AM
To: [email protected]
Subject: which HTML parser is better?
Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by the MS Word 'Save As HTML files' function?
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
