Karl Koch wrote:
Unfortunately I am faithful ;-). Just for practical reasons I want to do that
in a single class or even method called by another part in my Java
application. It should also run on Java 1.1 and it should be small and
simple. As I said before, I am in control of the HTML and it will be well
formatted, because I generate it from XML using XSLT.
Karl
Why don't you get the data directly from the XML files?
You can use a SAX parser, ... but I think it will require Java 1.3 or at least 1.2.2.
Best,
Sergiu
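A rough sketch of the SAX idea mentioned above, assuming the input is well-formed XML or XHTML with no DOCTYPE and no HTML-only named entities such as `&nbsp;`. It uses the JAXP classes bundled with current JDKs, so it does not meet the Java 1.1 requirement; the class and method names (`SaxTextExtractor`, `extractText`) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects only the character data of a well-formed document.
final class SaxTextExtractor {

    static String extractText(String xml) {
        final StringBuilder text = new StringBuilder();

        // The handler ignores element events and keeps only text content.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Malformed markup ends up here; a real parser-based approach
            // would need an HTML-tolerant parser instead.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return text.toString();
    }
}
```

Calling `SaxTextExtractor.extractText("<p>Hello <b>world</b></p>")` should give back the text without the markup. Whitespace between elements is kept as-is, so indented source files will produce extra blank runs.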
If you are not married to Java: http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
Otis
--- sergiu gordea <[EMAIL PROTECTED]> wrote:
Karl Koch wrote:
I am in control of the HTML, which means it is well-formatted HTML. I use
only HTML files which I have transformed from XML. No external HTML (e.g.
the web).
Are there any very short solutions for that?
Karl
If you are using only correctly formatted HTML pages and you are in
control of these pages,
you can use a regular expression to remove the tags,
something like replaceAll("<[^>]*>", "");
This is the idea behind the operation. If you search on Google
you will find a more robust
regular expression.
Using a simple regular expression is a very cheap solution that
can cause you a lot of problems in the future.
It's up to you to use it ....
Best,
Sergiu
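A minimal sketch of the regular-expression approach described above. The pattern `<[^>]*>` (a `<`, any run of characters that are not `>`, then `>`) is an assumption, one common choice rather than the "more robust" expression the post alludes to. Note that `String.replaceAll` only exists from Java 1.4 on, so this does not run on Java 1.1. The class name `RegexTagStripper` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes anything that looks like <...> from a string.
final class RegexTagStripper {

    // Compiled once; matches '<', then any characters except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

For clean, generated markup this is enough: `stripTags("<p>Hello <b>world</b></p>")` yields `Hello world`. The "problems in the future" show up as soon as a `>` appears inside an attribute value, e.g. `<a title="a>b">x</a>` comes out as `b">x`, and entities such as `&amp;` are left undecoded.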
Karl Koch wrote:
Hi,
yes, but the library you are using is quite big. I was thinking
that a 5kB
code could actually do that. That sourceforge project is doing
much more than that but I do not need it.
Karl
You need just the htmlparser.jar, 200k.
... you know ... the functionality is strongly correlated with the
size. You can use 3 lines of code with a good regular expression to
eliminate the HTML tags,
but this won't give you any guarantee that the text from badly formatted HTML files will be
correctly extracted...
Best,
Sergiu
Hi Karl,
I already submitted a piece of code that removes the HTML tags. Search for my previous answer in this thread.
Best,
Sergiu
Karl Koch wrote:
Hello,
I have been following this thread and have another question.
Is there a piece of source code (which is preferably very short
and simple (KISS)) which allows removing all HTML tags from HTML content?
HTML 3.2 would be enough...also no frames, CSS, etc.
I do not need to have the HTML structure tree or any other
structure, but need a facility to clean up HTML into its normal underlying
content before indexing that content as a whole.
Karl
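One way to read the request above (short, simple, no library, old JVM) is a single-pass character scan. The sketch below is an assumption about what such a helper could look like, not code from anyone in this thread. It uses only `String` and `StringBuffer`, both part of the core library since the earliest Java releases, and the class name `SimpleTagStripper` is made up.

```java
// Hypothetical helper: drops everything between '<' and the next '>'.
final class SimpleTagStripper {

    static String strip(String html) {
        StringBuffer out = new StringBuffer(html.length());
        boolean inTag = false;

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;           // tag closed, resume copying
            } else if (!inTag) {
                out.append(c);           // ordinary text, including a stray '>'
            }
        }
        return out.toString();
    }
}
```

It has the same limits as the regular-expression variant: a `>` inside an attribute value ends the tag early, comments and `<script>` bodies are not treated specially, and entities are not decoded. For HTML generated from XML via XSLT, where the markup is under your control, that may be acceptable.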
I think that depends on what you want to do. The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML document
into a full DOM that you can manipulate easily for a wide range of
purposes. I haven't used JTidy at an API level and so don't know it as
well -- based on its UI, it appears to be focused primarily on HTML
validation and error detection/correction.
I use CyberNeko for a range of operations on HTML documents that go beyond
indexing them in Lucene, and really like it. It has been robust for me so
far.
Chuck
-----Original Message-----
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?
Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML files' function?
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]