I am using Java 1.1 on a Sharp Zaurus PDA, so I have very tight memory
constraints. I do not think CPU performance is a big issue, though. But
other parts of my application use quite a lot of memory and
sometimes run short. I am therefore not looking for solutions which build up
tag trees etc., but rather for a solution which reads a stream of HTML and
transforms it into a stream of text.

I see your point about using an external program. However, I am not entirely
sure whether one is available on the device. It would also be much simpler to
have a 3-5 kB solution in Java, perhaps encapsulated in a single class, which
does the job without advanced libraries that need 100-200 kB of my internal
storage.
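
To make this concrete, here is a minimal sketch of the kind of class I have
in mind. The name HtmlStripper and its strip methods are made up for
illustration and do not come from any library. It assumes well-formed HTML in
which '<' and '>' occur only as tag delimiters, it does not decode entities
such as &amp;, and it does not treat comments or script blocks specially. It
uses only java.io classes that exist in Java 1.1, and the streaming method
keeps a single flag of state, so its memory use does not grow with the size
of the document.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical helper: removes tags from well-formed HTML in one pass. */
class HtmlStripper {

    /** Copies everything outside of <...> from in to out. */
    static void strip(Reader in, Writer out) throws IOException {
        boolean inTag = false;          // true between '<' and the next '>'
        int c;
        while ((c = in.read()) != -1) {
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                    out.write(' ');     // a tag becomes one space so words stay apart
                }
            } else if (c == '<') {
                inTag = true;           // start skipping the tag
            } else {
                out.write(c);           // ordinary text is copied unchanged
            }
        }
    }

    /** Convenience wrapper for small in-memory strings. */
    static String strip(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            // Not expected for in-memory streams; rethrow unchecked just in case.
            throw new RuntimeException(e.toString());
        }
        return out.toString();
    }
}
```

For files, the Reader passed in should probably be wrapped in a
BufferedReader, since the loop reads one character at a time.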
I hope this clarifies my situation.
Cheers,
Karl
> Karl Koch wrote:
>
> >Hello Sergiu,
> >
> >thank you for your help so far. I appreciate it.
> >
> >I am working with Java 1.1 which does not include regular expressions.
> >
> >
> Why are you using Java 1.1? Are you that limited in resources?
> What operating system do you use?
> I assume that you just need to index the HTML files, and that you need an
> html2txt conversion.
> If an external converter is a solution for you, you can use
> Runtime.getRuntime().exec(...) to run a converter that will extract the
> information from your HTML files
> and generate a .txt file. Then you can use a reader to index the text.
>
> As I told you before, the best solution depends on your constraints
> (time, effort, hardware, performance) and requirements :)
>
> Best,
>
> Sergiu
>
> >Your turn ;-)
> >Karl
> >
> >
> >
> >>Karl Koch wrote:
> >>
> >>
> >>
> >>>I am in control of the HTML, which means it is well-formatted HTML. I use
> >>>only HTML files which I have transformed from XML. No external HTML (e.g.
> >>>the web).
> >>>
> >>>Are there any very-short solutions for that?
> >>>
> >>>
> >>>
> >>>
> >>If you are using only correctly formatted HTML pages and you are in control
> >>of these pages,
> >>you can use a regular expression to remove the tags.
> >>
> >>Something like
> >>replaceAll("<[^>]*>", "");
> >>
> >>This is the idea behind the operation. If you search on Google you
> >>will find a more robust
> >>regular expression.
> >>
> >>Using a simple regular expression is a very cheap solution that
> >>can cause you a lot of problems in the future.
> >>
> >> It's up to you to use it ....
> >>
> >> Best,
> >>
> >> Sergiu
> >>
> >>
> >>
> >>>Karl
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>Karl Koch wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>>yes, but the library you are using is quite big. I was thinking that a 5kB
> >>>>>code could actually do that. That sourceforge project is doing much more
> >>>>>than that but I do not need it.
> >>>>>
> >>>>You need just the htmlparser.jar (200k).
> >>>>... you know ... the functionality is strongly correlated with the size.
> >>>>
> >>>> You can use 3 lines of code with a good regular expression to eliminate
> >>>>the HTML tags,
> >>>>but this won't give you any guarantee that the text from badly
> >>>>formatted HTML files will be
> >>>>correctly extracted...
> >>>>
> >>>> Best,
> >>>>
> >>>> Sergiu
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>Karl
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>Hi Karl,
> >>>>>>
> >>>>>>I already submitted a piece of code that removes the HTML tags.
> >>>>>>Search for my previous answer in this thread.
> >>>>>>
> >>>>>>Best,
> >>>>>>
> >>>>>> Sergiu
> >>>>>>
> >>>>>>Karl Koch wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>Hello,
> >>>>>>>
> >>>>>>>I have been following this thread and have another question.
> >>>>>>>
> >>>>>>>Is there a piece of source code (preferably very short and simple
> >>>>>>>(KISS)) which allows removing all HTML tags from HTML content? HTML 3.2
> >>>>>>>would be enough... also no frames, CSS, etc.
> >>>>>>>
> >>>>>>>I do not need the HTML structure tree or any other structure, but
> >>>>>>>need a facility to clean up HTML into its normal underlying content before
> >>>>>>>indexing that content as a whole.
> >>>>>>>
> >>>>>>>Karl
> >>>>>>>
> >>>>>>>>I think that depends on what you want to do. The Lucene demo parser does
> >>>>>>>>simple mapping of HTML files into Lucene Documents; it does not give you a
> >>>>>>>>parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the
> >>>>>>>>same API; will likely become part of Xerces), and so maps an HTML document
> >>>>>>>>into a full DOM that you can manipulate easily for a wide range of
> >>>>>>>>purposes. I haven't used JTidy at an API level and so don't know it as well --
> >>>>>>>>based on its UI, it appears to be focused primarily on HTML validation and
> >>>>>>>>error detection/correction.
> >>>>>>>>
> >>>>>>>>I use CyberNeko for a range of operations on HTML documents that go beyond
> >>>>>>>>indexing them in Lucene, and really like it. It has been robust for me so
> >>>>>>>>far.
> >>>>>>>>
> >>>>>>>>Chuck
> >>>>>>>>
> >>>>>>>>>-----Original Message-----
> >>>>>>>>>From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> >>>>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
> >>>>>>>>>To: [email protected]
> >>>>>>>>>Subject: which HTML parser is better?
> >>>>>>>>>
> >>>>>>>>>Three HTML parsers (Lucene web application
> >>>>>>>>>demo, CyberNeko HTML Parser, JTidy) are mentioned in
> >>>>>>>>>Lucene FAQ
> >>>>>>>>>1.3.27. Which is the best? Can it filter tags that are
> >>>>>>>>>auto-created by the MS Word 'Save As HTML files' function?
> >>>>>>>>>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]