Hi,
 
Your attachment is empty; there is no Java source code in it.

Liao Xuefeng <[EMAIL PROTECTED]> wrote:  Hi all,
  I wrote my own HTML parser because it meets my requirements exactly and does
not depend on any third-party library. I'd like to share it (in the attachment).

  This class provides some static methods for HTML <-> text conversion:

  HtmlUtil.html2text(String html);
  HtmlUtil.text2html(String text);

and
  HtmlUtil.removeScriptTags(String html);
can remove script and ActiveX tags from HTML. This is useful for checking a
user's blog post before writing it to the database.
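
For readers without the attachment, here is a minimal sketch of how three static methods with these names might look. It is not the attached code: the class name HtmlUtilSketch, the regex-based approach, the handful of entities handled, and the set of tags treated as "script and ActiveX" (script, object, embed, applet) are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the HtmlUtil class described above.
public class HtmlUtilSketch {
    // Assumed set of "active" elements; removed together with their content.
    private static final Pattern ACTIVE_BLOCKS = Pattern.compile(
        "<(script|object|embed|applet)\\b[^>]*>.*?</\\1\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag, e.g. <p>, </div>, <a href="...">.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Drops script/ActiveX-style blocks and leaves all other markup alone.
    public static String removeScriptTags(String html) {
        return ACTIVE_BLOCKS.matcher(html).replaceAll("");
    }

    // Escapes plain text so it can be embedded in an HTML page.
    public static String text2html(String text) {
        return text.replace("&", "&amp;")   // must come first
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("\n", "<br>\n");
    }

    // Strips tags and decodes the same few entities text2html produces.
    public static String html2text(String html) {
        String s = removeScriptTags(html);
        s = s.replaceAll("(?i)<br\\s*/?>\\r?\\n?", "\n");
        s = ANY_TAG.matcher(s).replaceAll("");
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");     // must come last
    }
}
```

Regex filtering like this is easy to bypass (event-handler attributes, unclosed tags, nested fragments), so it should be treated as a first pass rather than a complete defence for user-submitted posts.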

Best regards,
  Xuefeng

http://www.crackj2ee.com

-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 22, 2006 11:30 PM
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction

John Wang wrote:
> Hi Xuefeng:
>
>     Can you please send me your htmlparser too?

Xuefeng, would it be possible to open source your parser?

Thanks

Michi
>
> thanks
>
> -John
>
> On 6/21/06, Daniel Noll  wrote:
>>
>> Simon Courtenage wrote:
>> > I also use htmlparser, which is rather good.  I've had to customize
>> > it, though, to parse strings containing html source rather than
>> > accept urls of resources to fetch etc.  Also it crashes on meta tags
>> > that don't have name attributes (something I discovered only a
>> > couple of days ago).
>>
>> Actually, it already accepts strings without modifying the library:
>>
>>     String htmlSource = "...";
>>     Parser parser = new Parser(new Lexer(htmlSource));
>>
>> I will have to watch out for those meta tags though.  Time to go test it.
>>
>> Daniel
>>
>>
>> --
>> Daniel Noll
>>
>> Nuix Pty Ltd
>> Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
>> Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902
>>
>> This message is intended only for the named recipient. If you are not 
>> the intended recipient you are notified that disclosing, copying, 
>> distributing or taking any action in reliance on the contents of this 
>> message or attachment is strictly prohibited.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>


--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Automatic signature:
Please use the bot services:
MSN bot: [EMAIL PROTECTED]
QQ bot: 443803193
blog: http://blog.csdn.net/accesine960
Domolo homepage: http://www.domolo.com
 









                