Hi Guenter, I think you are right. Although I haven't rerun the code, I checked the last URL extracted from that page, and it sits right in the middle of the page, so it seems the page has been truncated. Many thanks!
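If truncation is indeed the cause, the usual fix is to override the default in conf/nutch-site.xml. A minimal sketch, assuming the standard nutch-default.xml / nutch-site.xml override mechanism (per the description quoted below, any negative value disables truncation; use a larger positive value instead if unbounded downloads are a concern):

    <property>
      <name>http.content.limit</name>
      <!-- negative value: no truncation at all -->
      <value>-1</value>
    </property>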
On 2006-2-17, Guenter, Matthias <[EMAIL PROTECTED]> wrote:
>
> Hi Elwin
> Did you check the content limit?
> Otherwise the truncation occurs naturally, I guess.
>
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
>
> Kind regards
>
> Matthias
>
> -----Original Message-----
> From: Elwin [mailto:[EMAIL PROTECTED]]
> Sent: Friday, 17 February 2006 09:36
> To: [email protected]
> Subject: Re: extract links problem with parse-html plugin
>
> I have written a test class HtmlWrapper; here is some of the code:
>
> HtmlWrapper wrapper = new HtmlWrapper();
> Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
> String temp = new String(c.getContent());
> System.out.println(temp);
>
> wrapper.parseHttpContent(c); // collects all outlinks into an ArrayList
> ArrayList links = wrapper.getBlogLinks();
> for (int i = 0; i < links.size(); i++) {
>     String urlString = (String) links.get(i);
>     System.out.println(urlString);
> }
>
> I can only get a few of the links from that page.
>
> The URL is from a Chinese site; however, you can just skip the
> non-English content and look at the HTML elements.
>
> 2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>:
> >
> > Hi Elwin
> > Can you provide samples of the links and code that are not working,
> > and put them into JIRA?
> > Kind regards
> > Matthias
> >
> > -----Original Message-----
> > From: Elwin [mailto:[EMAIL PROTECTED]]
> > Sent: Fri 17.02.2006 08:51
> > To: [email protected]
> > Subject: extract links problem with parse-html plugin
> >
> > It seems that the parse-html plugin may not process many pages well:
> > I have found that it cannot extract all the valid links from a page
> > when I test it in my code.
> > I guess this may be caused by the style of the HTML page? When I
> > "view source" on an HTML page I parse, I see that some elements in
> > the source are broken up by extraneous whitespace. This situation is
> > quite common on the pages of large portal sites and news sites.

--
《盖世豪侠》 (The Final Combat) drew rave reviews and kept TVB's ratings
high, yet TVB, pleased as it was, still gave Stephen Chow no major roles.
He was never one to remain a small fish in a pond: once his comic talent
had shown itself, he would not settle for being sidelined, so he moved
into film and displayed his flair on the big screen. TVB had found a
swift steed and then lost it, and was left with nothing but regret.
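For reference, a minimal standalone check of the truncation hypothesis: a sketch in plain java.net with no Nutch classes (the URL is the test page from the thread), fetching the page and reporting whether its raw size exceeds the 65536-byte default, in which case links past the 64 KB mark would be cut off before parse-html ever sees them.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class ContentSizeCheck {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://blog.sina.com.cn/lm/hot/index.html");
            InputStream in = url.openStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);   // accumulate the raw page bytes
            }
            in.close();
            System.out.println("Raw page size: " + buf.size() + " bytes");
            if (buf.size() > 65536) {
                // Anything beyond this point would be dropped under the
                // default http.content.limit, so later links never reach
                // the parser.
                System.out.println("Exceeds the default 65536-byte limit.");
            }
        }
    }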
