Hi Guenter, I think you are right. Although I haven't rerun the code, I checked the last URL extracted from that page, and it sits right in the middle of the page, so it seems the page has been truncated. Many thanks!
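If truncation is indeed the cause, the usual fix is to override the default in conf/nutch-site.xml. A minimal sketch, assuming the standard nutch-default.xml / nutch-site.xml override mechanism (per the description quoted below, any negative value disables truncation; use a larger positive value instead if unbounded downloads are a concern):

    <property>
      <name>http.content.limit</name>
      <!-- negative value: no truncation at all -->
      <value>-1</value>
    </property>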
On 2006-2-17, Guenter, Matthias <[EMAIL PROTECTED]> wrote:
>
> Hi Elwin
> Did you check the content limit?
> Otherwise the truncation occurs naturally, I guess.
>
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
>
> Kind regards
>
> Matthias
>
> -----Original Message-----
> From: Elwin [mailto:[EMAIL PROTECTED]]
> Sent: Friday, 17 February 2006 09:36
> To: [email protected]
> Subject: Re: extract links problem with parse-html plugin
>
> I have written a test class HtmlWrapper; here is some of the code:
>
> HtmlWrapper wrapper = new HtmlWrapper();
> Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
> String temp = new String(c.getContent());
> System.out.println(temp);
>
> wrapper.parseHttpContent(c); // collects all outlinks into an ArrayList
> ArrayList links = wrapper.getBlogLinks();
> for (int i = 0; i < links.size(); i++) {
>     String urlString = (String) links.get(i);
>     System.out.println(urlString);
> }
>
> I can only get a few of the links from that page.
>
> The URL is from a Chinese site; however, you can just skip the
> non-English content and look at the HTML elements.
>
> 2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>:
> >
> > Hi Elwin
> > Can you provide samples of the links and code that are not working,
> > and put them into JIRA?
> > Kind regards
> > Matthias
> >
> > -----Original Message-----
> > From: Elwin [mailto:[EMAIL PROTECTED]]
> > Sent: Fri 17.02.2006 08:51
> > To: [email protected]
> > Subject: extract links problem with parse-html plugin
> >
> > It seems that the parse-html plugin may not process many pages well:
> > I have found that it cannot extract all the valid links from a page
> > when I test it in my code.
> > I guess this may be caused by the style of the HTML page? When I
> > "view source" on an HTML page I parse, I see that some elements in
> > the source are broken up by extraneous whitespace. This situation is
> > quite common on the pages of large portal sites and news sites.

--
《盖世豪侠》 (The Final Combat) drew rave reviews and kept TVB's ratings
high, yet TVB, pleased as it was, still gave Stephen Chow no major roles.
He was never one to remain a small fish in a pond: once his comic talent
had shown itself, he would not settle for being sidelined, so he moved
into film and displayed his flair on the big screen. TVB had found a
swift steed and then lost it, and was left with nothing but regret.
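For reference, a minimal standalone check of the truncation hypothesis: a sketch in plain java.net with no Nutch classes (the URL is the test page from the thread), fetching the page and reporting whether its raw size exceeds the 65536-byte default, in which case links past the 64 KB mark would be cut off before parse-html ever sees them.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class ContentSizeCheck {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://blog.sina.com.cn/lm/hot/index.html");
            InputStream in = url.openStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);   // accumulate the raw page bytes
            }
            in.close();
            System.out.println("Raw page size: " + buf.size() + " bytes");
            if (buf.size() > 65536) {
                // Anything beyond this point would be dropped under the
                // default http.content.limit, so later links never reach
                // the parser.
                System.out.println("Exceeds the default 65536-byte limit.");
            }
        }
    }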
