Yes, it's true, although it's not the cause of my problem.

On 06-2-20, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
>
> Hello,
> One more thing to check:
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a
>   page.
>   </description>
> </property>
>
> Regards
> Piotr
>
> Guenter, Matthias wrote:
> > Hi Elwin
> > Did you check the content limit? Otherwise the truncation occurs
> > naturally, I guess.
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>65536</value>
> >   <description>The length limit for downloaded content, in bytes.
> >   If this value is nonnegative (>=0), content longer than it will be
> >   truncated; otherwise, no truncation at all.
> >   </description>
> > </property>
> >
> > Kind regards
> >
> > Matthias
> >
> > -----Original Message-----
> > From: Elwin [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, 17 February 2006 09:36
> > To: [email protected]
> > Subject: Re: extract links problem with parse-html plugin
> >
> > I wrote a test class HtmlWrapper; here is some of the code:
> >
> > HtmlWrapper wrapper = new HtmlWrapper();
> > Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
> > String temp = new String(c.getContent());
> > System.out.println(temp);
> >
> > wrapper.parseHttpContent(c); // collect all outlinks into an ArrayList
> > ArrayList links = wrapper.getBlogLinks();
> > for (int i = 0; i < links.size(); i++) {
> >     String urlString = (String) links.get(i);
> >     System.out.println(urlString);
> > }
> >
> > I can only get a few of the links from that page.
> >
> > The URL is from a Chinese site; however, you can just skip the
> > non-English content and look at the HTML elements.
> >
> > 2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>:
> >> Hi Elwin
> >> Can you provide samples of the failing links and your code, and put
> >> them into JIRA?
> >> Kind regards
> >> Matthias
> >>
> >> -----Original Message-----
> >> From: Elwin [mailto:[EMAIL PROTECTED]]
> >> Sent: Fri 17.02.2006 08:51
> >> To: [email protected]
> >> Subject: extract links problem with parse-html plugin
> >>
> >> It seems that the parse-html plugin may not process many pages well:
> >> I have found that it can't extract all the valid links in a page
> >> when I test it in my code.
> >> I guess it may be caused by the formatting of the HTML page? When I
> >> "view source" on a page I parse, I see that some elements in the
> >> source are broken up by stray spaces. This situation is quite common
> >> on the pages of large portal sites and news sites.
> >>
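One quick way to tell whether http.content.limit is the culprit, independent
of Nutch and of the HtmlWrapper class above: fetch the same URL with plain
JDK I/O, print the total size, and roughly count the <a> tags for comparison
with what parse-html returns. A minimal sketch (the class name and the
regex-based anchor count are illustrative only, not Nutch code):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TruncationCheck {
        public static void main(String[] args) throws Exception {
            // Fetch the page with plain JDK I/O, with no length limit at all.
            URL url = new URL("http://blog.sina.com.cn/lm/hot/index.html");
            InputStream in = url.openStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            in.close();

            byte[] content = buf.toByteArray();
            System.out.println("total bytes: " + content.length);

            // Rough count of anchor tags in the full content, to compare
            // with the number of outlinks the parse-html plugin reports.
            Matcher m = Pattern.compile("<a\\s", Pattern.CASE_INSENSITIVE)
                    .matcher(new String(content));
            int anchors = 0;
            while (m.find()) {
                anchors++;
            }
            System.out.println("approx. <a> tags: " + anchors);
        }
    }

If the byte count comes out above 65536, raising http.content.limit in your
Nutch configuration (or, per its description above, setting it to a negative
value to disable truncation entirely) should recover the missing links; note
that db.max.outlinks.per.page additionally caps extraction at 100 outlinks
per page by default.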
--
《盖世豪侠》 was acclaimed everywhere and kept TVB's ratings high, yet even
in its delight TVB still gave him no major roles. Stephen Chow was never one
to stay in a small pond: with his comic gift already evident, he would not
settle for being sidelined, so he moved to the film industry to shine on the
big screen. TVB had gained a swift steed only to lose it, and was left with
nothing but regret.
