Hello,

One more thing to check:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>
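
To see how these two limits could each hide links, here is a rough sketch in plain Java. It is not Nutch code: the class OutlinkLimitsSketch, its methods, and the regex-based link extraction are all made up for illustration, and the real parse-html plugin works differently. It only assumes that content is cut at a byte limit (as http.content.limit describes) and that the number of outlinks kept per page is capped (as db.max.outlinks.per.page describes).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from Nutch.
class OutlinkLimitsSketch {

    // Very naive anchor matcher, good enough for the demo page below.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Cut the content at contentLimit bytes; a negative limit means "no truncation".
    static byte[] truncate(byte[] content, int contentLimit) {
        if (contentLimit < 0 || content.length <= contentLimit) {
            return content;
        }
        return Arrays.copyOf(content, contentLimit);
    }

    // Collect href values in document order, keeping at most maxOutlinks of them.
    static List<String> extractLinks(byte[] content, int maxOutlinks) {
        String html = new String(content, StandardCharsets.UTF_8);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (links.size() < maxOutlinks && m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Build a page with n links, 36 bytes each.
    static byte[] samplePage(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("<a href=\"http://example.com/").append(i)
              .append("\">").append(i).append("</a>");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = samplePage(5);
        // Content limit of 100 bytes: the third link is cut in half and lost.
        System.out.println(extractLinks(truncate(page, 100), 100));
        // No content limit, but at most 3 outlinks are kept.
        System.out.println(extractLinks(truncate(page, -1), 3));
    }
}
```

If the page is longer than the content limit, or has more links than the outlink cap, you would see fewer links than "view source" shows, even when the parser itself is fine.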

Regards
Piotr

Guenter, Matthias wrote:
> Hi Elwin
> Did you check the content limit?
> If not, the truncation occurs naturally, I guess
>
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated;
>   otherwise, no truncation at all.
>   </description>
> </property>
>
> Kind regards
>
> Matthias
> -----Original Message-----
> From: Elwin [mailto:[EMAIL PROTECTED]
> Sent: Friday, 17 February 2006 09:36
> To: [email protected]
> Subject: Re: extract links problem with parse-html plugin
>
> I have written a test class, HtmlWrapper, and here is some code:
>
> HtmlWrapper wrapper = new HtmlWrapper();
> Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
> String temp = new String(c.getContent());
> System.out.println(temp);
>
> wrapper.parseHttpContent(c); // get all outlinks into an ArrayList
> ArrayList links = wrapper.getBlogLinks();
> for (int i = 0; i < links.size(); i++) {
>     String urlString = (String) links.get(i);
>     System.out.println(urlString);
> }
>
> I can only get a few of the links from that page.
>
> The URL is from a Chinese site; however, you can skip the non-English
> content and just look at the HTML elements.
>
> 2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>:
>> Hi Elwin
>> Can you provide samples of the links that do not work, and the code?
>> And put it into JIRA?
>> Kind regards
>> Matthias
>>
>>
>>
>> -----Original Message-----
>> From: Elwin [mailto:[EMAIL PROTECTED]
>> Sent: Fri 17.02.2006 08:51
>> To: [email protected]
>> Subject: extract links problem with parse-html plugin
>>
>> It seems that the parse-html plugin may not process many pages well,
>> because I have found that the plugin can't extract all valid links in
>> a page when I test it in my code.
>> I guess that it may be caused by the style of an HTML page? When I
>> "view source" of an HTML page I tried to parse, I saw that some
>> elements in the source are separated by unnecessary whitespace.
>> However, this is quite common in the pages of large portal sites or
>> news sites.
>>
>>
>
>
> --
> 《盖世豪侠》 won rave reviews and kept TVB's ratings high, yet TVB,
> pleased as it was, still did not give him major roles. Stephen Chow
> was never one to stay in a small pond; with his comic talent already
> showing, he naturally would not accept being sidelined, so he moved
> into film to show his flair on the big screen. TVB gained a
> thousand-li horse and then lost it, and of course regretted it too late.
>
