Yes, it's true, although it's not the cause of my problem.

On 06-2-20, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
>
> Hello,
> One more thing to check:
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>100</value>
>   <description>The maximum number of outlinks that we'll process for a
>   page.
>   </description>
> </property>
>
> Regards
> Piotr
>
> Guenter, Matthias wrote:
> > Hi Elwin
> > Did you check the content limit? Otherwise the truncation occurs
> > naturally, I guess.
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>65536</value>
> >   <description>The length limit for downloaded content, in bytes.
> >   If this value is nonnegative (>=0), content longer than it will be
> >   truncated; otherwise, no truncation at all.
> >   </description>
> > </property>
> >
> > Kind regards
> >
> > Matthias
> >
> > -----Original Message-----
> > From: Elwin [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, 17 February 2006 09:36
> > To: [email protected]
> > Subject: Re: extract links problem with parse-html plugin
> >
> > I wrote a test class HtmlWrapper; here is some of the code:
> >
> > HtmlWrapper wrapper = new HtmlWrapper();
> > Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
> > String temp = new String(c.getContent());
> > System.out.println(temp);
> >
> > wrapper.parseHttpContent(c); // collect all outlinks into an ArrayList
> > ArrayList links = wrapper.getBlogLinks();
> > for (int i = 0; i < links.size(); i++) {
> >     String urlString = (String) links.get(i);
> >     System.out.println(urlString);
> > }
> >
> > I can only get a few of the links from that page.
> >
> > The URL is from a Chinese site; however, you can just skip the
> > non-English content and look at the HTML elements.
> >
> > 2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>:
> >> Hi Elwin
> >> Can you provide samples of the failing links and your code, and put
> >> them into JIRA?
> >> Kind regards
> >> Matthias
> >>
> >> -----Original Message-----
> >> From: Elwin [mailto:[EMAIL PROTECTED]]
> >> Sent: Fri 17.02.2006 08:51
> >> To: [email protected]
> >> Subject: extract links problem with parse-html plugin
> >>
> >> It seems that the parse-html plugin may not process many pages well:
> >> I have found that it can't extract all the valid links in a page
> >> when I test it in my code.
> >> I guess it may be caused by the formatting of the HTML page? When I
> >> "view source" on a page I parse, I see that some elements in the
> >> source are broken up by stray spaces. This situation is quite common
> >> on the pages of large portal sites and news sites.
> >>
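One quick way to tell whether http.content.limit is the culprit, independent
of Nutch and of the HtmlWrapper class above: fetch the same URL with plain
JDK I/O, print the total size, and roughly count the <a> tags for comparison
with what parse-html returns. A minimal sketch (the class name and the
regex-based anchor count are illustrative only, not Nutch code):

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TruncationCheck {
        public static void main(String[] args) throws Exception {
            // Fetch the page with plain JDK I/O, with no length limit at all.
            URL url = new URL("http://blog.sina.com.cn/lm/hot/index.html");
            InputStream in = url.openStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            in.close();

            byte[] content = buf.toByteArray();
            System.out.println("total bytes: " + content.length);

            // Rough count of anchor tags in the full content, to compare
            // with the number of outlinks the parse-html plugin reports.
            Matcher m = Pattern.compile("<a\\s", Pattern.CASE_INSENSITIVE)
                    .matcher(new String(content));
            int anchors = 0;
            while (m.find()) {
                anchors++;
            }
            System.out.println("approx. <a> tags: " + anchors);
        }
    }

If the byte count comes out above 65536, raising http.content.limit in your
Nutch configuration (or, per its description above, setting it to a negative
value to disable truncation entirely) should recover the missing links; note
that db.max.outlinks.per.page additionally caps extraction at 100 outlinks
per page by default.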
--
《盖世豪侠》 was acclaimed everywhere and kept TVB's ratings high, yet even
in its delight TVB still gave him no major roles. Stephen Chow was never one
to stay in a small pond: with his comic gift already evident, he would not
settle for being sidelined, so he moved to the film industry to shine on the
big screen. TVB had gained a swift steed only to lose it, and was left with
nothing but regret.
