Re: AW: extract links problem with parse-html plugin

Piotr Kosiorowski Mon, 20 Feb 2006 01:48:00 -0800

Hello,
One more thing to check:
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>
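
To illustrate why this setting could hide links, here is a minimal Java sketch of a cap on the outlinks kept for one page. The class `OutlinkCapSketch` and the method `capOutlinks` are hypothetical names made up for this example and are not Nutch's actual code. Treating a negative value as "no limit" is also an assumption, since the description above does not say what a negative value means.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the Nutch implementation.
class OutlinkCapSketch {

    // Keep at most maxOutlinks links from a page.
    // Assumption: a negative maxOutlinks means "no limit".
    static List<String> capOutlinks(List<String> links, int maxOutlinks) {
        if (maxOutlinks < 0 || links.size() <= maxOutlinks) {
            return new ArrayList<>(links);
        }
        // Links beyond the cap are silently dropped.
        return new ArrayList<>(links.subList(0, maxOutlinks));
    }

    public static void main(String[] args) {
        // Pretend the parser found 250 links on one page.
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            links.add("http://example.com/page" + i);
        }
        List<String> kept = capOutlinks(links, 100);
        System.out.println("extracted=" + links.size() + " kept=" + kept.size());
    }
}
```

If a cap like this applies, a page with more than 100 links would appear to have lost links even though the parser found them all.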


Regards
Piotr
Guenter, Matthias wrote:
> Hi Elwin
> Did you check the content limit?
> Otherwise the truncation occurs naturally, I guess
> 
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated; otherwise, no truncation at all.
>   </description>
> </property>
> 
> Kind regards
> 
> Matthias
> -----Original Message-----
> From: Elwin [mailto:[EMAIL PROTECTED]
> Sent: Friday, 17 February 2006 09:36
> To: [email protected]
> Subject: Re: extract links problem with parse-html plugin
> 
> I have written a test class, HtmlWrapper, and here is some of the code:
> 
>   HtmlWrapper wrapper=new HtmlWrapper();
>   Content c=getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
>   String temp=new String(c.getContent());
>   System.out.println(temp);
> 
>   wrapper.parseHttpContent(c); // get all outlinks into an ArrayList
>   ArrayList links=wrapper.getBlogLinks();
>   for(int i=0;i<links.size();i++){
>    String urlString=(String)links.get(i);
>    System.out.println(urlString);
>   }
> 
> I can only get a few of the links from that page.
> 
> The URL is from a Chinese site; however, you can skip the non-English
> content and just look at the HTML elements.
> 
> 2006/2/17, Guenter, Matthias <[EMAIL PROTECTED]>:
>> Hi Elwin
>> Can you provide samples of the links that do not work, along with your
>> code, and put them into JIRA?
>> Kind regards
>> Matthias
>>
>>
>>
>> -----Original Message-----
>> From: Elwin [mailto:[EMAIL PROTECTED]
>> Sent: Fri 17.02.2006 08:51
>> To: [email protected]
>> Subject: extract links problem with parse-html plugin
>>
>> It seems that the parse-html plugin may not process many pages well,
>> because I have found that it can't extract all the valid links in a page
>> when I test it in my code.
>> I guess that it may be caused by the style of the HTML page? When I "view
>> source" on an HTML page I was trying to parse, I saw that some elements in
>> the source are broken up by unnecessary spaces. However, this is quite
>> common on the pages of large portal sites and news sites.
>>
>>
> 
> 
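> --
> "The Final Combat" won rave reviews and kept TVB's ratings high, yet
> TVB, pleased as it was, still did not give him major roles. Stephen Chow
> was no ordinary talent; with his comedic gift already showing, he would
> not accept being sidelined, so he moved to the film industry and showed
> his talent on the big screen. TVB had gained a fine steed and then lost
> it, and of course regretted it deeply.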
>
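
The effect of http.content.limit discussed above can be reproduced with a small, self-contained Java sketch. It builds a synthetic page, cuts it at 65536 bytes, and counts the links before and after. The class `ContentLimitSketch`, its methods, and the example.com URLs are hypothetical and made up for this illustration; the regex link counter is only a stand-in for a real HTML parser and is not how parse-html works.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the Nutch implementation.
class ContentLimitSketch {

    // Follow the documented rule: a nonnegative limit truncates longer
    // content, a negative limit means no truncation.
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit >= 0 && content.length > limit) {
            return Arrays.copyOf(content, limit);
        }
        return content;
    }

    // Count complete href="..." attributes with a simple regex.
    static int countLinks(byte[] content) {
        String html = new String(content, StandardCharsets.UTF_8);
        Matcher m = Pattern.compile("href=\"[^\"]+\"").matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Build a page that is well over 64 KB and contains 5000 links.
        StringBuilder sb = new StringBuilder("<html><body>\n");
        for (int i = 0; i < 5000; i++) {
            sb.append("<a href=\"http://example.com/post/").append(i).append("\">post</a>\n");
        }
        sb.append("</body></html>");
        byte[] full = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] cut = applyLimit(full, 65536);

        System.out.println("bytes: " + full.length + " -> " + cut.length);
        System.out.println("links: " + countLinks(full) + " -> " + countLinks(cut));
    }
}
```

A quick check along the same lines for the test class above would be to print c.getContent().length: if it equals the configured limit exactly, the page was probably truncated before parsing.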
