Hi,

afaics, Julien is right. It's possible to check it via:

bin/nutch parsechecker -Dhttp.content.limit=-1 -dumpText \
  'http://search.dangdang.com/?key=%CA%FD%BE%DD%BF%E2'

With -Dhttp.content.limit=65536 (also the default) the content
is truncated.
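
If you want the limit lifted permanently instead of per invocation,
the same property can be overridden in conf/nutch-site.xml (a sketch,
assuming the standard nutch-site.xml override mechanism; -1 means no
truncation at all):

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>No length limit for downloaded content.</description>
    </property>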

Best,
Sebastian



On 09/17/2014 11:32 AM, Julien Nioche wrote:
> Hi
> 
> Isn't that an effect of 
> 
>       <property> <name>http.content.limit</name>
>       <value>65536</value>
>       <description>The length limit for downloaded content using the http://
>       protocol, in bytes. If this value is nonnegative (>=0), content longer
>       than it will be truncated; otherwise, no truncation at all. Do not
>       confuse this setting with the file.content.limit setting.
>       </description>
>       </property>
> 
> 
> 
> I can't reproduce the problem as http://search.dangdang.com/ seems to be down.
> 
> Do you have another URL to illustrate the issue?
> 
> J.
> 
> On 16 September 2014 15:59, zeroleaf <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>         These days, while using Nutch, I found that if the Transfer-Encoding
>     is chunked, Nutch does not fetch the whole page, only part of it. Is this
>     expected behavior in Nutch, or is it a bug? If it is expected, how can I
>     configure Nutch to fetch the whole page?
> 
>     For example, add the url below to seed dir
> 
>     http://search.dangdang.com/?key=%CA%FD%BE%DD%BF%E2
> 
>     Then, looking at the fetched HTML in the content, you will find that it
>     is only part of the page. The versions I tested are Nutch 1.x (1.9 and
>     1.10).
> 
>     Thanks.
> 
> 
> 
> 
> -- 
> *
> *Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble