Hi,

afaics, Julien is right. You can check it via:

  bin/nutch parsechecker -Dhttp.content.limit=-1 -dumpText \
    'http://search.dangdang.com/?key=%CA%FD%BE%DD%BF%E2'

With -Dhttp.content.limit=65536 (the default) the content is truncated.
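For a real crawl (not just a parsechecker run), the same override can be made
permanent in conf/nutch-site.xml. A minimal sketch, assuming the usual
site-specific override file; per the property description quoted below, any
negative value disables truncation:

  <!-- conf/nutch-site.xml: override the 65536-byte default so that
       downloaded HTTP content is never truncated -->
  <configuration>
    <property>
      <name>http.content.limit</name>
      <value>-1</value>
    </property>
  </configuration>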
Best,
Sebastian

On 09/17/2014 11:32 AM, Julien Nioche wrote:
> Hi
>
> Isn't that an effect of
>
> <property>
>   <name>http.content.limit</name>
>   <value>65536</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> I can't reproduce the problem as http://search.dangdang.com/ seems to be
> down. Do you have another URL to illustrate the issue?
>
> J.
>
> On 16 September 2014 15:59, zeroleaf <[email protected]> wrote:
>
>     These days, while using Nutch, I found that if the Transfer-Encoding
>     is chunked, Nutch fetches only part of the page rather than the whole
>     page. Is this the intended behaviour in Nutch, or is it a bug? If it
>     is intended, how can I configure Nutch to fetch the whole page?
>
>     For example, add the URL below to the seed dir:
>
>     http://search.dangdang.com/?key=%CA%FD%BE%DD%BF%E2
>
>     then look at the fetched HTML in the content: it is only part of the
>     page. The versions I tested are Nutch 1.x (1.9 and 1.10).
>
>     Thanks.
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
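A quick way to confirm that the page really exceeds the 65536-byte default,
assuming the site is reachable again and curl is available:

  curl -s 'http://search.dangdang.com/?key=%CA%FD%BE%DD%BF%E2' | wc -c

If the reported byte count is above 65536, any fetch with the default
http.content.limit will be truncated, regardless of whether the response
uses chunked Transfer-Encoding.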

