Hi all,

I am running Nutch on my own laptop and I'd like to set a limit via the
(ftp|http).content.limit properties so that the crawl does not spend a long
time downloading huge files and possibly run into Java heap-size issues.
However, I wonder whether downloading files only partially (especially
compressed files, like zip, rar, etc.) can break the parsing and
deduplication steps, since the file is incomplete?
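
For reference, this is roughly what I plan to put in conf/nutch-site.xml
(just a sketch; the 1 MB value is a placeholder I picked, not a
recommendation):

  <property>
    <name>http.content.limit</name>
    <!-- truncate HTTP content longer than this many bytes -->
    <value>1048576</value>
  </property>
  <property>
    <name>ftp.content.limit</name>
    <!-- truncate FTP content longer than this many bytes -->
    <value>1048576</value>
  </property>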

Thanks,

Renxia
