Hi all,

I am running Nutch on my own laptop and I'd like to set a limit via the
(ftp|http).content.limit properties so that the crawl does not spend a long
time downloading huge files and possibly run into Java heap-size issues.
However, I wonder whether downloading files only partially (especially
compressed files, like zip, rar, etc.) can break the parsing and
deduplication steps, since the file is incomplete?
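
For reference, this is roughly what I plan to put in conf/nutch-site.xml
(just a sketch; the 1 MB value is a placeholder I picked, not a
recommendation):

  <property>
    <name>http.content.limit</name>
    <!-- truncate HTTP content longer than this many bytes -->
    <value>1048576</value>
  </property>
  <property>
    <name>ftp.content.limit</name>
    <!-- truncate FTP content longer than this many bytes -->
    <value>1048576</value>
  </property>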

Thanks,

Renxia
