I have made a patch for that purpose (https://issues.apache.org/jira/browse/NUTCH-1317). If you would like, you can set a limit per MIME type for nutch-2.1 in nutch-site.xml as follows:
Default limit property:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
</property>

For example, application/pdf:

<property>
  <name>http.content.limit.application.pdf</name>
  <value>1000</value>
</property>

For example, text/plain:

<property>
  <name>http.content.limit.text.plain</name>
  <value>1000</value>
</property>

...
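To make the naming convention concrete, here is a minimal Java sketch of how a per-MIME-type limit could be resolved from such properties. It is not the code from the patch. The class `ContentLimitSketch` and the helpers `keyFor` and `limitFor` are hypothetical names, a plain `Map` stands in for the Nutch configuration, and the only rule it assumes is the one visible in the examples above: the MIME type is appended to `http.content.limit` with `/` replaced by `.`, and the default property is used when no specific one is set.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ContentLimitSketch {
    // Name of the default limit property, as shown in the examples above.
    static final String BASE_KEY = "http.content.limit";
    // Assumed fallback when even the default property is missing.
    static final int FALLBACK_LIMIT = 65536;

    // Hypothetical helper: "application/pdf" -> "http.content.limit.application.pdf".
    static String keyFor(String mimeType) {
        String normalized = mimeType.trim().toLowerCase(Locale.ROOT).replace('/', '.');
        return BASE_KEY + "." + normalized;
    }

    // Hypothetical helper: returns the MIME-specific limit if one is configured,
    // otherwise the default limit.
    static int limitFor(Map<String, String> conf, String mimeType) {
        int defaultLimit = Integer.parseInt(
                conf.getOrDefault(BASE_KEY, String.valueOf(FALLBACK_LIMIT)).trim());
        if (mimeType == null || mimeType.trim().isEmpty()) {
            return defaultLimit;
        }
        String specific = conf.get(keyFor(mimeType));
        return specific != null ? Integer.parseInt(specific.trim()) : defaultLimit;
    }

    public static void main(String[] args) {
        // A plain map standing in for the values read from nutch-site.xml.
        Map<String, String> conf = new HashMap<>();
        conf.put("http.content.limit", "65536");
        conf.put("http.content.limit.application.pdf", "1000");
        conf.put("http.content.limit.text.plain", "1000");

        System.out.println(limitFor(conf, "application/pdf")); // 1000
        System.out.println(limitFor(conf, "text/html"));       // 65536 (default)
    }
}
```

With this scheme, any MIME type without its own property simply inherits the value of http.content.limit, so only the types that need a different limit have to be listed.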

