Prasanth Iyer created NUTCH-1892:
------------------------------------
Summary: Update the FileDumper tool to fetch only those URLs with
status db_fetched in nutch
Key: NUTCH-1892
URL: https://issues.apache.org/jira/browse/NUTCH-1892
Project: Nutch
Issue Type: Improvement
Components: nutchNewbie
Affects Versions: 2.2.1
Reporter: Prasanth Iyer
The FileDumper tool reads the crawled data from Nutch and dumps it back into
its raw files. It currently dumps every file irrespective of status,
duplicates, etc. As a result, files that were fetched in error, or that were
not fetched because the server made them unavailable, are also dumped.
The fix should be to dump only those files whose status in Nutch is
db_fetched.
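
A minimal sketch of the proposed filtering, for illustration only. It does not use the real FileDumper or Nutch classes: the DumpFilter class, the urlsToDump method, and the plain URL-to-status map are all hypothetical stand-ins, and the status is assumed to be available as the string name used in this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: picks the crawled URLs worth dumping.
class DumpFilter {
    // Status name taken from the issue text; how the real crawl data
    // exposes a record's status is an assumption of this sketch.
    static final String DB_FETCHED = "db_fetched";

    // Returns the URLs whose recorded status is db_fetched, in the
    // iteration order of the input map. Every other status is skipped.
    static List<String> urlsToDump(Map<String, String> statusByUrl) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> entry : statusByUrl.entrySet()) {
            if (DB_FETCHED.equals(entry.getValue())) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }
}
```

With a map holding one db_fetched entry and one entry with any other status (for example db_gone), only the db_fetched URL is returned, so the dump step would never write out content for pages that failed or were unavailable.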