Moreno Feltscher created NUTCH-2491:
---------------------------------------
Summary: Integrate sitemap processing and HostDB into crawl script
Key: NUTCH-2491
URL: https://issues.apache.org/jira/browse/NUTCH-2491
Project: Nutch
Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher
Priority: Minor
Add three new steps to the crawl bash script:
1. Generate HostDB from CrawlDB
2. Inject URLs from the sitemaps found for the hosts in the HostDB
3. If given, inject the sitemap URLs specified in one or more
configuration files
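
The three steps above could look roughly like the following in the crawl
script. This is only a sketch, not the actual patch: the function and
variable names (sitemap_steps, run, NUTCH_BIN, DRY_RUN) are made up for
illustration, and the updatehostdb / sitemap options shown are assumed from
the Nutch 1.x command usage, so check them against the usage output of
bin/nutch for your version. By default the sketch only prints the commands
instead of running them.

```shell
#!/usr/bin/env bash
# Sketch only: all names below are hypothetical, not taken from the patch.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"   # assumed location of the nutch launcher
DRY_RUN="${DRY_RUN:-1}"               # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1: crawl directory containing crawldb/ (hostdb/ is created next to it)
# $2: optional path to the file(s) listing sitemap URLs
sitemap_steps() {
  local crawl_path="$1"
  local sitemap_urls_path="${2:-}"

  # 1. Generate the HostDB from the CrawlDB
  run "$NUTCH_BIN" updatehostdb -crawldb "$crawl_path/crawldb" \
      -hostdb "$crawl_path/hostdb" || return 1

  # 2. Inject URLs from the sitemaps of the hosts in the HostDB
  run "$NUTCH_BIN" sitemap "$crawl_path/crawldb" \
      -hostdb "$crawl_path/hostdb" || return 1

  # 3. Only if given: inject the explicitly configured sitemap URLs
  if [ -n "$sitemap_urls_path" ]; then
    run "$NUTCH_BIN" sitemap "$crawl_path/crawldb" \
        -sitemapUrls "$sitemap_urls_path" || return 1
  fi
}
```

For example, "sitemap_steps crawl urls/sitemaps" would print three commands,
while "sitemap_steps crawl" skips the optional third step.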
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)