[ http://issues.apache.org/jira/browse/NUTCH-158?page=comments#action_12365483 ]
raghavendra prabhu commented on NUTCH-158:
------------------------------------------

This is an important thing. We should be able to automatically insert the links parsed out of a sitemap into the webdb.

But currently, if we enable parse-rss and crawl these links, don't they get added?

> Process Sitemap data in text, rss or xml format as well as OAI-PMH
> ------------------------------------------------------------------
>
>          Key: NUTCH-158
>          URL: http://issues.apache.org/jira/browse/NUTCH-158
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: byron miller
>     Priority: Minor
>
> Add support to the fetcher to look for sitemap files, download them and
> process them into webdb.
> Perhaps create a robots.txt directive that can be used to create a standard
> format for sitemaps in RSS, XML or text format (one line per url) and process
> that.
> I would love to see someone stomp on proprietary sitemap features or making
> things so google specific as they are today :)
> * RSS format/Atom Format (standard)
> * XML meta description
> * OAI-PMH meta description
> (http://www.openarchives.org/OAI/openarchivesprotocol.html)
> Perhaps even a "pre crawler" that will scour for these to inject into the web
> db to help build your link map so you could even just index topN.
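As a rough illustration of the plain-text format mentioned in the issue (one line per url), here is a minimal sketch in Python. It is not Nutch code and does not touch the webdb: the function name `parse_text_sitemap` and the filtering rules (keep only http/https URLs, drop blank lines and duplicates) are assumptions made for the example. The resulting list is the kind of thing a "pre crawler" could hand to the inject step.

```python
from urllib.parse import urlparse


def parse_text_sitemap(text):
    """Return the distinct http(s) URLs from a one-URL-per-line sitemap.

    Hypothetical helper for illustration only; order of first
    appearance is preserved.
    """
    urls = []
    seen = set()
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parsed = urlparse(candidate)
        # Assumption: only absolute http/https URLs are worth injecting.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls


if __name__ == "__main__":
    sample = (
        "http://example.com/\n"
        "http://example.com/about.html\n"
        "\n"
        "not a url\n"
        "http://example.com/\n"
    )
    for url in parse_text_sitemap(sample):
        print(url)
```

The RSS, XML and OAI-PMH variants would need a real XML parser instead of line splitting, but the output (a de-duplicated list of URLs to inject) would have the same shape.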
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
