Process Sitemap data in text, rss or xml format as well as OAI-PMH
------------------------------------------------------------------
Key: NUTCH-158
URL: http://issues.apache.org/jira/browse/NUTCH-158
Project: Nutch
Type: New Feature
Components: fetcher
Versions: 0.8-dev
Reporter: byron miller
Priority: Minor
Add support to the fetcher to look for sitemap files, download them, and
process them into the webdb.
Perhaps define a robots.txt directive that points to a sitemap in a standard
format (RSS, XML, or plain text with one URL per line) and have the fetcher
process that.
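To make the robots.txt idea concrete, here is a minimal Python sketch. It
assumes a directive named "Sitemap" (the issue does not name one, so that name
is hypothetical here) and the plain-text format with one URL per line. The
function names are illustrative only and are not part of Nutch.

```python
from typing import List
from urllib.parse import urlparse


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Collect the value of every (assumed) 'Sitemap:' line in a robots.txt body."""
    found = []
    for raw in robots_txt.splitlines():
        # Drop trailing comments, then split "key: value" on the first colon.
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def urls_from_text_sitemap(body: str) -> List[str]:
    """Parse the one-URL-per-line text format, skipping blanks and non-HTTP lines."""
    urls = []
    for raw in body.splitlines():
        candidate = raw.strip()
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls
```

A fetcher could call the first function on each host's robots.txt, download
whatever it returns, and hand text-format bodies to the second function before
adding the resulting URLs to the webdb.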
I would love to see someone stomp on proprietary sitemap features, instead of
things staying as Google-specific as they are today :)
* RSS/Atom format (standard)
* XML meta description
* OAI-PMH meta description
(http://www.openarchives.org/OAI/openarchivesprotocol.html)
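For the RSS/Atom entry in the list above, URL extraction could look roughly
like the Python sketch below. It assumes RSS 2.0 (item/link elements without a
namespace) and Atom 1.0 (entry/link elements with an href attribute). RSS 1.0,
the generic XML description, and OAI-PMH are not covered, and the function
name is illustrative only.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of Atom 1.0 elements, in ElementTree's "{uri}tag" notation.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def urls_from_feed(xml_text: str) -> List[str]:
    """Return the per-entry links of an RSS 2.0 or Atom 1.0 feed."""
    root = ET.fromstring(xml_text)
    urls = []
    # RSS 2.0: each <item> carries its URL as the text of a <link> child.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())
    # Atom 1.0: each <entry> carries its URL in <link href="...">.
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href:
                urls.append(href.strip())
    return urls
```

Only links inside items or entries are returned, so the feed-level link that
points at the site itself is skipped.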
Perhaps even add a "pre-crawler" that scours for these files and injects their
URLs into the webdb to help build your link map, so that you could even index
just the topN.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers