Process Sitemap data in text, rss or xml format as well as OAI-PMH
------------------------------------------------------------------

         Key: NUTCH-158
         URL: http://issues.apache.org/jira/browse/NUTCH-158
     Project: Nutch
        Type: New Feature
  Components: fetcher  
    Versions: 0.8-dev    
    Reporter: byron miller
    Priority: Minor


Add support to the fetcher to look for sitemap files, download them, and add 
the URLs they list to the webdb.

Perhaps define a robots.txt directive that points to a sitemap in a standard 
format (RSS, XML, or plain text with one URL per line), and have the fetcher 
process that.
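
To make the idea concrete, here is a minimal sketch of the robots.txt 
directive and the one-URL-per-line text format suggested above. The directive 
name "Sitemap:" and both function names are assumptions made for illustration; 
nothing here is existing Nutch code.

```python
from urllib.parse import urlparse


def sitemap_urls_from_robots(robots_txt):
    """Collect the value of every 'Sitemap:' line in a robots.txt body.

    The directive name 'Sitemap' is an assumption for this sketch.
    """
    found = []
    for line in robots_txt.splitlines():
        # Split "Key: value" on the first colon only, so the colon
        # inside "http://" stays part of the value.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def parse_text_sitemap(text):
    """Return the http(s) URLs from a plain-text sitemap, one URL per line."""
    urls = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parsed = urlparse(candidate)
        # Keep only absolute http/https URLs; ignore anything else.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls
```

The fetcher would first read robots.txt (which it already has to fetch), 
follow each sitemap location it finds, and hand the resulting URLs to the 
webdb.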

I would love to see someone stomp out proprietary sitemap features, rather 
than things staying as Google-specific as they are today :)

* RSS format/Atom Format (standard)
* XML meta description
* OAI-PMH meta description 
(http://www.openarchives.org/OAI/openarchivesprotocol.html)
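
As a rough sketch of how the RSS, Atom and XML cases in the list above could 
share one code path: the function below pulls page URLs out of RSS items, Atom 
entries, and a urlset/url/loc style XML sitemap. The element names assumed for 
the XML sitemap and the function names are illustrative assumptions, and 
OAI-PMH is not covered here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_urls(xml_text):
    """Return page URLs from an RSS feed, an Atom feed, or an XML sitemap.

    Only links inside <item> (RSS), <entry> (Atom) or <url> (XML sitemap,
    assumed layout) are used, so a feed-level <link> is ignored.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for elem in root.iter():
        if local_name(elem.tag) not in ("item", "entry", "url"):
            continue
        for child in elem:
            name = local_name(child.tag)
            if name == "loc" and child.text and child.text.strip():
                # XML sitemap: <url><loc>http://...</loc></url>
                urls.append(child.text.strip())
            elif name == "link":
                href = child.get("href")
                if href:
                    # Atom: <link href="http://..."/>
                    urls.append(href.strip())
                elif child.text and child.text.strip():
                    # RSS: <link>http://...</link>
                    urls.append(child.text.strip())
    return urls
```

Matching on the local element name keeps the sketch independent of which 
namespace a given feed declares.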

Perhaps even a "pre-crawler" that scours sites for these files and injects 
their URLs into the webdb to help build your link map, so you could even just 
index the topN.



