[jira] Commented: (NUTCH-158) Process Sitemap data in text, rss or xml format as well as OAI-PMH

raghavendra prabhu (JIRA) Tue, 07 Feb 2006 13:03:49 -0800

    [ http://issues.apache.org/jira/browse/NUTCH-158?page=comments#action_12365483 ]


raghavendra prabhu commented on NUTCH-158:
------------------------------------------

This is an important feature.

We should be able to automatically insert the links parsed out of a sitemap
into the webdb.

But if we enable parse-rss today and crawl these links, don't they already get
added?

> Process Sitemap data in text, rss or xml format as well as OAI-PMH
> ------------------------------------------------------------------
>
>          Key: NUTCH-158
>          URL: http://issues.apache.org/jira/browse/NUTCH-158
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: byron miller
>     Priority: Minor

>
> Add support to the fetcher to look for sitemap files, download them and 
> process them into webdb.
> Perhaps create a robots.txt directive that can be used to create a standard 
> format for sitemaps in RSS, XML or text format (one line per url) and process 
> that.
> I would love to see someone stomp on proprietary sitemap features, and on 
> making things as Google-specific as they are today :)
> * RSS format/Atom Format (standard)
> * XML meta description
> * OAI-PMH meta description 
> (http://www.openarchives.org/OAI/openarchivesprotocol.html)
> Perhaps even a "pre crawler" that will scour for these to inject into the web 
> db to help build your link map so you could even just index topN.
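
To make the "text format (one line per url)" idea above concrete, here is a minimal sketch of how such a file might be parsed before its links are handed to the webdb. The class and method names (TextSitemapParser, parse) are hypothetical and not part of Nutch, and the sketch deliberately stops short of any webdb call, since the issue does not specify how injection would work. It assumes that only absolute http/https URLs are of interest and that anything else on a line can be skipped.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Nutch: parses a plain-text sitemap (one URL per line). */
class TextSitemapParser {

    /** Returns the absolute http/https URLs found in the sitemap text, in file order. */
    static List<String> parse(String sitemapText) {
        List<String> urls = new ArrayList<>();
        for (String line : sitemapText.split("\\r?\\n")) {
            String candidate = line.trim();
            if (candidate.isEmpty()) {
                continue; // skip blank lines
            }
            try {
                URI uri = new URI(candidate);
                String scheme = uri.getScheme();
                // Assumption: only absolute http/https URLs with a host are kept.
                if (scheme != null
                        && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                        && uri.getHost() != null) {
                    urls.add(candidate);
                }
            } catch (URISyntaxException e) {
                // skip lines that are not syntactically valid URIs
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String sample = "http://example.com/a\n\nnot a url\nhttps://example.com/b\n";
        for (String url : parse(sample)) {
            System.out.println(url);
        }
    }
}
```

The resulting list is what a "pre crawler" along the lines suggested above could pass to whatever injection step the webdb ends up exposing.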

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
