[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
-------------------------------

    Attachment: NUTCH-1465-trunk.v1.patch

This is a work in progress. So far I have done the following:
- Added a new status, STATUS_SITEMAP, to CrawlDatum. I plan to use it to 
identify sitemap URLs in the update phase.
- Modified the robots parsing code to extract the links to sitemap pages.
- Added a new class, SitemapProcessor, which caches the links to sitemap 
pages, uses the sitemap parser in CC, and ensures that the sitemaps for a 
given host are processed just once.
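
To make the robots and once-per-host ideas concrete, here is a rough, standalone sketch. It is not the code in the patch: the class name SitemapSketch and the methods extractSitemapUrls and claimHost are hypothetical, it uses no Nutch or CC classes, and it assumes Java 8 or later.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; this is NOT the SitemapProcessor class from the patch.
class SitemapSketch {

  // Hosts whose sitemaps have already been claimed for processing.
  private final Set<String> seenHosts = ConcurrentHashMap.newKeySet();

  /** Collects the values of "Sitemap:" lines from robots.txt content. */
  static List<String> extractSitemapUrls(String robotsTxt) {
    List<String> urls = new ArrayList<>();
    for (String line : robotsTxt.split("\\r?\\n")) {
      String trimmed = line.trim();
      // The directive name is matched case-insensitively.
      if (trimmed.toLowerCase(Locale.ROOT).startsWith("sitemap:")) {
        String url = trimmed.substring("sitemap:".length()).trim();
        if (!url.isEmpty()) {
          urls.add(url);
        }
      }
    }
    return urls;
  }

  /** Returns true only the first time the host of the given URL is seen. */
  boolean claimHost(String url) {
    String host = URI.create(url).getHost();
    // Set.add returns false if the host was already present.
    return host != null && seenHosts.add(host.toLowerCase(Locale.ROOT));
  }
}
```

A caller would process a host's sitemaps only when claimHost returns true, so repeated fetches from the same host do not trigger repeated sitemap parsing. The set is concurrent because fetching is multi-threaded; that choice is an assumption of this sketch.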

I have attached a patch (NUTCH-1465-trunk.v1.patch) with these changes.
Things pending:
- Write the sitemap URLs (from the Fetcher class) to the segments as 
CrawlDatum entries.
- Modify the update phase to handle STATUS_SITEMAP and update the crawl 
frequency.
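
For the crawl frequency part, one possible approach is to translate the changefreq value from the sitemap protocol into a re-fetch interval. The sketch below is only an illustration of that idea: the class ChangeFreqSketch, the method toFetchIntervalSeconds, and the chosen interval values are all assumptions, and nothing here is wired into CrawlDatum or the update phase.

```java
import java.util.Locale;

// Hypothetical sketch; names and interval values are illustrative assumptions.
class ChangeFreqSketch {
  static final int HOUR = 60 * 60;
  static final int DAY = 24 * HOUR;

  /** Maps a sitemap changefreq value to a re-fetch interval in seconds. */
  static int toFetchIntervalSeconds(String changeFreq, int defaultInterval) {
    if (changeFreq == null) {
      return defaultInterval;
    }
    switch (changeFreq.trim().toLowerCase(Locale.ROOT)) {
      case "always":
      case "hourly":
        return HOUR;
      case "daily":
        return DAY;
      case "weekly":
        return 7 * DAY;
      case "monthly":
        return 30 * DAY;
      case "yearly":
        return 365 * DAY;
      default:
        // "never" and unrecognized values fall back to the caller's default.
        return defaultInterval;
    }
  }
}
```

Falling back to the default for "never" and for unknown values keeps a bad or missing hint from permanently excluding a page; whether that is the right policy is open for discussion.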

If anyone has any suggestions in terms of design and approach, please let me 
know.
                
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>             Fix For: 1.7
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

