[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

Tejas Patil (JIRA) Mon, 28 Jan 2013 16:19:15 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564883#comment-13564883
 ]


Tejas Patil commented on NUTCH-1465:
------------------------------------

Hi Sebastian,

So we are looking at 2 things here:
- a standalone utility for injecting sitemaps into the crawldb:
-# User starts off with urls of sitemap pages
-# SitemapInjector fetches these seeds and parses them (with a parse plugin 
based on CC)
-# SitemapInjector updates the crawldb with the sitemap entries.

- handling of sitemap within the nutch cycle: fetch, parse and update phases
-# Robots parsing will populate a table of "host": <_list of links to sitemap 
pages_>
-# These will be added to the fetcher queue and will be fetched
-# A parser plugin based on CC will parse the sitemap page
-# Outlink class needs to be extended to store the metadata obtained from the 
sitemap
-# Write this into the segment
-# Update phase needs to update the crawl frequency of already existing urls in 
the crawldb based on what we got from the sitemap. Otherwise it just adds new 
entries to the crawldb.
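
The update step in the last item can be sketched roughly as follows. This is 
only an illustration under assumptions: CrawlEntry and SitemapUpdateSketch are 
hypothetical stand-ins (not Nutch's CrawlDatum or any existing Nutch class), 
the crawldb is modelled as an in-memory map, and the sitemap is assumed to 
have already been reduced to a url -> fetch interval (seconds) mapping.

```java
import java.util.Map;

// Hypothetical stand-in for a crawldb record; NOT Nutch's CrawlDatum.
class CrawlEntry {
    String status;          // e.g. "fetched" or "unfetched" (illustrative values)
    int fetchIntervalSec;   // how often the url should be re-fetched

    CrawlEntry(String status, int fetchIntervalSec) {
        this.status = status;
        this.fetchIntervalSec = fetchIntervalSec;
    }
}

// Hypothetical helper showing the "update existing, else add" rule.
class SitemapUpdateSketch {
    static void applySitemap(Map<String, CrawlEntry> crawlDb,
                             Map<String, Integer> sitemapIntervals) {
        for (Map.Entry<String, Integer> e : sitemapIntervals.entrySet()) {
            CrawlEntry existing = crawlDb.get(e.getKey());
            if (existing != null) {
                // Known url: keep its status, only adjust the crawl frequency.
                existing.fetchIntervalSec = e.getValue();
            } else {
                // Unknown url: add it as a new, not yet fetched entry.
                crawlDb.put(e.getKey(), new CrawlEntry("unfetched", e.getValue()));
            }
        }
    }
}
```

In a real update job the merge would happen per url in a reducer rather than 
over an in-memory map; the sketch only shows the decision rule.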

I am not clear about extending Outlink. The normal outlink extraction need not 
be done, as CC will already do that for us. The sitemap parser plugin must do 
this and create objects of our specialized sitemap link. While writing, where 
is the CrawlDatum generated from the outlink?

The mime type that we get is "text/xml", which can also mean a normal xml file. 
How will Nutch identify whether it is a sitemap page and invoke the correct 
parser plugin? (I know that this magic is done for the feed parser, but I am 
not sure which part of the code does it. Just point me to that code.)
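
For illustration, one way to tell a sitemap apart from an arbitrary 
"text/xml" document is to look at the root element: the sitemap protocol uses 
<urlset> or <sitemapindex> in the namespace 
http://www.sitemaps.org/schemas/sitemap/0.9. The sketch below is an assumption 
about how such a check could look, not a description of how Nutch selects the 
feed parser; SitemapSniffer and looksLikeSitemap are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; not part of Nutch.
class SitemapSniffer {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Returns true if the first element of the content is a sitemap root element.
    static boolean looksLikeSitemap(byte[] content) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(content));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // Only the root element is inspected; the rest is not read.
                    String name = reader.getLocalName();
                    return SITEMAP_NS.equals(reader.getNamespaceURI())
                            && ("urlset".equals(name) || "sitemapindex".equals(name));
                }
            }
            return false;
        } catch (XMLStreamException e) {
            // Not well-formed xml, so certainly not a sitemap.
            return false;
        }
    }
}
```

A check like this only reads up to the first start tag, so it stays cheap 
even for large sitemap files.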

                
> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.7
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
