[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886677#comment-13886677
 ] 

Tejas Patil commented on NUTCH-1465:
------------------------------------

Interesting comments [~wastl-nagel].

Re "filters and normalizers" : By default I have kept those ON, but they can be 
disabled by using "-noFilter" and "-noNormalize".
Re "default content limits" and "fetch timeout": +1. Agree with you.
Re "Processing sitemap indexes fails" : +1. Nice catch.
Re "Fetch intervals of 1 second or 1 hour may cause troubles" : Currently, 
Injector allows users to provide a custom fetch interval with any value, e.g. 1 
sec. It makes sense not to correct it, as the user wants Nutch to use that 
custom fetch interval. If we view sitemaps as a custom seed list given by a 
content owner, then it would make sense to follow the intervals. But as you 
said, sitemaps can be wrongly set or outdated, so the intervals might be 
incorrect. The question boils down to: we blindly accept the user's custom 
information in inject. Should we also blindly assume that sitemaps are correct 
or not? I have no strong opinion about either side of the argument. 

(PS : Default 'db.fetch.schedule.adaptive.min_interval' is 1 min, so a check of 
the form db.fetch.schedule.adaptive.min_interval <= interval would allow 1 hr 
but not 1 sec.)
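
One possible shape of such a check, as a minimal sketch only: clamp the 
sitemap-supplied interval into a configured range. The class and method names 
are hypothetical, the upper bound is an assumed value for illustration, and 
only the 1 min lower bound comes from the default mentioned above.

```java
// Hypothetical helper, not part of any Nutch patch.
final class SitemapIntervalCheck {

    /** Clamp a sitemap-supplied fetch interval (in seconds) into [min, max]. */
    static long clampInterval(long sitemapIntervalSec, long minIntervalSec, long maxIntervalSec) {
        if (minIntervalSec > maxIntervalSec) {
            throw new IllegalArgumentException("min interval must not exceed max interval");
        }
        return Math.max(minIntervalSec, Math.min(sitemapIntervalSec, maxIntervalSec));
    }

    public static void main(String[] args) {
        long min = 60L;               // 1 min, the default min_interval quoted above
        long max = 365L * 24 * 3600;  // assumed upper bound, for illustration only

        System.out.println(clampInterval(1L, min, max));    // 1 sec is raised to 60
        System.out.println(clampInterval(3600L, min, max)); // 1 hr passes through as 3600
    }
}
```

Whether to clamp, reject, or accept such values as-is is exactly the open 
question above; the sketch only shows the clamping option.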

Re "SitemapReducer overwriting" : 
>> _"If a sitemap does not specify one of score, modified time, or fetch 
>> interval this values is set to zero. "_
Nope. See 
[SiteMapURL.java|https://code.google.com/p/crawler-commons/source/browse/trunk/src/main/java/crawlercommons/sitemaps/SiteMapURL.java]

 (a) score : Crawler commons assigns a default score of 0.5 if none was 
provided in the sitemap. 
We can do this: if an old entry has a score other than 0.5, preserve it; 
otherwise update it. For a new entry, use the scoring plugins if the score 
equals 0.5; otherwise keep the score from the sitemap. 
Limitation: It's not possible to distinguish whether a score of 0.5 came from 
the sitemap or is the default used when <priority> was absent.
 (b) fetch interval : Crawler commons does NOT set a fetch interval if none 
was provided in the sitemap. So we can be sure that whatever value is used 
comes from <changefreq>. Validation might be needed as per the comments above.
 (c) modified time : Same as fetch interval: unless parsed from the sitemap 
file, modified time is set to NULL. The only possible validation is to drop 
values greater than the current time.
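
The rules in (a), (b) and (c) could be sketched roughly as below. This is one 
reading of the proposal, not code from any of the attached patches: the class 
and method names are hypothetical, plain boxed values stand in for the CrawlDb 
and sitemap fields, and null means "no old entry" or "not given in the 
sitemap".

```java
// Hypothetical sketch of the merge rules; not taken from the NUTCH-1465 patches.
final class SitemapMergeSketch {

    /** Default score that crawler-commons assigns when the sitemap gives none. */
    static final double DEFAULT_SCORE = 0.5;

    /**
     * (a) oldScore is null for a new entry. An old score other than 0.5 is
     * preserved; a new entry with score 0.5 falls back to the scoring plugins.
     */
    static double chooseScore(Double oldScore, double sitemapScore, double pluginScore) {
        if (oldScore != null) {
            return oldScore != DEFAULT_SCORE ? oldScore : sitemapScore;
        }
        return sitemapScore == DEFAULT_SCORE ? pluginScore : sitemapScore;
    }

    /** (b) A missing sitemap interval (null) never overwrites the old one. */
    static int chooseInterval(int oldIntervalSec, Integer sitemapIntervalSec) {
        return sitemapIntervalSec != null ? sitemapIntervalSec : oldIntervalSec;
    }

    /** (c) A missing modified time, or one in the future, is dropped. */
    static long chooseModifiedTime(long oldModifiedMs, Long sitemapModifiedMs, long nowMs) {
        if (sitemapModifiedMs == null || sitemapModifiedMs > nowMs) {
            return oldModifiedMs;
        }
        return sitemapModifiedMs;
    }

    public static void main(String[] args) {
        System.out.println(chooseScore(0.8, 0.3, 1.0));   // old score 0.8 is preserved
        System.out.println(chooseScore(null, 0.5, 1.0));  // new entry at 0.5 uses the plugin score
        System.out.println(chooseInterval(86400, null));  // no sitemap interval, old one is kept
        System.out.println(chooseModifiedTime(1000L, 5000L, 2000L)); // future time is dropped
    }
}
```

The limitation noted in (a) shows up directly in chooseScore: a score of 0.5 
set on purpose in the sitemap is treated the same as the default.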

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
