[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887763#comment-13887763
 ] 

Tejas Patil commented on NUTCH-1465:
------------------------------------

Re "filters and normalizers": +1.

Re "fetch intervals" and "reducer overwriting": I have never encountered bogus 
sitemaps, but that was for an intranet crawl, and it would be better to take 
care of that in this jira. Here is what I conclude from the discussion so far:
(1)  _fetch interval_: For old entries, don't use the value from sitemap. For 
new ones, use the value from sitemap provided 
(db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)
(2) _score_: Never use value from sitemap. For new ones, use scoring filters. 
Keep the value of old entries as it is.
(3) _modified time_: Always use the value from sitemap provided it's not a date 
in the future.
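
The three rules above could be sketched roughly as below. This is only an 
illustration, not code from any of the attached patches: the Entry class, the 
merge() signature and all parameter names are hypothetical stand-ins (Entry is 
not Nutch's CrawlDatum). The discussion does not say what to do when a sitemap 
interval is out of range or a modified time lies in the future, so the 
fallbacks used here (a default interval, and the old modified time or 0) are 
assumptions.

```java
public class SitemapMergeSketch {

    /** Hypothetical stand-in for a crawl db entry; NOT Nutch's CrawlDatum. */
    static final class Entry {
        final long fetchIntervalSec;
        final float score;
        final long modifiedTimeMs;

        Entry(long fetchIntervalSec, float score, long modifiedTimeMs) {
            this.fetchIntervalSec = fetchIntervalSec;
            this.score = score;
            this.modifiedTimeMs = modifiedTimeMs;
        }
    }

    /**
     * Merges a sitemap entry into the crawl db.
     *
     * @param old              existing entry, or null if the URL is new
     * @param sitemap          values read from the sitemap
     * @param minIntervalSec   stands for db.fetch.schedule.adaptive.min_interval
     * @param maxIntervalSec   stands for db.fetch.interval.max
     * @param fallbackInterval assumed fallback when the sitemap value is out of range
     * @param filterScore      score the scoring filters would assign to a new URL
     * @param nowMs            current time, used to reject future dates
     */
    static Entry merge(Entry old, Entry sitemap,
                       long minIntervalSec, long maxIntervalSec,
                       long fallbackInterval, float filterScore, long nowMs) {
        boolean isNew = (old == null);

        // (1) fetch interval: old entries keep theirs; new entries take the
        // sitemap value only when it lies within [min, max].
        long interval;
        if (!isNew) {
            interval = old.fetchIntervalSec;
        } else if (sitemap.fetchIntervalSec >= minIntervalSec
                && sitemap.fetchIntervalSec <= maxIntervalSec) {
            interval = sitemap.fetchIntervalSec;
        } else {
            interval = fallbackInterval; // assumption, not stated in the discussion
        }

        // (2) score: the sitemap value is never used.
        float score = isNew ? filterScore : old.score;

        // (3) modified time: take the sitemap value unless it is in the future.
        long keep = isNew ? 0L : old.modifiedTimeMs; // assumption for the reject case
        long modified = (sitemap.modifiedTimeMs <= nowMs) ? sitemap.modifiedTimeMs : keep;

        return new Entry(interval, score, modified);
    }
}
```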

Did I get it right?
 
Re "score": I missed that the jar is old. I will file a jira to upgrade CC to 
v0.3 in Nutch.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
