[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092938#comment-16092938
 ] 

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

I've modified the description of the properties 
[sitemap.strict.parsing|https://github.com/sebastian-nagel/nutch/commit/de92a387ba5314b202da9fc006979927fe697be0#diff-d45b2920590dbd66188eb546753d1834R2555]
 and 
[sitemap.url.overwrite.existing|https://github.com/sebastian-nagel/nutch/commit/de92a387ba5314b202da9fc006979927fe697be0#diff-d45b2920590dbd66188eb546753d1834R2589].
 But feel free to add your modifications/additions. I just tried to make it 
understandable to anyone who does not know the gory details.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
