[
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880305#comment-13880305
]
Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------
hey [~tejasp] no probs. RE: #3, I was just curious to see if we could reuse
some of the methods we had in URLUtil. Now that I've looked, I feel you're right.
This patch reminds me of pushing filtering and normalization out to crawler
commons anyway, but that is another can of worms :)
I'll let others comment here. Right now I am +1 on this patch.
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch,
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)