[
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564836#comment-13564836
]
Markus Jelsma commented on NUTCH-1465:
--------------------------------------
Thanks all for your interesting comments.
It's a complicated issue. On one hand, host data should be stored in NUTCH-1325,
but that would require additional logic and sending each segment's output to the
hostdb in case a sitemap was crawled. On the other hand, it's ideal to
store host data. It's also easy to use in jobs such as the indexer and
generator.
I don't yet favour a specific approach but storing sitemap data in a hostdb may
be something to think about.
Cheers
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.7
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html