[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979161#comment-15979161 ]

Sebastian Nagel edited comment on NUTCH-1465 at 4/21/17 6:14 PM:
-----------------------------------------------------------------

Hi Lewis,

a couple of months ago I applied the latest patch here (NUTCH-1465-trunk.v5.patch) to master, see https://github.com/sebastian-nagel/nutch/tree/NUTCH-1465. But I had to port this to the Common Crawl fork of Nutch (https://github.com/commoncrawl/nutch), so I've chosen the [SitemapInjector|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java] from an older patch which was still based on the old mapred API.

was (Author: wastl-nagel):
Hi Lewis,

a couple of months ago I applied the latest patch here (NUTCH-1465-trunk.v5.patch) to master, see https://github.com/sebastian-nagel/nutch/tree/NUTCH-1465. But I had to port this to the Common Crawl fork of Nutch (https://github.com/commoncrawl/nutch), so I've chosen the [SitemapInjector|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java] from an older patch which was still based on the old maped API.

> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979161#comment-15979161 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Hi Lewis,

a couple of months ago I applied the latest patch here (NUTCH-1465-trunk.v5.patch) to master, see https://github.com/sebastian-nagel/nutch/tree/NUTCH-1465. But I had to port this to the Common Crawl fork of Nutch (https://github.com/commoncrawl/nutch), so I've chosen the [SitemapInjector|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java] from an older patch which was still based on the old mapred API.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979078#comment-15979078 ]

Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------

I'm going to take this on. We want full sitemap support in our current crawlers so I am making this my priority. I'll submit a pull request for current patches then we can take it from there.
[jira] [Assigned] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1465:
-------------------------------------------

Assignee: Lewis John McGibbney (was: Tejas Patil)