[
https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309965#comment-16309965
]
ASF GitHub Bot commented on NUTCH-2490:
---------------------------------------
lewismc closed pull request #269: fix for NUTCH-2490 Fix sitemap index file processing
URL: https://github.com/apache/nutch/pull/269
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/src/java/org/apache/nutch/util/SitemapProcessor.java b/src/java/org/apache/nutch/util/SitemapProcessor.java
index 5150d61c3..c1c0c9a81 100644
--- a/src/java/org/apache/nutch/util/SitemapProcessor.java
+++ b/src/java/org/apache/nutch/util/SitemapProcessor.java
@@ -213,6 +213,7 @@ private void generateSitemapUrlDatum(Protocol protocol, String url, Context cont
       AbstractSiteMap asm = parser.parseSiteMap(content.getContentType(),
           content.getContent(), new URL(url));
       if(asm instanceof SiteMap) {
+        LOG.info("Parsing sitemap file: {}", asm.getUrl().toString());
         SiteMap sm = (SiteMap) asm;
         Collection<SiteMapURL> sitemapUrls = sm.getSiteMapUrls();
         for(SiteMapURL sitemapUrl: sitemapUrls) {
@@ -252,10 +253,13 @@ else if (asm instanceof SiteMapIndex) {
         SiteMapIndex index = (SiteMapIndex) asm;
         Collection<AbstractSiteMap> sitemapUrls = index.getSitemaps();
+        if (sitemapUrls.isEmpty()) {
+          return;
+        }
+
+        LOG.info("Parsing sitemap index file: {}", index.getUrl().toString());
         for(AbstractSiteMap sitemap: sitemapUrls) {
-          if(sitemap.isIndex()) {
-            generateSitemapUrlDatum(protocol, sitemap.getUrl().toString(), context);
-          }
+          generateSitemapUrlDatum(protocol, sitemap.getUrl().toString(), context);
         }
       }
     }
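
The effect of dropping the isIndex() check can be illustrated with a
small standalone sketch. It does not use Nutch or crawler-commons: the
class SitemapIndexSketch, the INDEXES fixture, the process() method and
the example.org URLs are all hypothetical stand-ins, and fetching and
parsing are replaced by a map lookup. It only models the recursion, on
the assumption that the children of an index are usually regular
sitemaps rather than further indexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SitemapIndexSketch {

  // Hypothetical fixture: each key is a sitemap index URL, each value the
  // sitemaps it lists. Any URL that is not a key counts as a regular sitemap.
  static final Map<String, List<String>> INDEXES = Map.of(
      "https://example.org/sitemap_index.xml",
      List.of("https://example.org/sitemap-a.xml",
              "https://example.org/sitemap-b.xml"));

  // Walks a sitemap URL and records every regular sitemap that gets parsed.
  // followOnlyIndexes = true models the conditional removed by the patch.
  static void process(String url, boolean followOnlyIndexes, List<String> parsed) {
    List<String> children = INDEXES.get(url);
    if (children == null) {
      // Regular sitemap: this is where page URLs would be emitted.
      parsed.add(url);
      return;
    }
    if (children.isEmpty()) {
      return; // empty index, nothing to do (mirrors the early return in the patch)
    }
    for (String child : children) {
      boolean childIsIndex = INDEXES.containsKey(child);
      if (!followOnlyIndexes || childIsIndex) {
        process(child, followOnlyIndexes, parsed);
      }
    }
  }

  public static void main(String[] args) {
    String root = "https://example.org/sitemap_index.xml";

    List<String> before = new ArrayList<>();
    process(root, true, before);
    System.out.println("with the isIndex() check: " + before);

    List<String> after = new ArrayList<>();
    process(root, false, after);
    System.out.println("without the check: " + after);
  }
}
```

In this model the run with the check parses no sitemaps at all, because
neither child of the index is itself an index, while the run without the
check reaches both sitemap-a.xml and sitemap-b.xml.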
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Sitemap processing: Sitemap index files not working
> ---------------------------------------------------
>
> Key: NUTCH-2490
> URL: https://issues.apache.org/jira/browse/NUTCH-2490
> Project: Nutch
> Issue Type: Bug
> Reporter: Moreno Feltscher
> Assignee: Moreno Feltscher
> Fix For: 1.15
>
>
> The [sitemap processing feature|https://wiki.apache.org/nutch/SitemapFeature]
> does not properly handle sitemap index files due to an unnecessary conditional.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)