[
https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308851#comment-16308851
]
ASF GitHub Bot commented on NUTCH-2490:
---------------------------------------
mfeltscher opened a new pull request #269: fix for NUTCH-2490 Fix sitemap index
file processing
URL: https://github.com/apache/nutch/pull/269
This fixes the processing of sitemap index files by removing an unnecessary
conditional.
Before:
```bash
$ echo "https://filialen.migros.ch/sitemap.xml" > sitemaps.txt && bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt
SitemapProcessor: sitemap urls dir: sitemaps.txt
SitemapProcessor: Starting at 2018-01-02 22:44:58
robots.txt whitelist not configured.
SitemapProcessor: Total records rejected by filters: 0
SitemapProcessor: Total sitemaps from HostDb: 0
SitemapProcessor: Total sitemaps from seed urls: 1
SitemapProcessor: Total failed sitemap fetches: 0
SitemapProcessor: Total new sitemap entries added: 0
SitemapProcessor: Finished at 2018-01-02 22:45:02, elapsed: 00:00:03
```
After:
```bash
$ echo "https://filialen.migros.ch/sitemap.xml" > sitemaps.txt && bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt
SitemapProcessor: sitemap urls dir: sitemaps.txt
SitemapProcessor: Starting at 2018-01-02 22:47:44
robots.txt whitelist not configured.
Parsing sitemap index file: https://filialen.migros.ch/sitemap.xml
Parsing sitemap file: https://filialen.migros.ch/de/sitemap.xml
Parsing sitemap file: https://filialen.migros.ch/fr/sitemap.xml
Parsing sitemap file: https://filialen.migros.ch/it/sitemap.xml
SitemapProcessor: Total records rejected by filters: 0
SitemapProcessor: Total sitemaps from HostDb: 0
SitemapProcessor: Total sitemaps from seed urls: 1
SitemapProcessor: Total failed sitemap fetches: 0
SitemapProcessor: Total new sitemap entries added: 5754
SitemapProcessor: Finished at 2018-01-02 22:47:58, elapsed: 00:00:13
```
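For readers unfamiliar with sitemap index files, the behaviour visible in the "After" log (one index file fanning out into several child sitemaps) can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the code in Nutch's SitemapProcessor. The function name `collect_urls`, the `fetch` callback, and the `example.com` URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url, fetch, seen=None):
    """Return page URLs reachable from sitemap_url, following index files.

    `fetch` is a hypothetical callback mapping a URL to its XML text.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against index files referencing each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag == NS + "sitemapindex":
        # An index file lists child sitemaps; descend into each of them.
        urls = []
        for loc in root.findall(f"{NS}sitemap/{NS}loc"):
            urls.extend(collect_urls(loc.text.strip(), fetch, seen))
        return urls
    # A regular sitemap lists page URLs directly.
    return [loc.text.strip() for loc in root.findall(f"{NS}url/{NS}loc")]


# Hypothetical in-memory documents standing in for fetched sitemaps.
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
DOCS = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://example.com/de/sitemap.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/fr/sitemap.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/de/sitemap.xml": (
        f"<urlset {XMLNS}><url><loc>https://example.com/de/a</loc></url></urlset>"
    ),
    "https://example.com/fr/sitemap.xml": (
        f"<urlset {XMLNS}><url><loc>https://example.com/fr/a</loc></url></urlset>"
    ),
}

if __name__ == "__main__":
    for url in collect_urls("https://example.com/sitemap.xml", DOCS.__getitem__):
        print(url)
```

If the index branch is skipped, the index file itself contributes no page URLs, which would match the "Total new sitemap entries added: 0" line in the "Before" output.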
> Sitemap processing: Sitemap index files not working
> ---------------------------------------------------
>
> Key: NUTCH-2490
> URL: https://issues.apache.org/jira/browse/NUTCH-2490
> Project: Nutch
> Issue Type: Bug
> Reporter: Moreno Feltscher
> Assignee: Moreno Feltscher
>
> The [sitemap processing
> feature](https://wiki.apache.org/nutch/SitemapFeature) does not properly
> handle sitemap index files due to an unnecessary conditional.