[ 
https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308851#comment-16308851
 ] 

ASF GitHub Bot commented on NUTCH-2490:
---------------------------------------

mfeltscher opened a new pull request #269: fix for NUTCH-2490 Fix sitemap index 
file processing
URL: https://github.com/apache/nutch/pull/269
 
 
   This fixes processing of sitemap index files by removing an unnecessary conditional.
   
   Before:
   ```bash
   $ echo "https://filialen.migros.ch/sitemap.xml" > sitemaps.txt && bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt
   SitemapProcessor: sitemap urls dir: sitemaps.txt
   SitemapProcessor: Starting at 2018-01-02 22:44:58
   robots.txt whitelist not configured.
   SitemapProcessor: Total records rejected by filters: 0
   SitemapProcessor: Total sitemaps from HostDb: 0
   SitemapProcessor: Total sitemaps from seed urls: 1
   SitemapProcessor: Total failed sitemap fetches: 0
   SitemapProcessor: Total new sitemap entries added: 0
   SitemapProcessor: Finished at 2018-01-02 22:45:02, elapsed: 00:00:03
   ```
   
   After:
   ```bash
   $ echo "https://filialen.migros.ch/sitemap.xml" > sitemaps.txt && bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt
   SitemapProcessor: sitemap urls dir: sitemaps.txt
   SitemapProcessor: Starting at 2018-01-02 22:47:44
   robots.txt whitelist not configured.
   Parsing sitemap index file: https://filialen.migros.ch/sitemap.xml
   Parsing sitemap file: https://filialen.migros.ch/de/sitemap.xml
   Parsing sitemap file: https://filialen.migros.ch/fr/sitemap.xml
   Parsing sitemap file: https://filialen.migros.ch/it/sitemap.xml
   SitemapProcessor: Total records rejected by filters: 0
   SitemapProcessor: Total sitemaps from HostDb: 0
   SitemapProcessor: Total sitemaps from seed urls: 1
   SitemapProcessor: Total failed sitemap fetches: 0
   SitemapProcessor: Total new sitemap entries added: 5754
   SitemapProcessor: Finished at 2018-01-02 22:47:58, elapsed: 00:00:13
   ```
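
To compare the two runs without reading the whole log, the summary line can be pulled out with a small filter. This is only a sketch: `new_sitemap_entries` is a hypothetical helper, not part of Nutch, and it assumes the summary line keeps the exact wording shown in the transcripts above.

```bash
# Hypothetical helper (not part of Nutch): read SitemapProcessor output on
# stdin and print the number from the "Total new sitemap entries added" line.
new_sitemap_entries() {
  sed -n 's/^[[:space:]]*SitemapProcessor: Total new sitemap entries added: \([0-9][0-9]*\)[[:space:]]*$/\1/p'
}

# Feed it the two summary lines quoted in the transcripts above.
before=$(printf '%s\n' 'SitemapProcessor: Total new sitemap entries added: 0' | new_sitemap_entries)
after=$(printf '%s\n' 'SitemapProcessor: Total new sitemap entries added: 5754' | new_sitemap_entries)
echo "before=$before after=$after"
```

On a real run the command's output could be piped into the function, for example `bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt 2>&1 | new_sitemap_entries`, assuming the summary is written to the console as it is in the transcripts.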



> Sitemap processing: Sitemap index files not working
> ---------------------------------------------------
>
>                 Key: NUTCH-2490
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2490
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Moreno Feltscher
>            Assignee: Moreno Feltscher
>
> The [sitemap processing
> feature](https://wiki.apache.org/nutch/SitemapFeature) does not properly
> handle sitemap index files due to an unnecessary conditional.



