[
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016943#comment-16016943
]
ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------
lewismc commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-302617703
@sebastian-nagel I've addressed all but two of your comments and responded.
I've also implemented parameterized logging. In addition, I've dropped
STATUS_SITEMAP, replacing its instances with STATUS_INJECTED.
N.B. when I run this as follows, I am not currently able to inject any URLs
into the CrawlDB:
```
# First, I inject a random URL to create a CrawlDB
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch inject crawl urls/
Injector: starting at 2017-05-18 23:01:14
Injector: crawlDb: crawl
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2017-05-18 23:01:15, elapsed: 00:00:01
# I then attempt to process the sitemap at http://www.autotrader.com/sitemap.xml,
# which I've added to a seed file located in a 'sitemaps' directory
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch sitemap crawl -sitemapUrls sitemaps
SitemapProcessor: sitemap urls dir: sitemaps
SitemapProcessor: Starting at 2017-05-18 23:06:38
robots.txt whitelist not configured.
SitemapProcessor: Total records rejected by filters: 0
SitemapProcessor: Total sitemaps from HostDb: 0
SitemapProcessor: Total sitemaps from seed urls: 1
SitemapProcessor: Total failed sitemap fetches: 0
SitemapProcessor: Total new sitemap entries added: 0
SitemapProcessor: Finished at 2017-05-18 23:06:48, elapsed: 00:00:10
# Let's read the CrawlDB
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch readdb crawl -stats
CrawlDb statistics start: crawl
Statistics for CrawlDb: crawl
TOTAL urls: 1
shortest fetch interval: 30 days, 00:00:00
avg fetch interval: 30 days, 00:00:00
longest fetch interval: 30 days, 00:00:00
earliest fetch time: Thu May 18 23:01:00 PDT 2017
avg of fetch times: Thu May 18 23:01:00 PDT 2017
latest fetch time: Thu May 18 23:01:00 PDT 2017
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
```
As you can see, no URLs from the sitemap seem to be processed: the count of new
sitemap entries added is zero, and this is confirmed by the readdb output.
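To rule out the sitemap format itself as the cause, here is a quick sanity check of what any parser should extract from a well-formed sitemap. This is an illustrative stdlib Python sketch with a hypothetical example.com sitemap, not Nutch's actual parsing code (which goes through crawler-commons):

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (sitemaps.org)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A minimal, hypothetical sitemap document for illustration
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/page1</loc>
    <lastmod>2017-05-18</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/page2</loc>
  </url>
</urlset>"""

def extract_urls(xml_text):
    """Return the <loc> values a sitemap processor would turn into CrawlDb entries."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

print(extract_urls(SITEMAP))
# -> ['http://www.example.com/page1', 'http://www.example.com/page2']
```

A well-formed sitemap should yield one candidate entry per `<url>` element, so if the fetch succeeded ("Total failed sitemap fetches: 0") and filters rejected nothing, the entries are being lost somewhere after parsing.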
I need to do some more debugging to see where the bug(s) are. If anyone is
able to try this patch out and has an interest in sitemap support in Nutch
master, it would be highly appreciated.
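On the "Total sitemaps from HostDb: 0" line above: sitemap discovery from a host's robots.txt works by reading its Sitemap: directives. A minimal illustration of that lookup, as a hedged stdlib Python sketch with a hypothetical robots.txt (not the actual Nutch/crawler-commons code):

```python
def extract_sitemap_urls(robots_txt):
    """Collect the URLs declared in 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        # The directive name is case-insensitive per the sitemaps.org protocol.
        if line.lower().startswith("sitemap:"):
            urls.append(line[len("sitemap:"):].strip())
    return urls

# Hypothetical robots.txt content for illustration
robots = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml"""

print(extract_sitemap_urls(robots))
# -> ['http://www.example.com/sitemap.xml']
```

Since the run above only used seed URLs (no HostDb), the zero count on that line is expected here.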
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch,
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch,
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1]
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)