[
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016943#comment-16016943
]
ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------
lewismc commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-302617703
@sebastian-nagel I've addressed all but two of your comments and responded.
I've also implemented parameterized logging. In addition, I've dropped
STATUS_SITEMAP, replacing its instances with STATUS_INJECTED.
N.B. when I run this as follows, I am not currently able to inject any URLs
into the CrawlDB:
```
# First, I inject a random URL to create a CrawlDB
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch inject crawl urls/
Injector: starting at 2017-05-18 23:01:14
Injector: crawlDb: crawl
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2017-05-18 23:01:15, elapsed: 00:00:01
# I then attempt to process the sitemap at http://www.autotrader.com/sitemap.xml,
# which I've added to a seed file located in a 'sitemaps' directory
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch sitemap crawl -sitemapUrls sitemaps
SitemapProcessor: sitemap urls dir: sitemaps
SitemapProcessor: Starting at 2017-05-18 23:06:38
robots.txt whitelist not configured.
SitemapProcessor: Total records rejected by filters: 0
SitemapProcessor: Total sitemaps from HostDb: 0
SitemapProcessor: Total sitemaps from seed urls: 1
SitemapProcessor: Total failed sitemap fetches: 0
SitemapProcessor: Total new sitemap entries added: 0
SitemapProcessor: Finished at 2017-05-18 23:06:48, elapsed: 00:00:10
# Let's read the CrawlDB
lmcgibbn@LMC-056430 /usr/local/nutch(NUTCH-1465) $ ./runtime/local/bin/nutch readdb crawl -stats
CrawlDb statistics start: crawl
Statistics for CrawlDb: crawl
TOTAL urls: 1
shortest fetch interval: 30 days, 00:00:00
avg fetch interval: 30 days, 00:00:00
longest fetch interval: 30 days, 00:00:00
earliest fetch time: Thu May 18 23:01:00 PDT 2017
avg of fetch times: Thu May 18 23:01:00 PDT 2017
latest fetch time: Thu May 18 23:01:00 PDT 2017
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
```
As you can see, no URLs from the sitemap seem to be processed: the count of new
sitemap entries added is zero, and this is confirmed by the readdb output.
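To rule out the sitemap format itself as the cause, here is a quick sanity check of what any parser should extract from a well-formed sitemap. This is an illustrative stdlib Python sketch with a hypothetical example.com sitemap, not Nutch's actual parsing code (which goes through crawler-commons):

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (sitemaps.org)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A minimal, hypothetical sitemap document for illustration
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/page1</loc>
    <lastmod>2017-05-18</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/page2</loc>
  </url>
</urlset>"""

def extract_urls(xml_text):
    """Return the <loc> values a sitemap processor would turn into CrawlDb entries."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

print(extract_urls(SITEMAP))
# -> ['http://www.example.com/page1', 'http://www.example.com/page2']
```

A well-formed sitemap should yield one candidate entry per `<url>` element, so if the fetch succeeded ("Total failed sitemap fetches: 0") and filters rejected nothing, the entries are being lost somewhere after parsing.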
I need to do some more debugging to see where the bug(s) are. If anyone is
able to try this patch out and has an interest in sitemap support in Nutch
master, it would be highly appreciated.
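On the "Total sitemaps from HostDb: 0" line above: sitemap discovery from a host's robots.txt works by reading its Sitemap: directives. A minimal illustration of that lookup, as a hedged stdlib Python sketch with a hypothetical robots.txt (not the actual Nutch/crawler-commons code):

```python
def extract_sitemap_urls(robots_txt):
    """Collect the URLs declared in 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        # The directive name is case-insensitive per the sitemaps.org protocol.
        if line.lower().startswith("sitemap:"):
            urls.append(line[len("sitemap:"):].strip())
    return urls

# Hypothetical robots.txt content for illustration
robots = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml"""

print(extract_sitemap_urls(robots))
# -> ['http://www.example.com/sitemap.xml']
```

Since the run above only used seed URLs (no HostDb), the zero count on that line is expected here.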
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch,
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch,
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch,
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1]
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)