[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848561#comment-13848561
 ] 

Tejas Patil commented on NUTCH-1465:


Revisited this Jira after a long time and gave some thought to how this can be 
done cleanly. There are two ways to implement it:

*(A) Do the sitemap handling in the fetch phase of the nutch cycle.*
This was my original approach, which the (in-progress) patch addresses. It 
would involve tweaking core nutch classes in several places.

Pros:
- Sitemaps are essentially normal pages with many outlinks, so they fit well 
into the 'fetch' cycle.

Cons:
- Sitemaps can be huge. Fetching them needs larger size and time limits, so 
the fetch code must special-case sitemap urls and apply custom limits, which 
leads to a hacky coding style.
- The Outlink class cannot hold the extra information contained in sitemaps 
(like lastmod, changefreq). We could modify it to hold this information too, 
but that is sitemap-specific and we would end up making all outlinks carry the 
extra fields. Alternatively, we could create a special type of outlink, as 
sketched below.
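For illustration, a minimal sketch of what such a special outlink could look 
like; the SitemapOutlink class is hypothetical and not part of any patch here:

{code:java}
import java.net.MalformedURLException;

import org.apache.nutch.parse.Outlink;

// Hypothetical subclass carrying the sitemap-specific fields so that
// ordinary outlinks stay unchanged.
public class SitemapOutlink extends Outlink {
  private long lastModified;     // <lastmod> of the sitemap entry, epoch ms
  private byte changeFrequency;  // <changefreq>, encoded as a constant

  public SitemapOutlink(String toUrl, String anchor, long lastModified,
      byte changeFrequency) throws MalformedURLException {
    super(toUrl, anchor);
    this.lastModified = lastModified;
    this.changeFrequency = changeFrequency;
  }

  public long getLastModified() { return lastModified; }

  public byte getChangeFrequency() { return changeFrequency; }

  // Writable methods (write/readFields) would have to be overridden as
  // well so the extra fields survive serialization; omitted here.
}
{code}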

*(B) Have a separate job for the sitemap work and merge its output into the 
crawldb.*
i. The user populates a list of hosts (or uses HostDB from NUTCH-1325). Now we 
have all the hosts to be processed.
ii. Run a map-reduce job: for each host,
  - fetch the robots.txt page and extract the sitemap urls,
  - fetch the xml content of these sitemap pages,
  - create crawl datums with the required info and write them to a sitemapDB 
(see the mapper sketch after this list).
iii. Use the CrawlDbMerger utility to merge the sitemapDB into the crawldb.
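A rough sketch of the map task in step ii, using the crawler-commons robots 
and sitemap parsers. The fetch() and fetchIntervalFor() helpers are 
hypothetical stand-ins for Nutch's protocol layer and a changefreq-to-interval 
mapping, and recursion into sitemap index files is omitted:

{code:java}
import java.io.IOException;
import java.net.URL;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;
import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;
import crawlercommons.sitemaps.UnknownFormatException;

public class SitemapMapper extends Mapper<Text, Text, Text, CrawlDatum> {

  private final SimpleRobotRulesParser robotsParser = new SimpleRobotRulesParser();
  private final SiteMapParser sitemapParser = new SiteMapParser();

  @Override
  protected void map(Text host, Text ignored, Context context)
      throws IOException, InterruptedException {
    // 1. Fetch robots.txt of the host and pull out the sitemap URLs.
    URL robotsUrl = new URL("http://" + host + "/robots.txt");
    BaseRobotRules rules = robotsParser.parseContent(robotsUrl.toString(),
        fetch(robotsUrl), "text/plain", "nutch-sitemap-job");

    // 2. Fetch and parse each sitemap, emit one CrawlDatum per entry.
    for (String sitemapUrl : rules.getSitemaps()) {
      URL url = new URL(sitemapUrl);
      AbstractSiteMap sm;
      try {
        sm = sitemapParser.parseSiteMap(fetch(url), url);
      } catch (UnknownFormatException e) {
        continue; // not a parseable sitemap, skip it
      }
      if (sm instanceof SiteMap) { // sitemap index files not handled here
        for (SiteMapURL entry : ((SiteMap) sm).getSiteMapUrls()) {
          CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED,
              fetchIntervalFor(entry));
          if (entry.getLastModified() != null) {
            datum.setModifiedTime(entry.getLastModified().getTime());
          }
          context.write(new Text(entry.getUrl().toString()), datum);
        }
      }
    }
  }

  // Hypothetical helpers: fetch() would go through Nutch's protocol
  // layer; fetchIntervalFor() would map <changefreq> to an interval.
  private byte[] fetch(URL url) throws IOException { return new byte[0]; }

  private int fetchIntervalFor(SiteMapURL entry) { return 30 * 24 * 3600; }
}
{code}

Step iii then reduces to the existing merge tool, e.g. {{bin/nutch mergedb 
crawldb_merged crawldb sitemapDB}}, after which the temporary sitemapDB can be 
deleted.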

Pros:
- Cleaner code. 
- Users control when to perform sitemap extraction. This is better than (A), 
where sitemap urls sit in the crawldb and get fetched along with normal pages 
(thus eating into the time of every fetch phase). We could have a 
sitemap_frequency setting used inside the crawl script so that users can say: 
after every 'x' nutch cycles, run sitemap processing.

Cons:
- Additional map-reduce jobs are needed. I think this is reasonable: running 
the sitemap job 1-5 times a month on a production-level crawl should work out 
well.

I am inclined towards implementing (B).

 Support sitemaps in Nutch
 -------------------------

         Key: NUTCH-1465
         URL: https://issues.apache.org/jira/browse/NUTCH-1465
     Project: Nutch
  Issue Type: New Feature
  Components: parser
    Reporter: Lewis John McGibbney
    Assignee: Tejas Patil
     Fix For: 1.9

 Attachments: NUTCH-1465-trunk.v1.patch


 I recently came across this rather stagnant codebase [0], which is ASL v2.0
 licensed and appears to have been used successfully to parse sitemaps, as per
 the discussion here [1].
 [0] http://sourceforge.net/projects/sitemap-parser/
 [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1465:
---

Attachment: NUTCH-1465-sitemapinjector-trunk-v1.patch

Hi Tejas,
attached you'll find a patch for a sitemap injector. Originally written by 
Hannes Schwarz, it has been in use by us for quite some time. The patch 
contains a revised and improved version which, however, needs some more work 
(see TODOs in the code).
The use case is somewhat different from way (B): the sitemap injector takes 
the URLs of sitemaps directly (not via robots.txt) and injects them straight 
into the CrawlDb (no extra sitemapDB - do we really need an extra DB?). 
Robots.txt is not used as an intermediate step/hop because experience has 
shown that customers often prepare a special sitemap for the site-search 
crawler which differs from the sitemap announced in robots.txt.
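For concreteness, the injector input is a flat seed list of sitemap URLs, 
presumably one per line like regular injector seeds (the URLs below are 
invented for illustration):

{code}
http://www.example.com/sitemap.xml
http://www.example.com/sitemap-news.xml
http://shop.example.org/sitemap_index.xml.gz
{code}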
Btw., NUTCH-1622 would enable solution A: outlinks now can hold extra info. 
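If I understand NUTCH-1622 correctly, the extra info rides along in the 
outlink's metadata, so (A) could attach the sitemap fields roughly as below. 
This is a sketch only: I'm assuming the metadata container is a MapWritable 
with a setMetadata() setter, and the key names are made up.

{code:java}
import java.net.MalformedURLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;

public class SitemapOutlinkMetadataExample {
  // Build an outlink that carries lastmod and changefreq from a sitemap
  // entry in its metadata map (key names are illustrative only).
  public static Outlink toOutlink(String url, long lastModifiedMillis,
      String changeFreq) throws MalformedURLException {
    Outlink outlink = new Outlink(url, "");
    MapWritable meta = new MapWritable();
    meta.put(new Text("sitemap.lastmod"), new LongWritable(lastModifiedMillis));
    meta.put(new Text("sitemap.changefreq"), new Text(changeFreq));
    outlink.setMetadata(meta);
    return outlink;
  }
}
{code}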



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],

Nice share. The only gripe I have with that approach is that users have to 
pick up sitemap urls for hosts *manually* and feed them to the sitemap 
injector. It would fit well where users are performing targeted crawling.
For a large-scale, open web crawl use case:
(i) the number of initial hosts can be large : a one-time burden for users
(ii) the crawler discovers new hosts over time : a constant pain for users to 
look out for the newly discovered hosts and then get sitemaps from robots.txt 
manually. With HostDB from NUTCH-1325 and (B), users won't suffer here.

> do we really need an extra DB?
I should have been clearer in my explanation. sitemapDB is a temporary 
location where the crawl datums of all sitemap entries would be written. It 
can be deleted after the merge with the main crawlDB, quite analogous to what 
the inject operation does (a sketch follows below).
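As a sketch of that inject-like lifecycle (merge, then drop the temporary DB), 
using CrawlDbMerger's programmatic API; the paths are placeholders:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDbMerger;
import org.apache.nutch.util.NutchConfiguration;

public class SitemapDbMergeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Path merged = new Path("crawl/crawldb_merged");
    Path[] inputs = { new Path("crawl/crawldb"), new Path("crawl/sitemapDB") };

    // Fold the temporary sitemapDB into the main crawldb ...
    new CrawlDbMerger(conf).merge(merged, inputs, false, false);

    // ... then drop the temporary DB, much as inject cleans up its
    // intermediate output.
    FileSystem.get(conf).delete(new Path("crawl/sitemapDB"), true);
  }
}
{code}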

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still, I would go in favor of (B) as it is clean, whereas 
(A) would involve messing around with the existing codebase in several places.


