Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by 
CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=8&rev2=9

  
  '''Title :''' Sitemap detection is done. 
  
- The robots.txt file is checked while the fetcher job runs. If robots.txt 
lists any sitemap URLs, they are written to the database. A column called stm 
(sitemap) is added to the db schema for this purpose. The URLs in the stm 
column will be parsed later.
+ robots.txt is a file served by a website. It can list the site's sitemap 
URLs, so the sitemap URL list of a website can be read from this file. 
+ 
+ Nutch reads the robots.txt file while the fetcher job is running. The new 
sitemap crawler code block checks the file. If it contains any sitemap URLs, 
they are written to the stm (sitemap) column of the webpage table in the 
database.
+ 
+ The stm (sitemap) column is added to the webpage schema for the sitemap 
crawler. The URLs in the stm column will be parsed later.
  
  
  || '''Week :''' 3 & 4 (8 June 2015 - 21 June 2015) ||
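
The detection step described above (read robots.txt, collect any sitemap URLs) can be sketched roughly as follows. This is only an illustrative sketch, not the code added to Nutch: the class name RobotsSitemapExtractor and the method extractSitemaps are hypothetical, and it assumes the robots.txt content has already been fetched into a string. Writing the result to the stm column is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Nutch code base.
class RobotsSitemapExtractor {

    private static final String DIRECTIVE = "sitemap:";

    /** Returns the URL from every "Sitemap:" line in the given robots.txt content. */
    static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // In robots.txt, '#' starts a comment that runs to the end of the line.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            // Match the directive name case-insensitively at the start of the line.
            if (line.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String url = line.substring(DIRECTIVE.length()).trim();
                if (!url.isEmpty()) {
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }
}
```

Sitemap lines are collected regardless of which User-agent group they appear under, since the directive applies to the whole site. A caller would pass the fetched robots.txt body to extractSitemaps and store the returned URLs for later parsing.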
