Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=8&rev2=9

  '''Title :''' Sitemap detection is done.

- Robot.txt file is checked while fetcher job is run. If robot.txt file have any sitemap urls, these are written to database. A column called sitemap(stm) for sitemap is added to db schema. The urls in stm column from db will be parsed at the next time.
+ Robot.txt is a file on the website. The file has sitemap url list. So, sitemap url list of a website can be accessed from this file.
+ 
+ Nutch Project reads robot.txt file while fetcher job is running. The file is checked from new code block of sitemap crawler. If it has any sitemap urls, these are written to stm(sitemap) column in the webpage table on the database.
+ 
+ The stm(sitemap)column is added to webpage schema for sitemap crawler. The urls in stm column from db will be parsed at the next time.
  || '''Week :''' 3 & 4 (8 June 2015 - 21 June 2015) ||
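
The detection step described in the report could look roughly like the sketch below. It is not the code from the sitemap crawler patch: the class name RobotsSitemapSketch and the method extractSitemapUrls are hypothetical, and it only shows how "Sitemap:" lines can be pulled out of the text of a robots.txt file. Writing the result to the stm column of the webpage table is left out, since the report does not show that part.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: collects sitemap URLs from robots.txt text.
class RobotsSitemapSketch {

    private static final String PREFIX = "sitemap:";

    // Returns every URL declared with a "Sitemap:" line, in file order.
    static List<String> extractSitemapUrls(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments. Assumption: sitemap URLs carry no '#' fragment.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();

            // The field name is matched case-insensitively.
            if (line.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                String url = line.substring(PREFIX.length()).trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }
}
```

In a fetcher-like job, the returned list would be the value stored for the host, so that a later job can read it back and parse each sitemap.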

