The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=6&rev2=7

  ||'''Student :'''||||Cihad Güzel - [email protected]||
  ||'''Mentors :'''||||[[https://wiki.apache.org/nutch/LewisJohnMcgibbney|Lewis John McGibbney]], [[https://wiki.apache.org/nutch/talat|Talat Uyarer]]||
- == Abstract ==
+ === Abstract ===
  In the Nutch crawler, new URLs can only be discovered from pages that have already been crawled. This method is expensive. In addition, the importance and change frequency of these URLs are not known, only guessed. However, all of a site's URLs can be found in an up-to-date sitemap file. For this reason, the sitemap files of a website should be crawled. With this development, the Nutch project will gain sitemap crawler support.
- == Introduction ==
+ === Introduction ===
  A sitemap is a file that guides crawlers to crawl a website more effectively. It comes in different file formats (such as simple text, XML, RSS 2.0, and Atom 0.3 & 1.0).
@@ -23, +23 @@
   * The sitemap crawler can be monitored through reports of the errors that occur during crawling.
   * The management and configuration of the sitemap crawler are under the control of the user.
- == Project Details: ==
+ === Project Details: ===
  The aim is to strengthen the Nutch project with sitemap crawler support. The main target is to detect sitemaps that contain correct URLs and to crawl them. Finding correct URLs with a sitemap crawler is easy and fast. The software will make the following features possible.
@@ -66, +66 @@
   * The current Nutch plugins can be used.
   * There is some existing work on sitemap crawling in the Nutch project (NUTCH-1741 [1], NUTCH-1465 [2]). The project will be improved by taking the strengths and weaknesses of that work into account.
- == Timeline: ==
+ === Timeline: ===
  The project development process can be divided into two steps. First, the Nutch crawler life cycle will be updated for the sitemap crawler. Sitemaps will be crawled in a simple way before the midterm.
  In the next stage, other issues will be completed, such as sitemap detection, the filter & ranking mechanism, documentation and tests.
- ===== Pre-GSoC =====
- The studies and the comments on NUTCH-1741 [1] and NUTCH-1465 [2] will be followed.
+ '''Pre-GSoC : ''' The studies and the comments on NUTCH-1741 [1] and NUTCH-1465 [2] will be followed.
   * Week1 (25May-31May): Sitemap URL injection will be done.
   * Week2 (1June-7June): Sitemap detection will be done. FetcherJob will be updated for sitemap.
@@ -87, +86 @@
   * Week12-13 (10August-23August): Further refine tests and documentation for the whole project.
- ==== Features that will be developed after GSOC: ====
+ '''Features that will be developed after GSOC:''' Sitemap crawler report page, Sitemap monitoring page, Video Sitemaps crawler.
- Sitemap crawler report page,
- Sitemap monitoring page.
- Video Sitemaps crawler.
-
- ==== Reference: ====
+ === Reference: ===
   *[1] https://issues.apache.org/jira/browse/NUTCH-1741
   *[2] https://issues.apache.org/jira/browse/NUTCH-1465
@@ -101, +96 @@
- ==== Reports ====
+ === Reports ===
   * [[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/week1|Week1 (25May-31May)]]
   * [[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/week2|Week2 (1June-7June)]]
- ==== Documentation ====
+ === Documentation ===
  Documents will be added here.
- ==== Jira Issues ====
+ === Jira Issues ===
   * https://issues.apache.org/jira/browse/NUTCH-1741
   * https://issues.apache.org/jira/browse/NUTCH-1465
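
For illustration only: the abstract says that a sitemap exposes each URL together with its importance and change frequency. The sketch below shows how those fields could be read from an XML sitemap. It is a minimal Python example, not Nutch code and not the design proposed on the page. The function name `parse_sitemap`, the `SAMPLE_SITEMAP` document and the example.com URLs are hypothetical. The namespace and the default priority of 0.5 come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by <urlset> documents in the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap; the example.com URLs are placeholders.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with loc, changefreq and priority."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default=None, namespaces=NS)
        if not loc:
            # <loc> is the only required child; skip entries without it.
            continue
        priority_text = url.findtext("sm:priority", default=None, namespaces=NS)
        entries.append({
            "loc": loc.strip(),
            # changefreq is optional and has no default in the protocol.
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            # The protocol defines 0.5 as the priority when none is given.
            "priority": float(priority_text) if priority_text else 0.5,
        })
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE_SITEMAP):
        print(entry["loc"], entry["changefreq"], entry["priority"])
```

A crawler could use values like these to decide which URLs to fetch first and how often to revisit them, instead of guessing them from earlier crawls.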

