[Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler" by CihadGuzel

Apache Wiki Mon, 01 Jun 2015 13:11:14 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=6&rev2=7

  ||'''Student :'''||||Cihad Güzel - [email protected]||
  ||'''Mentors :'''||||[[https://wiki.apache.org/nutch/LewisJohnMcgibbney|Lewis 
John McGibbney]], [[https://wiki.apache.org/nutch/talat|Talat Uyarer]]||
  
- == Abstract ==
+ === Abstract ===
  
  In the Nutch crawler, URLs can currently be discovered only from pages that 
have already been fetched. This method is expensive. In addition, the 
importance and change frequency of these URLs are not known, only guessed. An 
up-to-date sitemap file, however, can list all of a site's URLs. For this 
reason, the sitemap files of a website should be crawled. This development will 
add sitemap crawler support to the Nutch project.
  
- == Introduction ==
+ === Introduction ===
  
  A sitemap is a file that guides crawlers through a website more effectively. 
It comes in several file formats (such as plain text, XML, RSS 2.0, and Atom 
0.3 & 1.0). 
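
  As an illustration of the XML format, the following is a minimal sketch of 
reading the URL, change frequency, and priority of each entry from a sitemap. 
It is not the project's implementation and not part of Nutch: the class 
SitemapSketch, the Entry type, and the sample URLs are hypothetical, and only 
the standard JDK XML parser is used. It assumes a well-formed XML sitemap whose 
entries use the <url>, <loc>, <changefreq>, and <priority> tags.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads <url> entries from an XML sitemap.
class SitemapSketch {

    /** One <url> entry; changeFreq and priority are null when the tag is absent. */
    static final class Entry {
        final String loc;
        final String changeFreq;
        final Double priority;

        Entry(String loc, String changeFreq, Double priority) {
            this.loc = loc;
            this.changeFreq = changeFreq;
            this.priority = priority;
        }
    }

    /** Parses the sitemap text and returns its entries in document order. */
    static List<Entry> parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // sitemap tags live in an XML namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<Entry> entries = new ArrayList<>();
        NodeList urls = doc.getElementsByTagNameNS("*", "url");
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            String loc = childText(url, "loc");
            if (loc == null || loc.isEmpty()) {
                continue; // an entry without a location is of no use to a crawler
            }
            String priority = childText(url, "priority");
            entries.add(new Entry(loc, childText(url, "changefreq"),
                    priority == null ? null : Double.valueOf(priority)));
        }
        return entries;
    }

    /** Returns the trimmed text of the first descendant with this local name, or null. */
    private static String childText(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS("*", localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/</loc>"
                + "<changefreq>daily</changefreq><priority>0.8</priority></url>"
                + "<url><loc>http://example.com/about</loc></url>"
                + "</urlset>";
        for (Entry e : parse(xml)) {
            System.out.println(e.loc + " " + e.changeFreq + " " + e.priority);
        }
    }
}
```

  The sketch keeps missing values as null instead of guessing defaults, so a 
caller can tell "not stated in the sitemap" apart from an explicit value. A 
malformed priority would raise a NumberFormatException here; a real crawler 
would likely want to handle that case and the other sitemap formats as well.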
  
@@ -23, +23 @@

   * The sitemap crawler can be monitored through reports of the errors that 
occur during crawling. 
   * The management and configuration of the sitemap crawler are under the 
user's control.
  
- == Project Details: ==
+ === Project Details: ===
  
  The aim is to strengthen the Nutch project with sitemap crawler support. The 
main goal is to detect sitemaps that contain correct URLs and to crawl them. 
Finding correct URLs with a sitemap crawler is easy and fast. The software will 
make the following features possible.
  
@@ -66, +66 @@

   * The current Nutch plugins can be used.
   * There is existing work on sitemap crawling in the Nutch project 
(NUTCH-1741 [1], NUTCH-1465 [2]). The process improves by taking into account 
the weak and strong sides of that work.
  
- == Timeline: ==
+ === Timeline: ===
  
  The project development process can be divided into two stages. First, the 
Nutch crawler life cycle will be updated for the sitemap crawler. Sitemaps will 
be crawled in a simple way before the midterm.
  In the next stage, the remaining issues will be completed, such as sitemap 
detection, the filter & ranking mechanism, documentation, and tests.
  
- ===== Pre-GSoC =====
- The studies and the comments on NUTCH-1741 [1] and NUTCH-1465 [2] will be 
followed. 
+   '''Pre-GSoC : ''' The studies and the comments on NUTCH-1741 [1] and 
NUTCH-1465 [2] will be followed. 
  
    * Week1 (25May-31May): Sitemap URL injection will be done. 
    * Week2 (1June-7June): Sitemap detection will be done. FetcherJob will be 
updated for sitemaps.
@@ -87, +86 @@

    * Week12-13 (10August-23August): Further refine tests and documentation for 
the whole project.
  
  
- ==== Features that will be developed after GSOC: ====
+   '''Features that will be developed after GSOC:''' Sitemap crawler report 
page, Sitemap monitoring page, Video Sitemaps crawler.
  
- Sitemap crawler report page,
- Sitemap monitoring page.
- Video Sitemaps crawler.
- 
- ==== Reference: ====
+ === Reference: ===
  
   *[1] https://issues.apache.org/jira/browse/NUTCH-1741
   *[2] https://issues.apache.org/jira/browse/NUTCH-1465
@@ -101, +96 @@

  
  
  
- ==== Reports ====
+ === Reports ===
   *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/week1|Week1 
(25May-31May)]]
   *  
[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/week2|Week2 
(1June-7June)]]
  
- ==== Documentation ====
+ === Documentation ===
  Documents will be added here.
  
- ==== Jira Issues ====
+ === Jira Issues ===
  
   * https://issues.apache.org/jira/browse/NUTCH-1741
   * https://issues.apache.org/jira/browse/NUTCH-1465
