= Support Sitemap Crawler in Nutch 2.x Midterm Report =

||'''Title :'''||||GSoC 2015 Midterm Report||
||'''Reporting Date :'''||||25th June 2015||
||'''Issue :'''||||[[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler|NUTCH-1741 - Support Sitemap Crawler in Nutch 2.x]]||
||'''Student :'''||||Cihad Güzel - [email protected]||
||'''Mentors :'''||||[[https://wiki.apache.org/nutch/LewisJohnMcgibbney|Lewis John McGibbney]], [[https://wiki.apache.org/nutch/talat|Talat Uyarer]]||
||'''Development Codebase :'''||||[[https://github.com/cguzel/nutch-sitemapCrawler|Github Repo Url]]||

<<TableOfContents(4)>>

== Introduction ==
This project develops a crawler for sitemap files. This report covers the work completed so far and the plans for the remaining weeks.

Research and development on the sitemap crawler were carried out over the last four weeks. The timeline given in the proposal has been followed successfully, and the aim is to complete the project on schedule in the remaining weeks.

== Previous Actions ==

Work was done to make sitemap files part of the Nutch life cycle, which consists of the Injector, Fetcher, Parser, and DbUpdater steps.

Sitemap files can be crawled in two ways:

 1. The URLs to be crawled are listed in a seed file, and each URL in the list is crawled by passing through the Nutch life cycle. If sitemap files should also be crawled, their paths are defined in the seed file. Normally, a seed file looks as follows:

  * http://www.example.com/
  * http://www.example2.com/
  * http://www.example3.com/

 If you have two sitemap files for "http://www.example.com/", you can define them in the seed file as follows:

  * http://www.example.com/ sitemaps: sitemap1.xml sitemap2.xml
  * http://www.example2.com/
  * http://www.example3.com/

 Thus the sitemap file paths to be crawled are defined manually in the seed file. When the InjectorJob runs, the paths following the “sitemaps:” label are written to the database as if they were new URLs, and these URLs are then crawled by passing through the normal Nutch life cycle (see the first sketch after this list).

 The URLs are marked as sitemaps in the database, so during the parse step only the marked URLs are parsed by the “sitemap-parser” plugin.

 
 2. Alternatively, sitemap files can be detected automatically. While URLs are fetched, the robots.txt file is checked in order to restrict crawling according to the rules it defines. Besides these rules, a robots.txt file can also list sitemap files. At fetch time, the sitemap entries listed in robots.txt are checked, and any URLs found are added to the database (see the second sketch after this list).
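
The following minimal Java sketch illustrates the seed-line convention used in the first way. The helper name and the path resolution are hypothetical stand-ins, not the actual InjectorJob code:

{{{
import java.util.ArrayList;
import java.util.List;

public class SeedLineSketch {
    /** Splits a seed line such as
     *  "http://www.example.com/ sitemaps: sitemap1.xml sitemap2.xml"
     *  into the sitemap URLs to inject. Hypothetical helper, not the
     *  actual InjectorJob code. */
    public static List<String> sitemapUrls(String line) {
        List<String> result = new ArrayList<String>();
        int label = line.indexOf("sitemaps:");
        if (label < 0) {
            return result; // plain seed line, no sitemaps declared
        }
        String siteUrl = line.substring(0, label).trim();
        String[] paths = line.substring(label + "sitemaps:".length()).trim().split("\\s+");
        for (String path : paths) {
            if (path.isEmpty()) {
                continue;
            }
            // Resolve relative paths against the site URL, e.g.
            // sitemap1.xml -> http://www.example.com/sitemap1.xml
            result.add(path.contains("://") ? path : siteUrl + path);
        }
        return result;
    }
}
}}}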
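
For the second way, Nutch delegates robots.txt parsing to the crawler-commons library, which already collects any "Sitemap:" lines it encounters. A minimal sketch of reading those entries, assuming the crawler-commons API (the agent name "nutch" is illustrative):

{{{
import java.util.List;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsSitemapSketch {
    /** Returns the sitemap URLs advertised by a fetched robots.txt. */
    public static List<String> sitemapsFromRobots(String robotsUrl, byte[] content) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(robotsUrl, content, "text/plain", "nutch");
        // crawler-commons records every "Sitemap:" line found in robots.txt.
        return rules.getSitemaps();
    }
}
}}}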

A column named “stm” (an abbreviation of “sitemap”) is added to the database to store sitemap files detected during the fetch.

During the DbUpdater step, each URL in the “stm” column is added to the database as a new row, and thus enters the Nutch life cycle.
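
The sketch below makes this promotion step concrete. It is deliberately simplified: Row, Db, and the field names are hypothetical stand-ins, not the Gora-backed Nutch 2.x storage API:

{{{
/** Hypothetical row and store types; they stand in for the
 *  Gora-backed Nutch 2.x storage API, which differs in detail. */
class Row {
    String url;        // the page URL
    String stm;        // sitemap paths collected at fetch time
    boolean isSitemap; // marker checked later by the sitemap parser
}

interface Db {
    void insert(Row row);
}

class DbUpdaterSketch {
    /** Promotes every sitemap path from the "stm" column to a new row,
     *  so that it enters the normal crawl cycle on the next round. */
    static void promoteSitemaps(Row fetched, Db db) {
        if (fetched.stm == null || fetched.stm.isEmpty()) {
            return;
        }
        for (String path : fetched.stm.split("\\s+")) {
            Row row = new Row();
            row.url = path;
            row.isSitemap = true; // only marked rows go to the sitemap parser
            db.insert(row);
        }
    }
}
}}}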

To complete a sitemap crawl, sitemap URLs added during inject pass through the cycle once, while URLs detected from robots.txt pass through twice. Injected URLs are recorded directly as new rows at inject time and parsed later; detected URLs are first added to the “stm” column, then added as new rows in the DbUpdater step, and only then parsed [3].

A parser plugin was written to handle the parsing step. After parsing a sitemap URL, this plugin sends the URLs listed in the sitemap file to the database, which completes the crawl of that sitemap URL. To activate the sitemap plugin, it must be added to “nutch-site.xml” like the other plugins; both steps are sketched below.

As a result of this work, multiple sitemap files can be parsed. In addition, only inlinks are collected after sitemap parsing; any outlinks defined in a sitemap file are ignored.
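
For reference, the crawler-commons library that Nutch depends on ships a SiteMapParser. The following sketch shows how the listed URLs could be extracted from fetched sitemap content under that assumption; the plugin's actual implementation may differ, and error handling is omitted:

{{{
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapIndex;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapParseSketch {
    /** Parses fetched sitemap content and returns the URLs it lists.
     *  A sitemap index is handled by returning its child sitemap
     *  locations, which then re-enter the crawl cycle. */
    public static List<URL> extractUrls(byte[] content, URL sitemapUrl) throws Exception {
        SiteMapParser parser = new SiteMapParser();
        AbstractSiteMap siteMap = parser.parseSiteMap(content, sitemapUrl);
        List<URL> urls = new ArrayList<URL>();
        if (siteMap.isIndex()) {
            // An index file lists further sitemap files, not page URLs.
            for (AbstractSiteMap child : ((SiteMapIndex) siteMap).getSitemaps()) {
                urls.add(child.getUrl());
            }
        } else {
            for (SiteMapURL entry : ((SiteMap) siteMap).getSiteMapUrls()) {
                urls.add(entry.getUrl());
            }
        }
        return urls;
    }
}
}}}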
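Activation could then look like the following fragment of conf/nutch-site.xml. The plugin id "parse-sitemap" is an assumption for illustration; the real id is whatever the plugin's descriptor declares, and the value must list all plugins the crawl needs:

{{{
<!-- Sketch: enable a sitemap parser plugin via the standard
     plugin.includes property ("parse-sitemap" is a placeholder id). -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|sitemap)|index-(basic|anchor)|scoring-opic</value>
</property>
}}}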

The work described above was carried out according to the following schedule [4].

 * Week 1 (25 May - 31 May): sitemap list injection
 * Week 2 (1 June - 7 June): sitemap detection
 * Weeks 3 & 4 (8 June - 21 June): sitemap parser plugin
 * Week 5 (22 June - 28 June): DbUpdater improvements for the sitemap crawler

== Future Plans ==

A basic version of sitemap crawling is now working, but each step still needs improvement. In the following weeks, improvements will be made that take the features of sitemap files into account. As set out in the proposal, the plan for the remaining weeks is as follows [5]:

 * Weeks 6 & 7 (29 June - 12 July): A sitemap ranking mechanism will be developed.
 * Week 8 (13 July - 19 July): Sitemap blacklist and sitemap error detection.
 * Week 9 (20 July - 26 July): A crawl frequency mechanism will be developed.
 * Week 10 (27 July - 2 August): Filter plugins will be updated.
 * Week 11 (3 August - 9 August): Code review and code cleaning.
 * Weeks 12 & 13 (10 August - 23 August): Further refinement of tests and documentation for the whole project.

== References ==

 * [1] https://issues.apache.org/jira/browse/NUTCH-1741
 * [2] https://issues.apache.org/jira/browse/NUTCH-1465
 * [3] https://issues.apache.org/jira/secure/attachment/12707721/SitemapCrawlerLifeCycle.pdf
 * [4] https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport
 * [5] https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler
