Hi all

I have been working on the project for 3 weeks. In the meantime, I have been
preparing a simple weekly report on the Nutch wiki[1]. You can follow my
progress on the wiki page[2] and my code development in my GitHub repo[3].

I would like to summarize this week's work. Sitemap URLs are crawled in two
ways. The first way: you can inject the sitemap URLs from a seed file using
InjectorJob. The fetched sitemaps are then parsed by the sitemap parser
plugin. The URLs obtained by parsing are written to the outlink column in the
db and are marked as sitemap.
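
For illustration, here is a minimal sketch of the parsing step, assuming the
crawler-commons SiteMapParser that Nutch bundles. The actual plugin code in my
repo may look different, and writing to the outlink column / setting the
sitemap marker is not shown here:

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import crawlercommons.sitemaps.AbstractSiteMap;
    import crawlercommons.sitemaps.SiteMap;
    import crawlercommons.sitemaps.SiteMapIndex;
    import crawlercommons.sitemaps.SiteMapParser;
    import crawlercommons.sitemaps.SiteMapURL;

    public class SitemapOutlinkSketch {

      // Parse fetched sitemap content and return the URLs it lists.
      // In the plugin these would become outlinks marked as sitemap;
      // here we only collect them.
      public static List<String> extractUrls(byte[] content, String contentType,
          String sitemapUrl) throws Exception {
        SiteMapParser parser = new SiteMapParser();
        AbstractSiteMap parsed =
            parser.parseSiteMap(contentType, content, new URL(sitemapUrl));

        List<String> urls = new ArrayList<String>();
        if (parsed.isIndex()) {
          // A sitemap index lists further sitemaps rather than pages.
          for (AbstractSiteMap child : ((SiteMapIndex) parsed).getSitemaps()) {
            urls.add(child.getUrl().toString());
          }
        } else {
          for (SiteMapURL u : ((SiteMap) parsed).getSiteMapUrls()) {
            urls.add(u.getUrl().toString());
          }
        }
        return urls;
      }
    }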

The second way: the sitemap URLs are detected from the robots.txt file. These
URLs are written to the sitemap (stm) column in the db and are also marked as
sitemap.
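
Similarly, a minimal sketch of the robots.txt detection, again assuming
crawler-commons (its robots parser collects the "Sitemap:" directives; storing
the result into the stm column is not shown):

    import java.util.List;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsSitemapSketch {

      // Parse a fetched robots.txt and return any Sitemap: URLs it declares.
      // In the crawler these would be written to the sitemap (stm) column
      // and marked as sitemap.
      public static List<String> sitemapsFromRobots(String robotsUrl,
          byte[] robotsContent, String agentName) {
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules =
            parser.parseContent(robotsUrl, robotsContent, "text/plain", agentName);
        return rules.getSitemaps();
      }
    }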

I have attached a diagram to the JIRA issue[4]. You can see the life cycle of
the sitemap parser there[5].

I am looking forward to your opinions; I will shape my work according to your
feedback.

Thanks...

[1]
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport
[2] https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler
[3] https://github.com/cguzel/nutch-sitemapCrawler
[4] https://issues.apache.org/jira/browse/NUTCH-1741
[5]
https://issues.apache.org/jira/secure/attachment/12707721/SitemapCrawlerLifeCycle.pdf

--
Kind Regards
Cihad
