Hi Lewis,

Talat and I talked about an architecture for sitemap support. We think the
problem can be solved inside the normal Nutch life cycle; we don't want to
build a separate life cycle just for sitemap crawling.

So far I have run into the following problems:

If the sitemap file is too large, it cannot be fetched and parsed; the
fetch simply times out. I worked around this temporarily, on the parse side
by raising the timeout value in nutch-site.xml, and on the fetch side by
testing with a small file. That is not a real solution.
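
For reference, the temporary workaround in nutch-site.xml looks roughly
like this. http.timeout and http.content.limit are the standard properties;
the values below are just what I experimented with, not recommendations:

  <!-- raise the network timeout and lift the content size cap
       so large sitemaps are not timed out or cut off -->
  <property>
    <name>http.timeout</name>
    <value>60000</value>
    <description>Network timeout in milliseconds (default is 10000).</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>Max bytes to download per document; -1 disables the limit.</description>
  </property>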

Moreover, as you know, sitemap files carry special tags such as "loc",
"lastmod", "changefreq" and "priority". My parse plugin extracts them, and
I want to record them in the crawldb, but the Parse object gives me no good
place for this: the outlinks it carries are just URL plus anchor, with no
metadata fields, and that isn't enough to record the per-URL sitemap values.
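
For concreteness, here is a minimal sketch of the parsing step using the
crawler-commons SiteMapParser (illustration only, not my exact plugin code;
the crawler-commons class and method names are real, the wrapper around
them is made up):

  import java.net.URL;
  import java.util.Date;
  import crawlercommons.sitemaps.AbstractSiteMap;
  import crawlercommons.sitemaps.SiteMap;
  import crawlercommons.sitemaps.SiteMapParser;
  import crawlercommons.sitemaps.SiteMapURL;

  public class SitemapParseSketch {
    // Pull the per-URL sitemap fields out of the raw fetched bytes.
    static void dumpSitemap(byte[] content, String url) throws Exception {
      SiteMapParser parser = new SiteMapParser();
      AbstractSiteMap asm =
          parser.parseSiteMap("application/xml", content, new URL(url));
      if (asm.isIndex()) {
        return; // a sitemap index lists child sitemaps, handled elsewhere
      }
      for (SiteMapURL su : ((SiteMap) asm).getSiteMapUrls()) {
        String loc = su.getUrl().toString();   // <loc>
        Date lastmod = su.getLastModified();   // <lastmod>, may be null
        double priority = su.getPriority();    // <priority>
        System.out.println(loc + " " + lastmod + " " + priority
            + " " + su.getChangeFrequency());  // <changefreq>
      }
    }
  }

Parsing is not the hard part; the hard part is where to put these values
afterwards.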

I want to record each URL from the sitemap file together with its metadata,
separately per URL.
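
The best idea I have so far is only a workaround (a sketch, and I am not
sure it is sane): keep one outlink per sitemap entry as usual, and park the
per-URL fields in ParseData's parseMeta under keys derived from the URL.
The "<url>.lastmod" key style is my own invention, not an existing Nutch
convention:

  import java.net.MalformedURLException;
  import org.apache.nutch.metadata.Metadata;
  import org.apache.nutch.parse.Outlink;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseData;
  import org.apache.nutch.parse.ParseImpl;
  import org.apache.nutch.parse.ParseStatus;

  public class SitemapToParseSketch {
    // One outlink per sitemap <url>; the sitemap fields ride along in
    // parseMeta because Outlink itself has no metadata slot.
    static Parse toParse(String[] locs, String[] lastmods, String[] priorities)
        throws MalformedURLException {
      Outlink[] outlinks = new Outlink[locs.length];
      Metadata parseMeta = new Metadata();
      for (int i = 0; i < locs.length; i++) {
        outlinks[i] = new Outlink(locs[i], ""); // URL + anchor only
        parseMeta.add(locs[i] + ".lastmod", lastmods[i]);
        parseMeta.add(locs[i] + ".priority", priorities[i]);
      }
      ParseData data = new ParseData(new ParseStatus(ParseStatus.SUCCESS),
          "sitemap", outlinks, new Metadata(), parseMeta);
      return new ParseImpl("", data);
    }
  }

Even if this works, I still don't see how the updatedb step would move
those keyed values onto the matching CrawlDatum entries, and that is really
the core of my question.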

I went through all the patches and comments on NUTCH-1465, and they do
address these same problems, but they do it by creating a new job for
sitemap crawling.

Could you show me a way out?

Thanks.
