Hi,

I have found a patch for my metadata problem [1]. However, the problem
isn't solved for 2.x [2], so I guess I need to solve it myself.

[1] https://issues.apache.org/jira/browse/NUTCH-1622
[2] https://issues.apache.org/jira/browse/NUTCH-1816

2015-07-04 15:56 GMT+03:00 Cihad Guzel <[email protected]>:

> Hi Lewis,
>
> Talat and I talked about the architecture for sitemap support. We think
> the problem can be solved within the normal Nutch life cycle; we don't want
> to build a separate life cycle just for sitemap crawling.
>
> However, I have run into the following problems:
>
> If the sitemap file is too large, it cannot be fetched and parsed; it
> times out. As a temporary workaround I raised the timeout values in
> nutch-site.xml on the parse side and tested the fetch side with a smaller
> file, but that is not a real solution.
>
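> For reference, this is the kind of change I mean in nutch-site.xml. The
> property names come from nutch-default.xml (http.timeout for the fetcher,
> parser.timeout and http.content.limit for large content); the values below
> are only illustrative, not something I would recommend:
>
>   <property>
>     <name>http.timeout</name>
>     <value>60000</value> <!-- fetch timeout in milliseconds, default 10000 -->
>   </property>
>   <property>
>     <name>parser.timeout</name>
>     <value>120</value> <!-- parser timeout in seconds, default 30 -->
>   </property>
>   <property>
>     <name>http.content.limit</name>
>     <value>-1</value> <!-- -1 disables truncation of fetched content -->
>   </property>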
> Moreover, as you know, sitemap files contain special tags such as "loc",
> "lastmod", "changefreq" and "priority". My parse plugin parses them, and I
> want to record them in the crawldb, but the Parse object doesn't support
> metadata or similar fields; it only has an outlink array, which is not
> enough for recording metadata.
>
> I want to record each URL in the sitemap file together with its metadata separately.
>
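> To make this concrete, here is a rough sketch of the per-URL data I mean,
> using crawler-commons' SiteMapParser (which I believe the NUTCH-1465 patches
> also use). The helper class and the plain map are only illustrative: since
> Outlink carries just a URL and an anchor, there is currently no place to
> hand this per-URL map over to the crawldb.
>
>   import java.net.URL;
>   import java.util.HashMap;
>   import java.util.Map;
>
>   import crawlercommons.sitemaps.AbstractSiteMap;
>   import crawlercommons.sitemaps.SiteMap;
>   import crawlercommons.sitemaps.SiteMapParser;
>   import crawlercommons.sitemaps.SiteMapURL;
>
>   public class SitemapFieldsSketch {
>     /** Collect loc/lastmod/changefreq/priority for every URL in a (non-index) sitemap. */
>     public static Map<String, Map<String, String>> extract(byte[] content,
>         String contentType, URL url) throws Exception {
>       SiteMapParser parser = new SiteMapParser();
>       AbstractSiteMap sm = parser.parseSiteMap(contentType, content, url);
>       Map<String, Map<String, String>> perUrl = new HashMap<String, Map<String, String>>();
>       if (!sm.isIndex()) {
>         for (SiteMapURL su : ((SiteMap) sm).getSiteMapUrls()) {
>           Map<String, String> meta = new HashMap<String, String>();
>           if (su.getLastModified() != null)
>             meta.put("lastmod", String.valueOf(su.getLastModified().getTime()));
>           if (su.getChangeFrequency() != null)
>             meta.put("changefreq", su.getChangeFrequency().name());
>           meta.put("priority", String.valueOf(su.getPriority()));
>           perUrl.put(su.getUrl().toString(), meta); // "loc" is the map key
>         }
>       }
>       return perUrl;
>     }
>   }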
> I reviewed all the patches and comments on NUTCH-1465, and it contains
> some solutions to the same problems, but it introduces a new, separate job
> for sitemap crawling.
>
> Could you show me a way out?
>
> Thanks.
>
