Hi Lewis.

Thanks for your suggestions. I will be thinking about this.

2015-07-10 3:47 GMT+03:00 Lewis John Mcgibbney <[email protected]>:

> Hi Cihad,
> I'll take a look tonight.
> My understanding is that this would be implemented as part of core and not
> as a plugin. Within a plugin we can, at times, have access to less verbose
> data structures. This is of course not always the case, but generally
> speaking we see more issues, depending on which interfaces we extend, with
> appropriate access to the correct data structures. We then have the issue
> of dependency management.
> I'll have a look through the various links you have sent and then write
> back here in due course.
> Apologies about the delay.
> Thanks
>
> On Mon, Jul 6, 2015 at 12:20 AM, Cihad Guzel <[email protected]> wrote:
>
>> Hi,
>>
>> I have found a patch for my metadata problem [1], but the problem isn't
>> solved for 2.x [2]. I guess I need to solve it myself.
>>
>> [1] https://issues.apache.org/jira/browse/NUTCH-1622
>> [2] https://issues.apache.org/jira/browse/NUTCH-1816
>>
>> 2015-07-04 15:56 GMT+03:00 Cihad Guzel <[email protected]>:
>>
>>> Hi Lewis,
>>>
>>> Talat and I talked about an architecture for sitemap support. We think
>>> the problem could be solved within the normal Nutch crawl life cycle; we
>>> don't want to build a separate life cycle just for sitemap crawling.
>>>
>>> So, I have the following problems:
>>>
>>> If the sitemap file is too large, it cannot be fetched and parsed; the
>>> request times out. I worked around the timeout temporarily: for parsing,
>>> by raising the timeout value in nutch-site.xml, and for fetching, by
>>> using only small files. That is not a good solution.
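>>>
>>> For example, the kind of change I made in nutch-site.xml looks roughly
>>> like the snippet below. The property names are standard ones from
>>> nutch-default.xml (http.timeout is the one I raised; I assume
>>> http.content.limit may also matter for very large sitemap files), and the
>>> values are only what I tried while experimenting, not a recommendation:
>>>
>>>   <!-- default is 10000 ms; raised so a large sitemap fetch/parse can finish -->
>>>   <property>
>>>     <name>http.timeout</name>
>>>     <value>30000</value>
>>>   </property>
>>>   <!-- default is 65536 bytes; -1 removes the limit so a big sitemap is not truncated -->
>>>   <property>
>>>     <name>http.content.limit</name>
>>>     <value>-1</value>
>>>   </property>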
>>>
>>> Moreover, as you know, sitemap files have some special tags such as "loc",
>>> "lastmod", "changefreq" and "priority". These are parsed by my parse
>>> plugin. I want to record them to the crawldb, but the Parse object doesn't
>>> support metadata or similar fields; it has only an outlink array, which
>>> isn't enough for recording metadata.
>>>
>>> I want to record each URL in the sitemap file with its metadata separately.
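>>>
>>> To make it concrete, inside my sitemap parse plugin I would like to be
>>> able to write something like the sketch below. The Outlink constructor
>>> (url + anchor) is real, but the setMetadata() call is exactly the hook
>>> that seems to be missing from the 2.x Parse/Outlink classes, so this is
>>> only an illustration of the gap, not working code:
>>>
>>>   // Hypothetical sketch: record one sitemap <url> entry with its metadata.
>>>   // Assumes org.apache.nutch.parse.Outlink, java.util.HashMap, java.util.Map.
>>>   private Outlink toOutlink(String loc, String lastmod,
>>>                             String changefreq, String priority)
>>>       throws MalformedURLException {
>>>     Outlink link = new Outlink(loc, "");   // only toUrl + anchor are supported today
>>>     Map<String, String> meta = new HashMap<>();
>>>     meta.put("lastmod", lastmod);
>>>     meta.put("changefreq", changefreq);
>>>     meta.put("priority", priority);
>>>     // link.setMetadata(meta);             // <-- the missing hook in 2.x
>>>     return link;
>>>   }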
>>>
>>> I reviewed all the patches and comments on NUTCH-1465, and there are some
>>> solutions for the same problems in it. However, that work creates a new,
>>> separate job for sitemap crawling.
>>>
>>> Could you show me a way out?
>>>
>>> Thanks.
>>>
>>
>>
>
>
> --
> *Lewis*
>
