Hi Lewis. Thanks for your suggestions. I will be thinking about this.
2015-07-10 3:47 GMT+03:00 Lewis John Mcgibbney <[email protected]>:

> Hi Cihad,
> I'll take a look tonight.
> My understanding is that this would be implemented as part of core and not
> as a plugin. Within the plugin we can, at times, have access to less verbose
> data structures. This is of course not always the case, but generally
> speaking we see more issues, depending on which interfaces we extend, with
> appropriate access to the correct data structures. We then have the issue
> of dependency management.
> I'll have a look through the various links you have sent and then write
> back here in due course.
> Apologies about the delay.
> Thanks
>
> On Mon, Jul 6, 2015 at 12:20 AM, Cihad Guzel <[email protected]> wrote:
>
>> Hi,
>>
>> I have found a patch for my metadata problem [1]. But the problem isn't
>> solved for 2.x [2]. I guess I need to solve it.
>>
>> [1] https://issues.apache.org/jira/browse/NUTCH-1622
>> [2] https://issues.apache.org/jira/browse/NUTCH-1816
>>
>> 2015-07-04 15:56 GMT+03:00 Cihad Guzel <[email protected]>:
>>
>>> Hi Lewis,
>>>
>>> Talat and I talked about the architecture for sitemap support. We thought
>>> the problem could be solved within the Nutch life cycle. We don't want to
>>> build a different life cycle for sitemap crawling.
>>>
>>> So, I have some problems, as follows:
>>>
>>> If the sitemap file is too large, it cannot be fetched and parsed.
>>> It times out. I worked around the timeout temporarily: for parsing, by
>>> raising the timeout value in nutch-site.xml, and for fetching, by working
>>> with a small file. It is not a good solution.
>>>
>>> Moreover, you know sitemap files have some special tags such as "loc",
>>> "lastmod", "changefreq" or "priority". They are parsed by my parse
>>> plugin. I want to record them in the crawldb, but the Parse object doesn't
>>> support metadata or similar fields. It only has an outlink array. That
>>> isn't enough for recording metadata.
>>>
>>> I want to record each URL in the sitemap file with its metadata separately.
>>>
>>> I looked at all the patches and comments on NUTCH-1465, and there are some
>>> solutions for the same problems in it. But a new job for sitemap crawling
>>> was created there.
>>>
>>> Could you show me a way out?
>>>
>>> Thanks.
>>>
>>
>>
>
>
> --
> *Lewis*
>
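For reference, the per-URL metadata discussed in the thread ("loc", "lastmod", "changefreq", "priority") could be pulled out of a sitemap roughly as sketched below. This is only an illustration using the JDK's DOM parser; it does not use any Nutch API, and the names SitemapSketch, Entry and parse are hypothetical, not taken from the patches mentioned above. How such a map would be attached to a crawldb record is left open, since that is exactly the question in the thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: collects one Entry per <url> element.
final class SitemapSketch {

    /** One sitemap <url> entry: its location plus the optional tags that were present. */
    static final class Entry {
        final String loc;
        final Map<String, String> metadata;

        Entry(String loc, Map<String, String> metadata) {
            this.loc = loc;
            this.metadata = metadata;
        }
    }

    // Optional child tags defined by the sitemap protocol.
    private static final String[] OPTIONAL_TAGS = {"lastmod", "changefreq", "priority"};

    static List<Entry> parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // sitemap elements live in an XML namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<Entry> entries = new ArrayList<>();
        NodeList urls = doc.getElementsByTagNameNS("*", "url");
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            String loc = childText(url, "loc");
            if (loc == null) {
                continue; // an entry without <loc> cannot be recorded
            }
            Map<String, String> metadata = new LinkedHashMap<>();
            for (String tag : OPTIONAL_TAGS) {
                String value = childText(url, tag);
                if (value != null) {
                    metadata.put(tag, value);
                }
            }
            entries.add(new Entry(loc, metadata));
        }
        return entries;
    }

    // Returns the trimmed text of the first descendant with this local name, or null.
    private static String childText(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS("*", localName);
        if (nodes.getLength() == 0) {
            return null;
        }
        String text = nodes.item(0).getTextContent().trim();
        return text.isEmpty() ? null : text;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/</loc><priority>0.8</priority></url>"
                + "</urlset>";
        for (Entry e : parse(xml)) {
            System.out.println(e.loc + " " + e.metadata);
        }
    }
}
```

Note that a DOM parser loads the whole file into memory, so for the very large sitemaps mentioned in the thread a streaming parser would probably be a better fit; the sketch only shows the shape of the per-URL data.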

