Hi, I have found a patch for my metadata problem [1]. But the problem isn't solved for 2.x [2], so I guess I need to solve it myself.
[1] https://issues.apache.org/jira/browse/NUTCH-1622
[2] https://issues.apache.org/jira/browse/NUTCH-1816

2015-07-04 15:56 GMT+03:00 Cihad Guzel <[email protected]>:

> Hi Lewis,
>
> Talat and I talked about an architecture for sitemap support. We think the
> problem can be solved within the normal Nutch life cycle; we don't want to
> build a separate life cycle for sitemap crawling.
>
> However, I have the following problems:
>
> If the sitemap file is too large, it cannot be fetched and parsed; it times
> out. I worked around the timeout temporarily: for parsing, by raising the
> timeout value in nutch-site.xml, and for fetching, by working with small
> files. This is not a good solution.
>
> Moreover, as you know, sitemap files have special tags such as "loc",
> "lastmod", "changefreq" and "priority". These are parsed by my parse plugin.
> I want to record them in the crawldb, but the Parse object doesn't support
> metadata or similar fields; it only has an outlink array, which isn't enough
> for recording metadata.
>
> I want to record each URL in the sitemap file together with its metadata.
>
> I reviewed all the patches and comments on NUTCH-1465, and it contains
> solutions to the same problems, but there a new job was created for sitemap
> crawling.
>
> Could you show me a way forward?
>
> Thanks.
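For reference, the temporary timeout workaround described in the quoted mail would look roughly like this in nutch-site.xml. The property names (http.content.limit, http.timeout, parser.timeout) are standard Nutch ones; the values are only illustrative guesses, not a recommendation:

<!-- Sketch of the temporary workaround; values are assumptions, not defaults to copy. -->
<property>
  <name>http.content.limit</name>
  <!-- -1 removes the download size cap so large sitemaps are not truncated -->
  <value>-1</value>
</property>
<property>
  <name>http.timeout</name>
  <!-- HTTP socket timeout in milliseconds, raised well above the default -->
  <value>60000</value>
</property>
<property>
  <name>parser.timeout</name>
  <!-- per-document parse timeout in seconds, raised above the default -->
  <value>120</value>
</property>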

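And here is a rough, untested sketch of how the sitemap fields could be extracted and attached as per-URL metadata for the CrawlDb, using crawler-commons' SiteMapParser (the library the NUTCH-1465 work is based on). The metadata key names and the emit() helper are invented for illustration only:

import java.net.URL;
import java.util.Date;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapMetadataSketch {

  // Parse raw sitemap content and turn each <url> entry into a CrawlDatum
  // carrying the loc/lastmod/changefreq/priority fields as metadata.
  public void collect(byte[] content, String contentType, URL sitemapUrl)
      throws Exception {
    SiteMapParser parser = new SiteMapParser();
    AbstractSiteMap sm = parser.parseSiteMap(contentType, content, sitemapUrl);

    if (sm instanceof SiteMap) {
      for (SiteMapURL su : ((SiteMap) sm).getSiteMapUrls()) {
        CrawlDatum datum = new CrawlDatum();

        Date lastMod = su.getLastModified();
        if (lastMod != null) {
          datum.setModifiedTime(lastMod.getTime());
        }
        if (su.getChangeFrequency() != null) {
          datum.getMetaData().put(new Text("sitemap.changefreq"),
              new Text(su.getChangeFrequency().toString()));
        }
        datum.getMetaData().put(new Text("sitemap.priority"),
            new Text(String.valueOf(su.getPriority())));

        // A real implementation would emit (url, datum) pairs so the CrawlDb
        // update step can merge them; that part is omitted here.
        emit(su.getUrl().toString(), datum);
      }
    }
  }

  private void emit(String url, CrawlDatum datum) {
    // placeholder: write the pair out in a MapReduce context
  }
}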
