i now have new evidence of multiple contents being stored for one url. i dumped the most recently fetched segment with nutch readseg and indeed found a record with duplicate content.
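a rough sketch for checking a whole dump for such duplicates. it assumes the readseg text dump starts each record with a `URL::` line and each stored object with a header line such as `Content::`, `ParseData::` or `ParseText::`. the function names and the sample text are made up for illustration and are not part of nutch.

```python
import re
from collections import Counter, defaultdict

# Header lines assumed to introduce each stored object in the dump.
SECTION_RE = re.compile(r"^(Content|ParseData|ParseText)::\s*$")
# Line assumed to introduce each record (one record per url).
URL_RE = re.compile(r"^URL::\s*(\S+)")


def count_sections(dump_text):
    """Return {url: Counter(section name -> number of occurrences)}."""
    counts = defaultdict(Counter)
    current_url = None
    for line in dump_text.splitlines():
        url_match = URL_RE.match(line)
        if url_match:
            current_url = url_match.group(1)
            continue
        section_match = SECTION_RE.match(line)
        if section_match and current_url is not None:
            counts[current_url][section_match.group(1)] += 1
    return counts


def duplicated(counts, section):
    """Return the urls whose record holds `section` more than once."""
    return {url: c[section] for url, c in counts.items() if c[section] > 1}


# Hypothetical, heavily shortened dump used only to exercise the functions.
SAMPLE = """\
Recno:: 0
URL:: http://example.org/a

ParseData::
Version: 5

ParseData::
Version: 5

ParseText::

Recno:: 1
URL:: http://example.org/b

ParseData::
Version: 5

ParseText::
"""

if __name__ == "__main__":
    print(duplicated(count_sections(SAMPLE), "ParseData"))
```

for a real dump, the file content would be read and passed to `count_sections` instead of `SAMPLE`. a page whose own text happens to contain a line like `Content::` would be miscounted, so the result is only a hint.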
the whole segment contains 8 crawl datums with multiple stored content. how can this be? i expect the content to be stored only once. this record also contains two ParseData objects and one ParseText:

ParseData::
Version: 5
Status: success(1,0)
Title: Stadtgemeinde Melk - www.melk.gv.at - Bürgerservice mit Zukunft! - Tourismus / Kultur & Kirche / Geschichte - Geschichtliches
Outlinks: 0
Content Metadata: Content-Length=18549 _fst_=33 nutch.segment.name=20100219012517 Set-Cookie=nostyle=false; path=/ Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0 Cache-Control=private _ftk_=1266541296308 X-AspNet-Version=2.0.50727 nutch.content.digest=bbc5838aa2875cdf908604a28cd00d96 Date=Fri, 19 Feb 2010 00:53:30 GMT nutch.crawl.score=0.06331668 Content-Type=text/html; charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

ParseData::
Version: 5
Status: success(1,0)
Title: Stadtgemeinde Melk - www.melk.gv.at - Bürgerservice mit Zukunft! - Tourismus / Kultur & Kirche / Geschichte - Geschichtliches
Outlinks: 0
Content Metadata: Content-Length=18552 _fst_=33 nutch.segment.name=20100219012517 Set-Cookie=nostyle=false; path=/ Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0 Cache-Control=private _ftk_=1266541329740 X-AspNet-Version=2.0.50727 nutch.content.digest=bbc5838aa2875cdf908604a28cd00d96 Date=Fri, 19 Feb 2010 00:54:03 GMT nutch.crawl.score=0.06511798 Content-Type=text/html; charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

ParseText::

reinhard schwab wrote:
> i implement now this tool by forking SegmentMerger.
> i have only added an additional filter in the map method and
> keep the segment name.
> i have then be surprised, that the reduce method logs 4 times the content
> of a crawl datum.
> why this?
> i have logged then the content objects and they seem to be identical.
> i have no explanation or guess how this can happen.
> ParseData and ParseText object appear once.