I now have new evidence that multiple contents are stored for one URL.
I dumped the most recently fetched segment with
nutch readseg
and indeed found a record with duplicate content.

The whole segment contains 8 crawl datums with multiple stored contents.

How can this be?
I expect the content to be stored only once.
The record also contains two ParseData objects and one ParseText:

ParseData::
Version: 5
Status: success(1,0)
Title: Stadtgemeinde Melk - www.melk.gv.at - Bürgerservice mit Zukunft!
- Tourismus / Kultur & Kirche / Geschichte - Geschichtliches
Outlinks: 0
Content Metadata: Content-Length=18549 _fst_=33
nutch.segment.name=20100219012517 Set-Cookie=nostyle=false; path=/
Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0
Cache-Control=private _ftk_=1266541296308 X-AspNet-Version=2.0.50727
nutch.content.digest=bbc5838aa2875cdf908604a28cd00d96 Date=Fri, 19 Feb
2010 00:53:30 GMT nutch.crawl.score=0.06331668 Content-Type=text/html;
charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

ParseData::
Version: 5
Status: success(1,0)
Title: Stadtgemeinde Melk - www.melk.gv.at - Bürgerservice mit Zukunft!
- Tourismus / Kultur & Kirche / Geschichte - Geschichtliches
Outlinks: 0
Content Metadata: Content-Length=18552 _fst_=33
nutch.segment.name=20100219012517 Set-Cookie=nostyle=false; path=/
Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0
Cache-Control=private _ftk_=1266541329740 X-AspNet-Version=2.0.50727
nutch.content.digest=bbc5838aa2875cdf908604a28cd00d96 Date=Fri, 19 Feb
2010 00:54:03 GMT nutch.crawl.score=0.06511798 Content-Type=text/html;
charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

ParseText::
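
The two ParseData blocks above differ in only a few metadata fields. The sketch below compares them, using values copied from the dump (the remaining metadata is left out). It assumes `_ftk_` is the fetch time in milliseconds, which is my reading of the field and is not confirmed in this thread. The helper name `field` is made up for illustration.

```python
import re

# Values copied from the two ParseData blocks above; other fields omitted.
first = ("Content-Length=18549 _fst_=33 _ftk_=1266541296308 "
         "nutch.content.digest=bbc5838aa2875cdf908604a28cd00d96")
second = ("Content-Length=18552 _fst_=33 _ftk_=1266541329740 "
          "nutch.content.digest=bbc5838aa2875cdf908604a28cd00d96")


def field(metadata, key):
    """Return the value of a whitespace-delimited key=value field, or None."""
    match = re.search(r"(?:^|\s)" + re.escape(key) + r"=(\S+)", metadata)
    return match.group(1) if match else None


same_digest = field(first, "nutch.content.digest") == field(second, "nutch.content.digest")
ftk_delta_ms = int(field(second, "_ftk_")) - int(field(first, "_ftk_"))
length_delta = int(field(second, "Content-Length")) - int(field(first, "Content-Length"))

print(same_digest, ftk_delta_ms, length_delta)  # True 33432 3
```

The digest is identical, while `_ftk_` differs by about 33 seconds, the same gap as between the two Date headers (00:53:30 and 00:54:03 GMT). That would be consistent with the URL having been fetched twice within this segment, but it does not explain why.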



reinhard schwab wrote:
> I am now implementing this tool by forking SegmentMerger.
> I have only added an additional filter in the map method and
> kept the segment name.
> I was then surprised that the reduce method logs the content
> of a crawl datum 4 times.
> Why is that?
> I then logged the content objects and they seem to be identical.
> I have no explanation or guess as to how this can happen.
> The ParseData and ParseText objects appear only once.
>
>   
