Please, Mischa, if your problem is not about deleting duplicates, just open 
another thread. Thanks!


Andrzej, thanks for everything. I will try running a diff command on the 
content of the two pages.
I will let you know the result when done.
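
For anyone following along, here is a minimal sketch of that comparison step in Python, as an alternative to the Unix diff tool. It assumes the content of the two pages has already been extracted from the readseg dump into two strings; the function name `diff_pages` and the sample content are hypothetical and only for illustration.

```python
import difflib


def diff_pages(content_a, content_b, name_a="page_a", name_b="page_b"):
    """Return the unified-diff lines between two page contents.

    An empty list means the two contents are identical line by line.
    """
    lines_a = content_a.splitlines(keepends=True)
    lines_b = content_b.splitlines(keepends=True)
    return list(
        difflib.unified_diff(lines_a, lines_b, fromfile=name_a, tofile=name_b)
    )


if __name__ == "__main__":
    # Hypothetical sample content: two pages that look the same in a browser
    # but differ in one generated comment, which is enough to change a
    # signature computed over the raw content.
    a = "<html>\n<body>Hello</body>\n<!-- generated 11:42:49 -->\n</html>\n"
    b = "<html>\n<body>Hello</body>\n<!-- generated 11:42:51 -->\n</html>\n"
    for line in diff_pages(a, b):
        print(line, end="")
```

If the output is empty, the two contents match line by line; any `-`/`+` lines show where they differ.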




> From: mischa.tuffi...@garlik.com
> Subject: Re: dedup dont delete duplicates !
> Date: Wed, 25 Nov 2009 11:45:21 +0000
> To: nutch-user@lucene.apache.org
> 
> Hello All, 
> 
> I am getting the following error in my hadoop.log (see below). It seems to 
> happen every time I run any of the nutch command line tools :(
> 
> <!--
> 
> 2009-11-25 11:42:49,299 INFO  crawl.Injector - Injector: done
> 2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient - leasechec...@dfsclient[clientname=dfsclient_-822770266, ugi=nutch,nutch]: java.lang.Throwable: for testing
>       at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:992)
>       at java.lang.String.valueOf(String.java:2827)
>       at java.lang.StringBuilder.append(StringBuilder.java:115)
>       at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:981)
>       at java.lang.Thread.run(Thread.java:619)
>  is interrupted.
> java.lang.InterruptedException: sleep interrupted
>       at java.lang.Thread.sleep(Native Method)
>       at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:978)
>       at java.lang.Thread.run(Thread.java:619)
> 
> --> 
> 
> Does anyone know what problem I am having?
> 
> Cheers, 
> 
> Mischa 
> 
> On 25 Nov 2009, at 09:15, Andrzej Bialecki wrote:
> 
> > BELLINI ADAM wrote:
> >> hi,
> >> my two URLs point to the same page!
> > 
> > Please, no need to shout ...
> > 
> > If the MD5 signatures are different, then the binary content of these pages 
> > is different, period.
> > 
> > Use the readseg -dump utility to retrieve the page content from the segment, 
> > extract just the two pages from the dump, and run the Unix diff utility on them.
> > 
> >> can you please tell me more about TextProfileSignature? how should i
> >> use it?
> > 
> > Configure this type of signature in your nutch-site.xml - please see the 
> > nutch-default.xml for instructions. Please note that you will have to 
> > re-parse segments and update the db in order to update the signatures.
> > 
> > 
> > 
> > -- 
> > Best regards,
> > Andrzej Bialecki     <><
> > ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> > 
> 
> ___________________________________
> Mischa Tuffield
> Email: mischa.tuffi...@garlik.com
> Homepage - http://mmt.me.uk/
> Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
> +44(0)20 8973 2465  http://www.garlik.com/
> Registered in England and Wales 535 7233 VAT # 849 0517 11
> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
> 
                                          