Please, Mischa, if your problem is not about deleting duplicates, just open another thread! Thanks.
Andrzej, thanks for everything. I will try running a diff command on the content of the two pages, and I will let you know when it is done.

> From: mischa.tuffi...@garlik.com
> Subject: Re: dedup dont delete duplicates !
> Date: Wed, 25 Nov 2009 11:45:21 +0000
> To: nutch-user@lucene.apache.org
>
> Hello All,
>
> I am getting the following error in my hadoop.log (see below). It seems to
> happen every time I run any of the nutch command line tools :(
>
> <!--
>
> 2009-11-25 11:42:49,299 INFO crawl.Injector - Injector: done
> 2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient -
> leasechec...@dfsclient[clientname=dfsclient_-822770266, ugi=nutch,nutch]:
> java.lang.Throwable: for testing
>     at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:992)
>     at java.lang.String.valueOf(String.java:2827)
>     at java.lang.StringBuilder.append(StringBuilder.java:115)
>     at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:981)
>     at java.lang.Thread.run(Thread.java:619)
> is interrupted.
> java.lang.InterruptedException: sleep interrupted
>     at java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:978)
>     at java.lang.Thread.run(Thread.java:619)
>
> -->
>
> Does anyone know what problem I am having ?
>
> Cheers,
>
> Mischa
>
> On 25 Nov 2009, at 09:15, Andrzej Bialecki wrote:
>
> > BELLINI ADAM wrote:
> >> hi,
> >> my two urls points to the same page !
> >
> > Please, no need to shout ...
> >
> > If the MD5 signatures are different, then the binary content of these pages
> > is different, period.
> >
> > Use readseg -dump utility to retrieve the page content from the segment,
> > extract just the two pages from the dump, and run a unix diff utility.
> >
> >> can you tell me plz more about TextProfileSignature ? how should i
> >> use it
> >
> > Configure this type of signature in your nutch-site.xml - please see the
> > nutch-default.xml for instructions. Please note that you will have to
> > re-parse segments and update the db in order to update the signatures.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > ___. ___ ___ ___ _ _ __________________________________
> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> > ___|||__|| \| || | Embedded Unix, System Integration
> > http://www.sigram.com Contact: info at sigram dot com
>
> ___________________________________
> Mischa Tuffield
> Email: mischa.tuffi...@garlik.com
> Homepage - http://mmt.me.uk/
> Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
> +44(0)20 8973 2465 http://www.garlik.com/
> Registered in England and Wales 535 7233 VAT # 849 0517 11
> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
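
For the diff step suggested in the quoted reply, here is a minimal sketch in Python. It assumes the two page bodies have already been pulled out of the readseg dump by hand; the helper names (`md5_hex`, `diff_pages`) and the sample page contents are hypothetical and are not part of Nutch. It only illustrates the point made above: a difference of a few bytes is enough to give two pages different MD5 digests, and a line diff shows where they differ.

```python
import difflib
import hashlib
from typing import List


def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of the raw page bytes as a hex string."""
    return hashlib.md5(content).hexdigest()


def diff_pages(a: bytes, b: bytes,
               name_a: str = "page_a", name_b: str = "page_b") -> List[str]:
    """Return a unified diff of two pages, decoded leniently as UTF-8."""
    lines_a = a.decode("utf-8", errors="replace").splitlines()
    lines_b = b.decode("utf-8", errors="replace").splitlines()
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=name_a, tofile=name_b,
                                     lineterm=""))


if __name__ == "__main__":
    # Hypothetical page bodies: the same page fetched through two URLs,
    # differing only in one generated line.
    page_a = b"<html>\n<body>Hello</body>\n<!-- generated 11:42:49 -->\n</html>\n"
    page_b = b"<html>\n<body>Hello</body>\n<!-- generated 11:42:51 -->\n</html>\n"

    print("MD5 a:", md5_hex(page_a))
    print("MD5 b:", md5_hex(page_b))
    print("identical bytes:", page_a == page_b)
    for line in diff_pages(page_a, page_b):
        print(line)
```

If the diff shows only things like timestamps or session ids, that would explain why an exact-content signature treats the two URLs as different pages.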