This may be the case if we run Nutch only once on the crawled directory.
What I am doing is running the Nutch Crawl tool on an already existing
crawl directory, after modifying CrawlTool a little: if the db directory
already exists, it does not create it and does not return any error
message. But after doing this, the content of files such as
"db\webdb\linksByMD5" is nearly triple what it was after a single run.
How is it possible to run Nutch more than once on the same crawl
directory? Do you think I am wrong somewhere in my approach? A rough
sketch of my change is below.
      Answer awaited...
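
A minimal sketch of the change (hypothetical names, not the actual
CrawlTool source; the println calls stand in for whatever db-creation
step your version of CrawlTool performs):

import java.io.File;

// Skip db creation when the directory already exists, so a second run
// reuses the existing web db instead of failing.
public class CrawlDirCheck {
  public static void main(String[] args) {
    File crawlDir = new File(args.length > 0 ? args[0] : "crawl");
    File dbDir = new File(crawlDir, "db");
    if (!dbDir.exists()) {
      dbDir.mkdirs();
      System.out.println("created new web db dir: " + dbDir);
      // here CrawlTool's original db-creation call would run
    } else {
      System.out.println("reusing existing web db dir: " + dbDir);
      // no error message; the later steps update the db in place
    }
  }
}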


On 12/16/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> The web DB itself handles duplicate URLs by ignoring the duplicates.
> So if you inject yahoo.in 100 times, the webdb will only have one
> entry.
>
>
> On 16.12.2005 at 09:04, Arun Kumar Sharma wrote:
>
> > Hi
> >       I have a list of URLs which may contain duplicates. I want
> > to make sure there is no duplicate URL insertion through
> > WebDBInjector. Is there any way to achieve this using Nutch
> > functionality?
> >      answer awaited anxiously...
> >
> >
> > Regards,
> >
> > Arun Kumar Sharma (Tech Lead -Java/J2EE)
> > Mob: +91.981.529.5761
> >
>
>

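P.S. On the duplicate-URL question quoted above: since the webdb
ignores duplicates anyway, pre-filtering is optional, but if you want
the seed list itself to be unique before handing it to WebDBInjector,
a plain-JDK filter is enough. A minimal sketch (the file names
"urls.txt" and "urls.unique.txt" are made up):

import java.io.*;
import java.util.LinkedHashSet;
import java.util.Set;

// De-duplicate a seed URL list before injection.  A LinkedHashSet
// drops repeats while keeping first-seen order.
public class DedupUrls {
  public static void main(String[] args) throws IOException {
    Set<String> seen = new LinkedHashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader("urls.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() > 0) seen.add(line);  // repeats are ignored
    }
    in.close();
    PrintWriter out = new PrintWriter(new FileWriter("urls.unique.txt"));
    for (String url : seen) out.println(url);
    out.close();
  }
}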