Thanks Reinhard. I checked this, but both files are the same.

To elaborate, I am downloading images using Nutch, so I have changed both 
files and removed jpg, gif, png etc. from the extensions to be skipped. What I 
see is that if I use the "crawl" command, I get all image URLs in LinkDB, but 
if I execute the commands separately, I see only absolute links to images. All 
relative links are missing from LinkDB. (For example, if an HTML page has an 
image URL like "http://www.abc.com/img/img.jpg", I can see it in LinkDB in 
both cases, but if the image URL is like "/img/img.jpg", it is missing from 
LinkDB when the commands are executed separately.)
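
For reference, both forms of the link would normally resolve to the same 
absolute URL once the relative one is combined with the page URL. The sketch 
below only illustrates that resolution step with Python's standard library; it 
is not Nutch code, and the page URL is a made-up example.

```python
from urllib.parse import urljoin


def resolve_outlink(page_url: str, href: str) -> str:
    """Resolve an href found on page_url to an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical page that contains the image links (not from the original post).
page = "http://www.abc.com/products/index.html"

# An absolute href is returned unchanged; a root-relative href is resolved
# against the scheme and host of the page.
absolute = resolve_outlink(page, "http://www.abc.com/img/img.jpg")
relative = resolve_outlink(page, "/img/img.jpg")

print(absolute)
print(relative)
```

If the two printed URLs match, any difference in LinkDB is more likely to come 
from how the link is parsed, normalized or filtered than from the URL itself.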

Any thoughts?

TIA,
--Hrishi

-----Original Message-----
From: reinhard schwab [mailto:[email protected]]
Sent: Tuesday, September 01, 2009 3:19 PM
To: [email protected]
Subject: Re: LinkDB size difference

you can dump the linkdb and analyze where it differs.
my guess is that you have different urls there because crawl uses
crawl-urlfilter.txt to filter urls
and fetch uses regex-urlfilter.txt,
so the filters are different.
i can't explain why. i have not implemented this. i have only experienced
the difference myself.
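
A rough way to check whether two filter files treat a given URL differently 
is to apply their rules by hand. The sketch below assumes the usual layout of 
these files: one rule per line, a leading "+" (accept) or "-" (reject) followed 
by a regular expression, "#" for comments, and the first matching rule 
deciding. It uses Python's regex engine rather than Java's, the function names 
are hypothetical, and the two rule sets are invented examples, so treat it as 
an approximation only.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the decision of the first matching rule; reject if none match."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Invented rule sets: one still skips image suffixes, one accepts everything.
images_skipped = "\n".join([
    "# skip image suffixes",
    r"-\.(gif|jpg|png)$",
    "+.",
])
images_allowed = "+."

url = "http://www.abc.com/img/img.jpg"
skipped_result = accepts(load_rules(images_skipped), url)
allowed_result = accepts(load_rules(images_allowed), url)
print(skipped_result, allowed_result)
```

Running the same image URL through the rules from both files shows quickly 
whether the filters really behave the same for that URL.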

how to dump the linkdb:

reinh...@thord:>bin/nutch readlinkdb
Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
        -dump <out_dir> dump whole link db to a text file in <out_dir>
        -url <url>      print information about <url> to System.out
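
once both linkdbs are dumped, the two text dumps can be compared with a small 
script. this is only a sketch: it assumes that each entry in the dump starts 
with an unindented line whose first tab-separated field is the target url, and 
that the inlink lines below it are indented. check this against your own dump 
before relying on it. the function names and the sample data are made up.

```python
def dumped_urls(dump_text):
    """Collect target URLs from a linkdb text dump (assumed layout, see above)."""
    urls = set()
    for line in dump_text.splitlines():
        # Skip blank lines and indented inlink lines.
        if not line.strip() or line[0].isspace():
            continue
        urls.add(line.split("\t", 1)[0].strip())
    return urls


def compare_dumps(crawl_dump, stepwise_dump):
    """Return (only in crawl dump, only in step-by-step dump), both sorted."""
    crawl_urls = dumped_urls(crawl_dump)
    stepwise_urls = dumped_urls(stepwise_dump)
    return sorted(crawl_urls - stepwise_urls), sorted(stepwise_urls - crawl_urls)


# Made-up sample dumps in the assumed layout.
crawl_sample = (
    "http://www.abc.com/img/img.jpg\tInlinks:\n"
    " fromUrl: http://www.abc.com/ anchor: \n"
    "\n"
    "http://www.abc.com/img/logo.png\tInlinks:\n"
    " fromUrl: http://www.abc.com/ anchor: \n"
)
stepwise_sample = (
    "http://www.abc.com/img/img.jpg\tInlinks:\n"
    " fromUrl: http://www.abc.com/ anchor: \n"
)

only_in_crawl, only_in_stepwise = compare_dumps(crawl_sample, stepwise_sample)
print("only in crawl dump:", only_in_crawl)
print("only in step-by-step dump:", only_in_stepwise)
```

with real dumps, read the dump files from the two out_dir directories into 
strings and pass them to compare_dumps.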




Hrishikesh Agashe schrieb:
> Hi,
>
> I am observing that the size of LinkDB is different when I do a run for the 
> same URLs with the "crawl" command (intranet crawling) as compared to 
> running the individual commands (inject, generate, fetch, invertlink etc., 
> i.e. Internet crawl).
> Are there any parameters that Nutch passes to invertlink while running with 
> the "crawl" option?
>
> TIA,
> --Hrishi
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is the 
> property of Persistent Systems Ltd. It is intended only for the use of the 
> individual or entity to which it is addressed. If you are not the intended 
> recipient, you are not authorized to read, retain, copy, print, distribute or 
> use this message. If you have received this communication in error, please 
> notify the sender and delete all copies of this message. Persistent Systems 
> Ltd. does not accept any liability for virus infected mails.
>
>


