Re: linkdbmerge

Murat Ali Bayir Tue, 01 Aug 2006 23:56:42 -0700

Assume that we have no restriction on db.max.inlinks, and that we have two crawls: crawl_depth1, and then a continuation of the same crawl, crawl_depth2. There are two ways of obtaining the final linkdb.

The first one is to run

./nutch invertlinks linkdb_depth1 segment_depth1
./nutch invertlinks linkdb_depth2 segment_depth2
./nutch mergelinkdb final_linkdb1 linkdb_depth1 linkdb_depth2


and the second one is to run

./nutch invertlinks final_linkdb2 segment_depth1 segment_depth2

Is there any difference between final_linkdb1 and final_linkdb2? I mean, is the merge operation lossless in this case?
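
One way to reason about the question is with a toy model. The sketch below is not Nutch code: it assumes a linkdb is simply a map from target URL to a set of source URLs, that there is no inlink limit, and that no URL filtering or normalization changes the links. The names invert_links and merge_linkdbs and the sample segments are hypothetical stand-ins for the two commands. Under those assumptions, both routes give the same result, because merging is a per-URL set union.

```python
from collections import defaultdict

def invert_links(*segments):
    """Toy stand-in for invertlinks: build target URL -> set of source URLs."""
    linkdb = defaultdict(set)
    for segment in segments:
        for source, target in segment:
            linkdb[target].add(source)
    return dict(linkdb)

def merge_linkdbs(*linkdbs):
    """Toy stand-in for mergelinkdb: union the inlink sets per target URL."""
    merged = defaultdict(set)
    for linkdb in linkdbs:
        for target, sources in linkdb.items():
            merged[target] |= sources
    return dict(merged)

# Hypothetical segments, each a list of (source, target) outlinks.
segment_depth1 = [("URLa", "URLx"), ("URLb", "URLx"), ("URLc", "URLx")]
segment_depth2 = [("URLa", "URLx"), ("URLk", "URLx"), ("URLx", "URLy")]

# Route 1: invert each segment separately, then merge the two linkdbs.
final_linkdb1 = merge_linkdbs(invert_links(segment_depth1),
                              invert_links(segment_depth2))
# Route 2: invert both segments in one pass.
final_linkdb2 = invert_links(segment_depth1, segment_depth2)

print(final_linkdb1 == final_linkdb2)
```

Whether a real linkdb behaves this way also depends on details the model leaves out, such as anchor text and any filters applied during inversion or merging.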



Andrzej Bialecki wrote:

Murat Ali Bayir wrote:
Hi everybody, I want to know how the mergelinkdb function works. Assume that we have two linkdbs: in the first one, URLx is referred to by URLa, URLb and URLc; in the second one, the same URLx is referred to by URLa and URLk. I want to know the structure of the output linkdb.
Does it contain one entry for URLx, referred to by URLa, URLb, URLc and URLk, or does it just append the second linkdb to the first one and contain two entries for URLx, as given below?
URLx <- URLa, URLb, URLc and
..
..
..
URLx <- URLa, URLk
No, these two entries are merged into one (that's why the name :) ). At any given time, a valid linkdb contains zero or one entry for any given target URL.
You should note that there is a limit on how many inlinks we are going to store for any given URL (db.max.inlinks), which may lead to some surprises. If, e.g., linkdbA has already hit that limit and the other, linkdbB, hasn't, then two scenarios are possible: either you get a list containing all links from linkdbA and none from linkdbB, or you get a list containing all links from linkdbB plus some links from linkdbA ...
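
The two scenarios can be reproduced with a small sketch. This is a simplified model, not the Nutch implementation: merge_with_cap is a hypothetical helper, a linkdb is modeled as a map from target URL to a list of inlinks, and the order in which the two inputs are processed is passed in explicitly as an assumption (the sketch makes no claim about how Nutch decides that order).

```python
def merge_with_cap(first, second, max_inlinks):
    """Merge two toy linkdbs, keeping at most max_inlinks per target URL.

    Inputs are processed in the order given; once a target's list is
    full, any further inlinks for it are dropped.
    """
    merged = {}
    for linkdb in (first, second):
        for target, sources in linkdb.items():
            kept = merged.setdefault(target, [])
            for source in sources:
                if len(kept) >= max_inlinks:
                    break  # limit reached, remaining inlinks are lost
                if source not in kept:
                    kept.append(source)
    return merged

# Hypothetical data: linkdb_a is already at the assumed limit of 3 inlinks.
linkdb_a = {"URLx": ["URLa", "URLb", "URLc"]}
linkdb_b = {"URLx": ["URLa", "URLk"]}

a_first = merge_with_cap(linkdb_a, linkdb_b, max_inlinks=3)
b_first = merge_with_cap(linkdb_b, linkdb_a, max_inlinks=3)

print(a_first)  # {'URLx': ['URLa', 'URLb', 'URLc']}
print(b_first)  # {'URLx': ['URLa', 'URLk', 'URLb']}
```

With linkdb_a processed first, URLk from linkdb_b is dropped; with linkdb_b first, all of its links survive and only some of linkdb_a's fit under the limit.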
