Assume that we have no restriction on max.inlinks, and that we have two crawls: crawl_depth1, and then a continuation of the same crawl, crawl_depth2. There are two ways to obtain the final linkdb. The first is to run

./nutch invertlinks linkdb_depth1 segment_depth1
./nutch invertlinks linkdb_depth2 segment_depth2
./nutch mergelinkdb final_linkdb1 linkdb_depth1 linkdb_depth2

and the second is to run

./nutch invertlinks final_linkdb2 segment_depth1 segment_depth2

Is there any difference between final_linkdb1 and final_linkdb2? I mean, is the merge operation lossless in this case?

Andrzej Bialecki wrote:
> Murat Ali Bayir wrote:
>
>> Hi everybody, I want to know how the mergelinkdb function works. Assume
>> that we have two linkdbs. In the first one,
>> URLx is referred to by URLa, URLb and URLc; in the second one, the same
>> URLx is referred to by URLa and URLk. I want to
>> know the structure of the output linkdb.
>> Does it contain one entry for URLx, referred to by URLa, URLb, URLc and
>> URLk, or does it
>> just append the second linkdb to the first one and contain two entries for
>> URLx, as given below?
>> URLx <- URLa, URLb, URLc and
>> ..
>> ..
>> ..
>> URLx <- URLa, URLk
>>
>
> No, these two entries are merged into one (that's why the name :) ).
> At any given time, in a valid linkdb there is exactly zero or one
> entry for any given target URL.
>
> You should note that there is a limit set on how many inlinks we are
> going to store for any given URL (db.max.inlinks) - which may lead to
> some surprises. If e.g. the linkdbA already hit that limit, and the
> other linkdbB didn't, then two scenarios are possible - either you get
> the list just containing all links from linkdbA and none from linkdbB,
> or you get the list containing all links from linkdbB plus some links
> from linkdbA ...
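
To make the two cases concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Nutch's actual implementation: the functions invert_links and merge_linkdbs are hypothetical, a segment is assumed to be a plain dict from source URL to its outlinks, and the "keep the first N inlinks seen" rule for the cap is an assumption made for illustration.

```python
def add_inlink(inlinks, source, max_inlinks):
    """Append source unless it is a duplicate or the (assumed) cap is reached."""
    if source in inlinks:
        return
    if max_inlinks is None or len(inlinks) < max_inlinks:
        inlinks.append(source)


def invert_links(*segments, max_inlinks=None):
    """Toy stand-in for invertlinks: build target URL -> list of source URLs."""
    linkdb = {}
    for segment in segments:
        for source, outlinks in segment.items():
            for target in outlinks:
                add_inlink(linkdb.setdefault(target, []), source, max_inlinks)
    return linkdb


def merge_linkdbs(*linkdbs, max_inlinks=None):
    """Toy stand-in for mergelinkdb: one entry per target, inlinks combined."""
    merged = {}
    for linkdb in linkdbs:
        for target, sources in linkdb.items():
            for source in sources:
                add_inlink(merged.setdefault(target, []), source, max_inlinks)
    return merged


def as_sets(linkdb):
    """Ignore ordering when comparing two linkdbs."""
    return {target: set(sources) for target, sources in linkdb.items()}


# The example from the quoted question, expressed as outlinks per source URL.
segment_depth1 = {"URLa": ["URLx"], "URLb": ["URLx"], "URLc": ["URLx"]}
segment_depth2 = {"URLa": ["URLx"], "URLk": ["URLx"]}

# Case 1: invert each segment separately, then merge the two linkdbs.
linkdb_depth1 = invert_links(segment_depth1)
linkdb_depth2 = invert_links(segment_depth2)
final_linkdb1 = merge_linkdbs(linkdb_depth1, linkdb_depth2)

# Case 2: invert both segments in a single pass.
final_linkdb2 = invert_links(segment_depth1, segment_depth2)

print(as_sets(final_linkdb1) == as_sets(final_linkdb2))

# With a cap of 3 inlinks, the outcome depends on which linkdb is read first.
capped_ab = merge_linkdbs(linkdb_depth1, linkdb_depth2, max_inlinks=3)
capped_ba = merge_linkdbs(linkdb_depth2, linkdb_depth1, max_inlinks=3)
print(capped_ab["URLx"])
print(capped_ba["URLx"])
```

In this model, without a cap both routes give a single URLx entry holding the same set of inlinks, so nothing is lost by merging. Once a cap applies, some inlinks are dropped and which ones survive depends on processing order, which is the kind of surprise mentioned in the reply.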
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
