Re: OPIC score calculation issues

2006-03-14 Thread Andrzej Bialecki

(Better late than never... I forgot that I hadn't yet responded to your posting.)

Doug Cutting wrote:
I think all that you're saying is that we should not run two CrawlDB 
updates at once, right?  But there are lots of reasons we cannot do 
that besides the OPIC calculation.


When we used WebDB it was possible to overlap generate / fetch / update 
cycles, because we would lock pages selected by FetchListTool for a 
period of time.


Now we don't do this. The advantage is that we don't have to rewrite 
CrawlDB. But since operations on CrawlDB are considerably faster than on 
WebDB, perhaps we should consider going back to this method?
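Just to make the idea concrete, here is a minimal sketch of such time-based 
locking in plain Java (the class, field and period names are made up for 
illustration; this is not actual Nutch code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of locking generated pages for a period of time,
 *  roughly as described for WebDB/FetchListTool above.
 *  Names are hypothetical, not Nutch API. */
public class GenerateLockSketch {
  static final long LOCK_PERIOD = 7L * 24 * 60 * 60 * 1000;  // e.g. one week, in ms

  // url -> time it was last put on a fetchlist
  static Map<String, Long> lastGenerated = new HashMap<String, Long>();

  /** Select candidates, skipping URLs still locked by a previous generate. */
  static List<String> generate(List<String> candidates, long now) {
    List<String> fetchlist = new ArrayList<String>();
    for (String url : candidates) {
      Long t = lastGenerated.get(url);
      if (t != null && now - t < LOCK_PERIOD) continue;  // still locked, skip
      lastGenerated.put(url, now);                       // lock it for LOCK_PERIOD
      fetchlist.add(url);
    }
    return fetchlist;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList("http://a/", "http://b/");
    long now = System.currentTimeMillis();
    System.out.println(generate(urls, now));        // both selected and locked
    System.out.println(generate(urls, now + 1000)); // both skipped, still locked
  }
}

Doug's suggestion in the next message of a status for generated crawl data 
would persist essentially the same information in the CrawlDb itself.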




Also, the cash value of those outlinks that point to URLs not in 
the current fetchlist will be dropped, because they won't be 
collected anywhere.


No, every cash value is used.  The input to a crawl db update includes a
CrawlDatum for every known url, including those just linked to.  If 
the only CrawlDatum for a url is a new outlink from a page crawled, 
then the  score for the page is 1.0 + the score of that outlink.


Of course, you are right, I missed this.


And a final note: CrawlDB.update() uses the initial score value 
recorded in the segment, and NOT the value that is actually found in 
CrawlDB at the time of the update. This means that if there was 
another update in the meantime, your new score in CrawlDB will be 
overwritten with the score based on an older initial value. This is 
counter-intuitive - I think CrawlDB.update() should always use the 
latest score value found in the current CrawlDB. I.e. in 
CrawlDBReducer instead of doing:


 result.setScore(result.getScore() + scoreIncrement);

we should do:

 result.setScore(old.getScore() + scoreIncrement);


The change is not quite that simple, since 'old' is sometimes null.  
So perhaps we need to add a 'score' variable that is set to old.score 
when old != null and to 1.0 otherwise (for newly linked pages).


The reason I didn't do it that way was to permit the Fetcher to modify 
scores, since I was thinking of the Fetcher as the actor whose actions 
are being processed here, and of the CrawlDb as the passive thing 
acted on.  But indeed, if you have another process that's updating a 
CrawlDb while a Fetcher is running, this may not be the case.  So if 
we want to switch things so that the Fetcher is not permitted to 
adjust scores, then this seems like a reasonable change.


I would vote for implementing this change. The reason is that the active 
actor that computes new scores is CrawlDb.update(). Fetcher may provide 
additional information to affect the score, but IMHO the logic to 
calculate new scores should be concentrated in the update() method.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: OPIC score calculation issues

2006-03-14 Thread Doug Cutting

Andrzej Bialecki wrote:
When we used WebDB it was possible to overlap generate / fetch / update 
cycles, because we would lock pages selected by FetchListTool for a 
period of time.


Now we don't do this. The advantage is that we don't have to rewrite 
CrawlDB. But since operations on CrawlDB are considerably faster than on 
WebDB, perhaps we should consider going back to this method?


Yes, this would be a good addition.

Ideally we should change Crawl.java to overlap these too.  When -topN is 
specified and substantially smaller than the total size of the crawldb, 
then we can generate, start a fetch job, then generate again.  As each 
fetch completes, we can start the next, then run an update and generate 
based on the just-completed fetch, so that we're constantly fetching.


This could be implemented by: (a) adding a status for generated crawl 
data; (b) adding an option to updatedb to include the generated output 
from some segments.  Then, in the above algorithm, the first time we'd 
update with only the generator output, but after that we can combine 
the updates with fetcher and generator output.  This way, in the course 
of a crawl, we only re-write the crawldb one additional time, rather 
than twice as many times.  Does this make sense?
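To make the pipelining concrete, here is a rough, self-contained Java sketch 
of the loop (generateSegment(), fetchSegment() and updateDb() are placeholders 
standing in for the real Nutch jobs, not actual API):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch of the overlapped generate/fetch/update cycle described above.
 *  The three static methods are placeholders for the corresponding Nutch
 *  jobs; only the ordering and overlap are the point here. */
public class OverlappedCrawlSketch {
  static String generateSegment(int i) { System.out.println("generate " + i); return "segment-" + i; }
  static void fetchSegment(String seg) { System.out.println("fetch " + seg); }
  static void updateDb(String... segs) { System.out.println("update from " + String.join(",", segs)); }

  public static void main(String[] args) throws Exception {
    ExecutorService fetcher = Executors.newSingleThreadExecutor();
    int rounds = 3;

    String first = generateSegment(0);
    updateDb(first);                                   // first update: generator output only
    Future<?> running = fetcher.submit(() -> fetchSegment(first));

    for (int i = 1; i < rounds; i++) {
      String next = generateSegment(i);                // generate while the previous fetch runs
      running.get();                                   // wait for the previous fetch to finish
      updateDb("segment-" + (i - 1), next);            // combine fetcher + generator output
      running = fetcher.submit(() -> fetchSegment(next));
    }
    running.get();
    updateDb("segment-" + (rounds - 1));               // final update from the last fetch
    fetcher.shutdown();
  }
}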


And a final note: CrawlDB.update() uses the initial score value 
recorded in the segment, and NOT the value that is actually found in 
CrawlDB at the time of the update. This means that if there was 
another update in the meantime, your new score in CrawlDB will be 
overwritten with the score based on an older initial value. This is 
counter-intuitive - I think CrawlDB.update() should always use the 
latest score value found in the current CrawlDB. I.e. in 
CrawlDBReducer instead of doing:


 result.setScore(result.getScore() + scoreIncrement);

we should do:

 result.setScore(old.getScore() + scoreIncrement);



The change is not quite that simple, since 'old' is sometimes null.  
So perhaps we need to add a 'score' variable that is set to old.score 
when old != null and to 1.0 otherwise (for newly linked pages).


The reason I didn't do it that way was to permit the Fetcher to modify 
scores, since I was thinking of the Fetcher as the actor whose actions 
are being processed here, and of the CrawlDb as the passive thing 
acted on.  But indeed, if you have another process that's updating a 
CrawlDb while a Fetcher is running, this may not be the case.  So if 
we want to switch things so that the Fetcher is not permitted to 
adjust scores, then this seems like a reasonable change.


I would vote for implementing this change. The reason is that the active 
actor that computes new scores is CrawlDb.update(). Fetcher may provide 
additional information to affect the score, but IMHO the logic to 
calculate new scores should be concentrated in the update() method.


I agree: +1.  I was just trying to explain the existing logic.  I think 
this would provide a significant improvement, with little lost.


Doug


Re: OPIC score calculation issues

2006-02-28 Thread Doug Cutting

Andrzej Bialecki wrote:
* CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s 
from crawl_parse with the same URL, which means that we get:


   * the original CrawlDatum
   * (optionally a CrawlDatum that contains just a Signature)
   * all CrawlDatum.LINKED entries pointing to our URL, generated by outlinks from other pages.

Based on this information, a new score is calculated by adding the original score and all scores from incoming links.

HOWEVER... and here's where I suspect the current code is wrong: since 
we are processing just one segment, the incoming link information is very 
incomplete, because it comes only from the outlinks discovered by 
fetching this segment's fetchlist, and not from the complete LinkDB.


I think the code is correct.  OPIC is an incremental algorithm, designed 
to be calculated while crawling.  As each new link is seen, it 
increments the score of the page it links to.  OPIC is thus much simpler 
and faster to calculate than PageRank.  (It also provides a good 
approximation of PageRank, but it prioritizes crawling better than 
PageRank does.  Crawling driven by an incrementally calculated PageRank 
is not as good as OPIC at reaching high-PageRank pages sooner.)
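As a toy illustration of that incremental behaviour (plain Java, not Nutch 
code), here is OPIC-style cash propagation on a three-page graph; each 
"fetch" splits the page's cash among its outlinks and credits the targets 
immediately, with no whole-graph iteration:

import java.util.HashMap;
import java.util.Map;

/** Toy OPIC-style cash propagation on a three-page graph: A -> B, A -> C, B -> C.
 *  Every page starts with cash 1.0; fetching a page hands its cash to its outlinks. */
public class OpicToySketch {
  public static void main(String[] args) {
    Map<String, String[]> outlinks = new HashMap<String, String[]>();
    outlinks.put("A", new String[] {"B", "C"});
    outlinks.put("B", new String[] {"C"});
    outlinks.put("C", new String[] {});

    Map<String, Double> score = new HashMap<String, Double>();
    for (String p : outlinks.keySet()) score.put(p, 1.0);

    // "Fetch" pages one at a time, as a crawler would.
    for (String page : new String[] {"A", "B"}) {
      double cash = score.get(page);
      String[] links = outlinks.get(page);
      if (links.length == 0) continue;
      double share = cash / links.length;               // outlink_score = page_score / num_outlinks
      for (String target : links) {
        score.put(target, score.get(target) + share);   // increment as each new link is seen
      }
    }
    System.out.println(score);  // e.g. {A=1.0, B=1.5, C=3.0}: C, linked by both, ends up highest
  }
}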


One mitigating factor could be that we already accounted for incoming 
links from other segments when processing those other segments - so our 
initial score already includes the inlink information from other 
segments. But this assumes that we never generate and process more than 
1 segment in parallel, i.e. that we finish updating from all previous 
segments before we update from the current segment (otherwise we 
wouldn't know the updated initial score).


I think all that you're saying is that we should not run two CrawlDB 
updates at once, right?  But there are lots of reasons we cannot do that 
besides the OPIC calculation.


Also, the cash value of those outlinks that point to URLs not in the 
current fetchlist will be dropped, because they won't be collected 
anywhere.


No, every cash value is used.  The input to a crawl db update includes a
CrawlDatum for every known url, including those just linked to.  If the 
only CrawlDatum for a url is a new outlink from a page crawled, then the 
 score for the page is 1.0 + the score of that outlink.


I think a better option would be to add the LinkDB as an input dir to 
CrawlDB.update(), so that we have access to all previously collected 
inlinks.


That would be a lot slower, and it would not compute OPIC.

And a final note: CrawlDB.update() uses the initial score value recorded 
in the segment, and NOT the value that is actually found in CrawlDB at 
the time of the update. This means that if there was another update in 
the meantime, your new score in CrawlDB will be overwritten with the 
score based on an older initial value. This is counter-intuitive - I 
think CrawlDB.update() should always use the latest score value found in 
the current CrawlDB. I.e. in CrawlDBReducer instead of doing:


 result.setScore(result.getScore() + scoreIncrement);

we should do:

 result.setScore(old.getScore() + scoreIncrement);


The change is not quite that simple, since 'old' is sometimes null.  So 
perhaps we need to add a 'score' variable that is set to old.score when 
old != null and to 1.0 otherwise (for newly linked pages).
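For illustration, a minimal standalone sketch of that rule (plain Java, 
standing in for the relevant lines of CrawlDBReducer; not a drop-in patch):

/** Sketch of the proposed scoring rule: always start from the score currently
 *  stored in the CrawlDb ('old'), falling back to 1.0 for newly linked pages,
 *  rather than from the score recorded in the segment. */
public class UpdateScoreSketch {

  static float updatedScore(Float oldScore, float scoreIncrement) {
    // old == null means the URL was only just discovered via an outlink
    float base = (oldScore != null) ? oldScore : 1.0f;
    return base + scoreIncrement;
  }

  public static void main(String[] args) {
    System.out.println(updatedScore(2.5f, 0.4f));  // existing page: 2.9
    System.out.println(updatedScore(null, 0.4f));  // newly linked page: 1.4
  }
}

With this rule, whatever score a concurrent update has already written into 
the CrawlDb is preserved as the base, which is the behaviour Andrzej is 
asking for.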


The reason I didn't do it that way was to permit the Fetcher to modify 
scores, since I was thinking of the Fetcher as the actor whose actions 
are being processed here, and of the CrawlDb as the passive thing acted 
on.  But indeed, if you have another process that's updating a CrawlDb 
while a Fetcher is running, this may not be the case.  So if we want to 
switch things so that the Fetcher is not permitted to adjust scores, 
then this seems like a reasonable change.


Doug


OPIC score calculation issues

2006-02-27 Thread Andrzej Bialecki

Hi all,

Score calculations in trunk/ follow a different model than in 0.7.x. I'd 
like to start a discussion on this - it was a significant change, and my 
feeling is that its consequences are not completely clear. I have some 
doubts myself, and I'd like to be sure what is going on.


A quick background: where Nutch 0.7.x used a variant of PageRank, Nutch 
0.8 uses the so-called On-line Page Importance Computation (OPIC) method. 
The OPIC score calculations are based on the idea that each page is 
initially assigned a certain cash value, and it can distribute parts 
of its cash value to its outgoing links. At the same time, the page 
receives cash value from any incoming links pointing to it. This way we 
can adjust the page importance based on the quality of the pages linked 
to/from it.


In terms of the current Nutch implementation:

* Injector: records the initial cash value of a page when it's injected 
into CrawlDB. The default value is implicitly set to 1.0f (see the 
declaration of the private float score field in CrawlDatum - not too 
smart; IMO this value should be set explicitly by the Injector, based on 
a config property or command-line argument).


* Generator: selects some pages (i.e. CrawlDatum with the initial score) 
and puts them in crawl_generate.


* Fetcher processes these CrawlDatum-s and passes these initial cash 
values down, to be recorded in crawl_parse.


* Parsing (invoked either from Fetcher or from ParseSegment) discovers 
outlinks, and records them using ParseOutputFormat, where each outlink 
ends up under the outlink's target URL in crawl_parse (as 
CrawlDatum.LINKED). It also sets their scores to a fraction of the 
originating page's score (i.e. outlink_score = orig_page_score / 
num_outlinks).


* CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s 
from crawl_parse with the same URL, which means that we get:


   * the original CrawlDatum
   * (optionally a CrawlDatum that contains just a Signature)
   * all CrawlDatum.LINKED entries pointing to our URL, generated by outlinks from other pages.

Based on this information, a new score is calculated by adding the original score and all scores from incoming links.
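To make the arithmetic concrete: if a fetched page has score 2.0 and four 
outlinks, each outlink CrawlDatum carries 2.0 / 4 = 0.5; a URL whose existing 
CrawlDatum has score 1.2 and which receives two such inlinks from this 
segment comes out of the update with 1.2 + 0.5 + 0.5 = 2.2.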

HOWEVER... and here's where I suspect the current code is wrong: since 
we are processing just one segment, the incoming link information is very 
incomplete, because it comes only from the outlinks discovered by 
fetching this segment's fetchlist, and not from the complete LinkDB.


One mitigating factor could be that we already accounted for incoming 
links from other segments when processing those other segments - so our 
initial score already includes the inlink information from other 
segments. But this assumes that we never generate and process more than 
1 segment in parallel, i.e. that we finish updating from all previous 
segments before we update from the current segment (otherwise we 
wouldn't know the updated initial score).


Also, the cash value of those outlinks that point to URLs not in the 
current fetchlist will be dropped, because they won't be collected 
anywhere. LinkDB doesn't store the link cash values, and they are not 
stored in CrawlDB either.


I think a better option would be to add the LinkDB as an input dir to 
CrawlDB.update(), so that we have access to all previously collected 
inlinks. The problem, however, is that we don't keep the contributing 
cash values per inlink... but we could add this value to each Inlink 
when we run LinkDB.invertlinks().


And a final note: CrawlDB.update() uses the initial score value recorded 
in the segment, and NOT the value that is actually found in CrawlDB at 
the time of the update. This means that if there was another update in 
the meantime, your new score in CrawlDB will be overwritten with the 
score based on an older initial value. This is counter-intuitive - I 
think CrawlDB.update() should always use the latest score value found in 
the current CrawlDB. I.e. in CrawlDBReducer instead of doing:


 result.setScore(result.getScore() + scoreIncrement);

we should do:

 result.setScore(old.getScore() + scoreIncrement);

-

I hope all this is not too confusing ... I hope I'm not the one who is 
hopelessly confused ;-)


Any comments, corrections and suggestions appreciated!

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com