[
https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432206#comment-13432206
]
Karl Wright commented on CONNECTORS-501:
----------------------------------------
I think I see the scenario where things go wrong. It goes like this:
(1) Imagine (a) -> (b) -> (c)
(2) We take the long route to (b) and the short route to (c), but (c) is still
out of the running (over the hopcount limit) and is deleted
(3) We find a better route to (b), which decreases the hopcount for (c), but
(b) is not recrawled because nothing important has changed, and therefore (c)
is not requeued
One possible fix for this scenario is to recrawl (b) whenever its hopcount
decreases. This, however, would mean a tremendous amount of recrawling to catch
relatively few outlying documents. A subsequent job run might also at least
converge toward the correct count. I'll have to ponder what kind of solution
we can implement and afford for the hopcount feature.
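The scenario above can be reproduced in a toy model. The sketch below is not
ManifoldCF code: the crawl() function, the requeue_on_decrease flag, the
five-link graph, and the limit of 3 hops are all assumptions made up for
illustration. It processes documents last-in-first-out so that the long route
to b is seen before the short one.

```python
def crawl(links, seed, max_hops, requeue_on_decrease):
    """Toy hopcount-limited crawl (hypothetical, for illustration only).

    Returns (hopcount per discovered document, number of fetches).
    """
    hop = {seed: 0}
    stack = [seed]          # LIFO order: deliberately not breadth-first
    pending = {seed}        # documents queued but not yet fetched
    fetches = 0
    while stack:
        doc = stack.pop()
        pending.discard(doc)
        fetches += 1
        for child in links.get(doc, []):
            child_hop = hop[doc] + 1
            if child_hop > max_hops:
                continue    # out of the running: never recorded
            known = hop.get(child)
            if known is None:
                hop[child] = child_hop
                stack.append(child)
                pending.add(child)
            elif child_hop < known:
                hop[child] = child_hop      # better route found
                if requeue_on_decrease and child not in pending:
                    stack.append(child)     # the costly "recrawl (b)" fix
                    pending.add(child)
    return hop, fetches


# Long route s -> x -> y -> b, short route s -> w -> b, then b -> c.
links = {"s": ["w", "x"], "x": ["y"], "y": ["b"], "w": ["b"], "b": ["c"]}

naive_hops, naive_fetches = crawl(links, "s", 3, requeue_on_decrease=False)
fixed_hops, fixed_fetches = crawl(links, "s", 3, requeue_on_decrease=True)
```

In this toy graph the run without requeueing ends with 5 documents and never
finds c, even though b's hopcount has dropped to 2. With requeueing, c is found
at hopcount 3, but the run needs 7 fetches instead of 5 because b is fetched
twice, which is the recrawling cost described above.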
> Medium-scale web crawl with hopcount-based filtering fails to find correct
> number of documents
> ----------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-501
> URL: https://issues.apache.org/jira/browse/CONNECTORS-501
> Project: ManifoldCF
> Issue Type: Bug
> Components: Framework agents process, Web connector
> Affects Versions: ManifoldCF 0.6
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 0.7
>
> Attachments: capture.txt
>
>
> The new web crawler PostgreSQL load test, which uses hopcount-based
> filtering, does not discover all 11110 documents it is supposed to. It only
> discovered 10603 when I ran it just now.