[jira] [Commented] (CONNECTORS-501) Medium-scale web crawl with hopcount-based filtering fails to find correct number of documents

Karl Wright (JIRA) Thu, 09 Aug 2012 14:59:22 -0700

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432206#comment-13432206
 ]


Karl Wright commented on CONNECTORS-501:
----------------------------------------

I think I see the scenario where things go wrong.  It goes like this:

(1) Imagine a link chain (a) -> (b) -> (c)
(2) We take the long route to (b) and the short route to (c), but (c) is still 
out of the running (its hopcount exceeds the limit) and is deleted
(3) We then find a better route to (b), which decreases the hopcount for (c), 
but (b) is not recrawled because nothing important has changed, and therefore 
(c) is not requeued
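
The scenario above can be reproduced in a toy model. The sketch below is 
not ManifoldCF code: the link graph, the LIFO processing order, and the names 
LINKS, true_hopcounts and naive_crawl are all assumptions made for 
illustration. It only shows how a document within the hopcount limit can be 
missed when its parent is first reached by a long route and is not processed 
again after a shorter route turns up.

```python
from collections import deque

# Hypothetical link graph: two routes lead to "b"; "c" is reachable only via "b".
LINKS = {
    "seed": ["y", "x"],
    "x": ["x2"],
    "x2": ["b"],   # long route:  seed -> x -> x2 -> b (3 hops)
    "y": ["b"],    # short route: seed -> y -> b      (2 hops)
    "b": ["c"],
    "c": [],
}


def true_hopcounts(links, seed):
    """Breadth-first search: the shortest hopcount of every reachable document."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        doc = queue.popleft()
        for target in links[doc]:
            if target not in hops:
                hops[target] = hops[doc] + 1
                queue.append(target)
    return hops


def naive_crawl(links, seed, max_hops):
    """Process each document at most once, in LIFO order, never requeueing."""
    hops = {seed: 0}
    processed = set()
    stack = [seed]
    while stack:
        doc = stack.pop()
        if doc in processed:
            continue
        processed.add(doc)
        for target in links[doc]:
            new_hops = hops[doc] + 1
            if new_hops < hops.get(target, float("inf")):
                hops[target] = new_hops  # the recorded hopcount improves
            # An already processed target is not fetched again, so its
            # children never benefit from the improved hopcount.
            if target not in processed and hops[target] <= max_hops:
                stack.append(target)
    return set(processed)
```

With a limit of 3 hops, true_hopcounts(LINKS, "seed") places "c" at hopcount 
3, so it belongs in the result, yet naive_crawl(LINKS, "seed", 3) returns only 
seed, x, x2, b and y. In this model "b" is processed at hopcount 3, "c" is 
rejected at hopcount 4, and the later discovery of the route through "y" 
changes the hopcount of "b" without causing "c" to be looked at again.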

One possible fix for this scenario involves reprocessing (b) whenever its 
hopcount decreases.  This, however, would mean a tremendous amount of 
recrawling to catch only a few outlying documents.  A subsequent job run might 
also at least converge towards the proper number.  I'll have to ponder what 
kind of solution we can implement and afford for the hopcount feature.
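
To make the cost of the first option concrete, here is a minimal sketch of 
"requeue a document whenever its hopcount decreases" on a toy graph. It is an 
illustration only, not ManifoldCF's queueing code; LINKS, crawl_with_requeue 
and the LIFO processing order are hypothetical. The function counts fetches so 
that the extra recrawling is visible.

```python
# Hypothetical link graph: two routes lead to "b"; "c" is reachable only via "b".
LINKS = {
    "seed": ["y", "x"],
    "x": ["x2"],
    "x2": ["b"],   # long route:  seed -> x -> x2 -> b (3 hops)
    "y": ["b"],    # short route: seed -> y -> b      (2 hops)
    "b": ["c"],
    "c": [],
}


def crawl_with_requeue(links, seed, max_hops):
    """Requeue a document every time a strictly shorter route to it is found.

    Returns the set of fetched documents and the total number of fetches,
    which exceeds the number of documents whenever something is recrawled.
    """
    hops = {seed: 0}
    fetched = set()
    fetch_count = 0
    stack = [seed]
    while stack:
        doc = stack.pop()
        fetched.add(doc)
        fetch_count += 1
        for target in links[doc]:
            new_hops = hops[doc] + 1
            if new_hops < hops.get(target, float("inf")):
                hops[target] = new_hops
                # Queue on first discovery and on every later improvement,
                # even if the target has already been fetched.
                if new_hops <= max_hops:
                    stack.append(target)
    return fetched, fetch_count
```

On this graph with a limit of 3 hops, the crawl reaches all six documents, 
including "c", at the price of seven fetches, because "b" is fetched twice. 
The loop terminates because a document is queued only when its recorded 
hopcount strictly decreases. On a real crawl every such decrease can cascade 
to the document's descendants, which is where the recrawling cost would come 
from.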

                
> Medium-scale web crawl with hopcount-based filtering fails to find correct 
> number of documents
> ----------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-501
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-501
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Framework agents process, Web connector
>    Affects Versions: ManifoldCF 0.6
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 0.7
>
>         Attachments: capture.txt
>
>
> The new web crawler Postgresql load test, which uses hopcount-based 
> filtering, does not discover all 11110 documents it is supposed to.  It only 
> discovered 10603 when I ran it just now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
