[ https://issues.apache.org/jira/browse/CONNECTORS-501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430506#comment-13430506 ]
Karl Wright commented on CONNECTORS-501:
----------------------------------------
Here's a potential race:
(1) There are two paths to a document, one longer and one shorter.
(2) The first worker thread picks up the document after the longer path has
queued it, and decides to delete it.
(3) Before the document is deleted, however, the shorter path is evaluated in a
different thread, which tries to queue the document.
(4) The first thread deletes the document anyway.
We had a similar race condition with carrydown data, and fixed it by detecting
the potential conflict (in that case by noting a change in the carrydown
information the document would see, plus the document being in the "active"
state). I think we need to do something similar for hopcount.
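
As a rough illustration of that kind of conflict detection (not the actual
ManifoldCF code), the sketch below replays steps (1)-(4) on a single thread so
the interleaving is deterministic. Every name in it (HopcountRaceSketch,
DocEntry, noteReference, finishWithDelete, MAX_HOPS) is hypothetical, and it
assumes the delete decision comes from comparing the hopcount against a limit.
The idea is only that a shorter path arriving while the document is "active"
sets a flag, and the delete is refused and the document requeued when that flag
is set.

```java
/** Hypothetical sketch; none of these names exist in ManifoldCF. */
public class HopcountRaceSketch {
    enum State { PENDING, ACTIVE, DELETED }

    /** Assumed hopcount limit used by the filter in this sketch. */
    static final int MAX_HOPS = 3;

    /** A queue entry that remembers whether its hopcount changed while active. */
    static final class DocEntry {
        private State state = State.PENDING;
        private int bestHopcount;               // shortest path seen so far
        private boolean changedWhileActive = false;

        DocEntry(int hopcount) { this.bestHopcount = hopcount; }

        /** A worker picks the document up and reads the hopcount it will act on. */
        synchronized int beginProcessing() {
            state = State.ACTIVE;
            changedWhileActive = false;
            return bestHopcount;
        }

        /** Another thread found a path to this document and tries to queue it. */
        synchronized void noteReference(int hopcount) {
            if (hopcount < bestHopcount) {
                bestHopcount = hopcount;
                if (state == State.ACTIVE) {
                    changedWhileActive = true;  // conflict: the worker's view is stale
                }
            }
        }

        /** Deletes only if nothing changed while active; otherwise requeues. */
        synchronized boolean finishWithDelete() {
            if (changedWhileActive) {
                state = State.PENDING;          // re-evaluate with the new hopcount
                changedWhileActive = false;
                return false;
            }
            state = State.DELETED;
            return true;
        }

        synchronized State state() { return state; }
    }

    /** Replays steps (1)-(4) in order and returns the document's final state. */
    static State replayRace() {
        DocEntry doc = new DocEntry(5);          // (1) queued via the longer path
        int seen = doc.beginProcessing();        // (2) first worker picks it up
        boolean wantsDelete = seen > MAX_HOPS;   //     and decides to delete it
        doc.noteReference(2);                    // (3) shorter path tries to queue it
        if (wantsDelete) {
            doc.finishWithDelete();              // (4) delete refused, doc requeued
        }
        return doc.state();
    }

    public static void main(String[] args) {
        System.out.println(replayRace());        // prints PENDING
    }
}
```

Without the changedWhileActive check, step (4) would leave the document in the
DELETED state even though the shorter path is within the limit, which would
account for documents going missing from the crawl.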
> Medium-scale web crawl with hopcount-based filtering fails to find correct
> number of documents
> ----------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-501
> URL: https://issues.apache.org/jira/browse/CONNECTORS-501
> Project: ManifoldCF
> Issue Type: Bug
> Components: Framework agents process, Web connector
> Affects Versions: ManifoldCF 0.6
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 0.7
>
>
> The new web crawler PostgreSQL load test, which uses hopcount-based
> filtering, does not discover all 11110 documents it is supposed to. It
> discovered only 10603 when I ran it just now.