[ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826727#comment-16826727 ]
Donald Van den Driessche commented on CONNECTORS-1602:
------------------------------------------------------

Thanks. I know it runs continuously, but I'm wondering what happens when the recrawl timestamp is reached for documents. Will it first recrawl and then continue crawling, or continue crawling and then do the recrawl, or crawl and recrawl simultaneously? The last option might slow down the crawling speed.

> Continuous crawling doesn't recrawl everything
> ----------------------------------------------
>
>                 Key: CONNECTORS-1602
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Donald Van den Driessche
>            Priority: Major
>
> When crawling a website in continuous crawling mode we saw that not all
> documents are recrawled.
> The site is quite extensive. We figured out that after crawling, a
> document/page gets a recrawl timestamp in between the recrawl interval and
> max recrawl interval.
> But if these timestamps are reached during the first crawl, Manifold starts
> recrawling those documents, but seems to ignore the rest of the website. Also,
> sometimes documents get recrawled 5 times while others don't get recrawled at
> all, apparently due to the same issue.
>
> Is it possible to shed a bit more light on the continuous crawling?
> Is it a good system to use for crawling an (extensive) website?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)