[ https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826707#comment-16826707 ]
Donald Van den Driessche commented on CONNECTORS-1602:
------------------------------------------------------

Ok, thanks. That already clears some things up.

How does Manifold know that a document doesn't change that often if it isn't crawled?

If a full crawl takes about 8 hours, but you make your recrawl intervals shorter than that, will it start recrawling before the job has completed a full run? And if so, might that interfere with the termination of the job, so that it never gets to a full run?

> Continuous crawling doesn't recrawl everything
> ----------------------------------------------
>
>                 Key: CONNECTORS-1602
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Donald Van den Driessche
>            Priority: Major
>
> When crawling a website in continuous crawling mode we saw that not all
> documents are recrawled.
> The site is quite extensive. We figured out that after crawling, a
> document/page gets a recrawl timestamp in between the recrawl interval and
> the max recrawl interval.
> But if these timestamps fall within the duration of the first crawl, Manifold
> starts recrawling those documents, but seems to ignore the rest of the website.
> Also, sometimes documents get recrawled 5 times while others don't get
> recrawled at all, apparently due to the same issue.
>
> Is it possible to shed a bit more light on continuous crawling?
> Is it a good system to use for crawling an (extensive) website?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)