[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716761#comment-16716761 ]
Tim Steenbeke commented on CONNECTORS-1562:
-------------------------------------------

[~kwri...@metacarta.com] I have a URL with the full sitemap that has to be crawled (and a full exclude sitemap). If I use this URL as the seed, do I have to set the hop filters to a specific value (e.g. redirect: 0 and link: 1)? If one or more links are deleted from this sitemap, will the corresponding documents be deleted from ES? How should I set up the job so that only the sites listed in the sitemap are kept?

> Documents unreachable due to hopcount are not considered unreachable on
> cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>        Environment: ManifoldCF 2.11
>                     Elasticsearch 6.3.2
>                     Web input connector
>                     Elastic output connector
>                     Job crawls website input and outputs content to Elasticsearch
>           Reporter: Tim Steenbeke
>           Assignee: Karl Wright
>           Priority: Critical
>             Labels: starter
>  Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
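To make the hop-filter question concrete: a filter such as redirect: 0 and link: 1 means a document is kept only if some path from a seed reaches it using at most 0 redirect hops and at most 1 link hop. The following is a minimal toy sketch of that per-link-type hop counting, written for illustration only; it is not ManifoldCF's actual implementation, and the document/edge names are invented.

```python
from collections import deque

def reachable_docs(seeds, edges, max_hops):
    """Toy model of per-link-type hopcount filtering.

    edges: iterable of (src, dst, link_type) tuples.
    max_hops: dict mapping link_type -> maximum allowed hops.
    Returns the set of documents reachable from the seeds
    without any path exceeding a per-type hop limit.
    """
    outgoing = {}
    for src, dst, link_type in edges:
        outgoing.setdefault(src, []).append((dst, link_type))

    start = tuple(sorted((t, 0) for t in max_hops))
    seen = {(s, start) for s in seeds}
    queue = deque(seen)
    docs = set(seeds)

    while queue:
        doc, counts = queue.popleft()
        counts = dict(counts)
        for dst, link_type in outgoing.get(doc, []):
            nxt = dict(counts)
            nxt[link_type] += 1
            # Prune any path that exceeds a per-type hop limit.
            if any(nxt[t] > max_hops[t] for t in max_hops):
                continue
            state = (dst, tuple(sorted(nxt.items())))
            if state not in seen:
                seen.add(state)
                docs.add(dst)
                queue.append(state)
    return docs

# Hypothetical link graph: with redirect: 0 and link: 1, only the
# seed and pages one link-hop away (with no redirects) survive.
edges = [
    ("sitemap", "page_a", "link"),
    ("page_a", "page_b", "link"),        # two link hops: excluded
    ("sitemap", "old_url", "redirect"),  # one redirect hop: excluded
]
print(reachable_docs({"sitemap"}, edges, {"link": 1, "redirect": 0}))
```

Under this model, deleting a link from the sitemap makes the target document unreachable on the next crawl, which is exactly the case the bug report says is not being cleaned up in the Elasticsearch index.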