[
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723782#comment-16723782
]
Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 7:48 AM:
---------------------------------------------------------------------
[[email protected]]
If we update to ManifoldCF 2.12, can we then use the seedmap as we originally
intended?
So we create a job with X seeds, ES output, web input and HopCount 0 for links
and redirects:
# Put X seeds in seedmap
# run job
# X documents get pushed to ES
# update job to have X minus 20 seeds
# wait till scheduled time
# run job
# 20 documents get deleted from ES
# X minus 20 documents get updated
# wait till scheduled time
# ...
Will it work like this?
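To make the expected outcome concrete, here is a minimal sketch of the bookkeeping described in the steps above. It is not ManifoldCF code and does not call any ManifoldCF or Elasticsearch API; the function name and the example.com seed URLs are made up for illustration. It assumes that with HopCount 0 each seed maps to exactly one indexed document.

```python
def expected_index_changes(previous_seeds, current_seeds):
    """Compare two seed lists and report what we would expect in the index.

    Assumption (hypothetical model, not ManifoldCF behaviour): with a hop
    count of 0, every seed yields exactly one document in the output index.
    """
    previous = set(previous_seeds)
    current = set(current_seeds)
    return {
        # seeds dropped from the job: their documents should be deleted
        "delete": sorted(previous - current),
        # seeds present in both runs: their documents should be updated
        "update": sorted(previous & current),
        # seeds that are new in this run: their documents should be added
        "add": sorted(current - previous),
    }


# Hypothetical example with X = 100 seeds, then 20 seeds removed.
first_run = [f"https://example.com/page{i}" for i in range(100)]
second_run = first_run[:-20]

changes = expected_index_changes(first_run, second_run)
print(len(changes["delete"]), len(changes["update"]), len(changes["add"]))
```

For X = 100 this gives 20 documents to delete, 80 to update and none to add, which is the result we hope to see in ES after the second run.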
> Documents unreachable due to hopcount are not considered unreachable on
> cleanup pass
> ------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init,
> manifoldcf.log.reduced
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the
> job with changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.