[
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723981#comment-16723981
]
Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 11:52 AM:
----------------------------------------------------------------------
[[email protected]] - So then, using the sitemap URL as the seed:
we create a job with 1 seed, which is the full sitemap (29000+ URLs), with ES
output, web input, and a hop count of 1 for links and 0 for redirects:
# run job
# roughly 29000 documents get pushed to ES
# sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# wait till scheduled time
# run job
# documents get added/deleted (e.g. 10 documents deleted)
# wait till scheduled time
# ...
Last time we tried this, ManifoldCF started behaving strangely because of the
number of URLs/links in the sitemap page.
(sitemap URL:
[https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en&html=true])
(blacklist URLs:
[https://www.uantwerpen.be/admin/system/sitemap/sitemap_revokes.aspx?lang=en&html=true])
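To make the expected outcome of each scheduled run concrete, here is a minimal Python sketch of the add/delete bookkeeping described in the steps above. It is not ManifoldCF code: `plan_sync` and the `example.org` URLs are hypothetical names used only for illustration, and the sketch assumes the set of indexed documents should simply mirror the current sitemap.

```python
def plan_sync(indexed_urls, sitemap_urls):
    """Return (to_add, to_delete) for one scheduled run.

    Hypothetical helper: assumes the index should contain exactly
    the URLs currently listed in the sitemap.
    """
    indexed = set(indexed_urls)
    current = set(sitemap_urls)
    to_add = sorted(current - indexed)      # in the sitemap, not yet indexed
    to_delete = sorted(indexed - current)   # indexed, no longer in the sitemap
    return to_add, to_delete


# First run: nothing is indexed yet, the sitemap lists 29000 URLs.
first_sitemap = [f"https://example.org/page/{i}" for i in range(29000)]
first_add, first_delete = plan_sync([], first_sitemap)

# Second run: the sitemap shrank to 28990 URLs.
second_sitemap = first_sitemap[:-10]
second_add, second_delete = plan_sync(first_sitemap, second_sitemap)

print(len(first_add), len(first_delete))
print(len(second_add), len(second_delete))
```

Under that assumption, the second run should add nothing and delete exactly the 10 documents that dropped out of the sitemap.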
> Documents unreachable due to hopcount are not considered unreachable on
> cleanup pass
> ------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init,
> manifoldcf.log.reduced
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the
> job with changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.
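
For readers unfamiliar with hop-count limits, the following Python sketch illustrates the behavior the issue title asks for: a document that can only be reached by exceeding the hop count should be treated as unreachable when cleaning up. This is an illustration under stated assumptions, not ManifoldCF's implementation; `reachable_within`, `cleanup_candidates`, and the toy link graph are all hypothetical.

```python
from collections import deque


def reachable_within(links, seeds, max_hops):
    """Breadth-first walk: every URL within max_hops links of a seed."""
    hops_to = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops_to[url] == max_hops:
            continue  # hop budget used up, do not follow links from here
        for target in links.get(url, ()):
            if target not in hops_to:
                hops_to[target] = hops_to[url] + 1
                queue.append(target)
    return set(hops_to)


def cleanup_candidates(indexed, links, seeds, max_hops):
    """Indexed documents that are no longer reachable within the hop limit."""
    return set(indexed) - reachable_within(links, seeds, max_hops)


# Toy graph: the sitemap links to a and b; a links on to c.
links = {"sitemap": ["a", "b"], "a": ["c"]}
indexed = {"a", "b", "c", "d"}

# With a hop count of 1, c is two hops away and d is not linked at all.
stale = cleanup_candidates(indexed, links, ["sitemap"], max_hops=1)
print(sorted(stale))
```

In this toy graph both `c` (beyond the hop limit) and `d` (no longer linked) end up as cleanup candidates, which is the outcome the report expects for documents dropped from the sitemap.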