[
https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590882#comment-16590882
]
Steph van Schalkwyk commented on CONNECTORS-104:
------------------------------------------------
I'm running into a seeding issue on 2.10.
The seed
[http://inside.xxx.net/inside/pages/elastic_test/|http://inside.rrd.net/insideRRD/pages/elastic_test/]
ends up crawling
[http://inside.xxx.net/inside/pages/|http://inside.rrd.net/insideRRD/pages/]
and seems to ignore the restriction to the last folder (elastic_test).
I tried using the following as "include in crawl" and "include in index"
filters, but then nothing gets crawled:
http:\/\/inside.xxx.net\/inside\/pages\/elastic_test\/.*
http:\/\/inside\.xxx\.net\/inside\/pages\/elastic_test\/.*
What am I doing wrong? I know I've deployed this same config to many, many
sites.
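
For anyone who wants to rule out the expressions themselves, here is a minimal standalone sketch that runs the two filters above against the seed URL and its parent folder using plain java.util.regex. The class and method names are hypothetical, and it assumes the filters are applied as "find anywhere in the URL" matches; how the web connector actually evaluates them is not verified here.

```java
import java.util.regex.Pattern;

public class SeedFilterCheck {
    // Hypothetical helper: true if the regex is found anywhere in the URL.
    // Assumption: a find-style match; the connector's real logic may differ.
    static boolean included(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String seed = "http://inside.xxx.net/inside/pages/elastic_test/";
        String parent = "http://inside.xxx.net/inside/pages/";
        // The two filters from the comment, written as Java string literals
        // (each backslash is doubled for the Java compiler).
        String[] filters = {
            "http:\\/\\/inside.xxx.net\\/inside\\/pages\\/elastic_test\\/.*",
            "http:\\/\\/inside\\.xxx\\.net\\/inside\\/pages\\/elastic_test\\/.*"
        };
        for (String filter : filters) {
            System.out.println(filter
                + " seed=" + included(filter, seed)
                + " parent=" + included(filter, parent));
        }
    }
}
```

Under these assumptions both expressions match the seed URL and neither matches the parent folder, so the expressions themselves are not obviously the reason nothing is crawled. Note that the check uses the host and path exactly as written in the filters; a URL with a different host or path would not match.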
> Make it easier to limit a web crawl to a single site
> ----------------------------------------------------
>
> Key: CONNECTORS-104
> URL: https://issues.apache.org/jira/browse/CONNECTORS-104
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Web connector
> Reporter: Jack Krupansky
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 0.1
>
>
> Unless the user carefully enters an explicit include regex, a web crawl can
> quickly get out of control and start crawling the entire web when all the
> user really wants is to crawl a single web site or a portion of it.
> So it would be preferable if, either by default or with a simple button, the
> crawl could be limited to the seed web site(s).
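
As a rough illustration of the idea in the description, the sketch below derives an include regex from a seed URL so that only URLs on the same scheme and host match. The class and method names are hypothetical and this is not taken from ManifoldCF; it only shows one way such a restriction could be built with the standard java.net.URI and java.util.regex classes.

```java
import java.net.URI;
import java.util.regex.Pattern;

public class SeedSiteRegex {
    // Hypothetical helper: build a regex that matches only URLs sharing the
    // seed's scheme and authority (host, plus port if one is present).
    static String siteRegex(String seedUrl) {
        URI seed = URI.create(seedUrl);
        String prefix = seed.getScheme() + "://" + seed.getAuthority();
        // Pattern.quote escapes dots and other metacharacters in the prefix.
        // After the host, allow either the end of the URL or a path.
        return "^" + Pattern.quote(prefix) + "(/.*)?$";
    }

    // True if the whole URL matches the site regex derived from the seed.
    static boolean sameSite(String seedUrl, String candidateUrl) {
        return Pattern.compile(siteRegex(seedUrl)).matcher(candidateUrl).matches();
    }

    public static void main(String[] args) {
        String seed = "http://inside.xxx.net/inside/pages/elastic_test/";
        System.out.println(sameSite(seed, "http://inside.xxx.net/inside/pages/"));
        System.out.println(sameSite(seed, "http://other.example.com/"));
    }
}
```

Anchoring the pattern after the host means a look-alike host that merely begins with the seed's host name is rejected. Restricting the crawl to a portion of the site would need the seed's path added to the quoted prefix.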