[jira] [Commented] (CONNECTORS-104) Make it easier to limit a web crawl to a single site

Steph van Schalkwyk (JIRA) Thu, 23 Aug 2018 15:28:48 -0700


    [ 
https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590882#comment-16590882
 ]


Steph van Schalkwyk commented on CONNECTORS-104:
------------------------------------------------

I'm running into a seeding issue on 2.10:

With the seed
[http://inside.xxx.net/inside/pages/elastic_test/|http://inside.rrd.net/insideRRD/pages/elastic_test/]

the job starts to crawl
[http://inside.rrd.net/inside/pages/|http://inside.rrd.net/insideRRD/pages/]
and seems to ignore the restriction to the last folder.

I tried using these as "include in crawl" and "include in index" filters, but
then nothing gets crawled:

http:\/\/inside.xxx.net\/inside\/pages\/elastic_test\/.*

http:\/\/inside\.xxx\.net\/inside\/pages\/elastic_test\/.*

What am I doing wrong? I know I've deployed this same config to many, many 
sites.
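
For reference, the two filters above can be sanity-checked outside the crawler. The following is only a sketch in Python, not ManifoldCF code: it assumes the include filters are applied as unanchored regular-expression searches against the full URL, and it uses Python's re module in place of the Java regex engine that ManifoldCF runs on (for patterns this simple the two should behave the same). The child page name page1.html and the helper names are made up for illustration.

```python
import re

# URLs modelled on the anonymized ones quoted in the comment.
SEED = "http://inside.xxx.net/inside/pages/elastic_test/"
CHILD = SEED + "page1.html"  # hypothetical page below the seed folder
PARENT = "http://inside.xxx.net/inside/pages/"

# The two include filters exactly as quoted in the comment.
FILTERS = [
    r"http:\/\/inside.xxx.net\/inside\/pages\/elastic_test\/.*",
    r"http:\/\/inside\.xxx\.net\/inside\/pages\/elastic_test\/.*",
]


def matches(url, patterns=FILTERS):
    """Return one boolean per pattern: True if the pattern is found in url.

    Assumption: filters are evaluated as unanchored searches.
    """
    return [re.search(p, url) is not None for p in patterns]


def check():
    """Map each sample URL to its per-filter match results."""
    return {url: matches(url) for url in (SEED, CHILD, PARENT)}


if __name__ == "__main__":
    for url, hits in check().items():
        print(url, hits)
```

Under those assumptions both filters accept the seed and anything below it and reject the parent folder, so the expressions themselves look well-formed. This says nothing about how the job applies them.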

> Make it easier to limit a web crawl to a single site
> ----------------------------------------------------
>
>                 Key: CONNECTORS-104
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-104
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>            Reporter: Jack Krupansky
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 0.1
>
>
> Unless the user explicitly enters an include regex carefully, a web crawl can 
> quickly get out of control and start crawling the entire web when all the 
> user may really want is to crawl just a single web site or portion thereof. 
> So, it would be preferable if either by default or with a simple button the 
> crawl could be limited to the seed web site(s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
