[ 
https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907201#action_12907201
 ] 

Jack Krupansky commented on CONNECTORS-104:
-------------------------------------------

Simple works best. This enhancement is primarily for the simple use case where 
a "novice" user tries to do what they think is obvious ("crawl the web pages at 
this URL") without considering the potential nuances or knowing how to fully 
specify the details of their goal.

One nuance is whether subdomains are considered part of the domain. I would say 
"no" if the user specified a subdomain in the seed URL (crawl only that 
subdomain) and "yes" if no subdomain was specified (crawl the domain and all of 
its subdomains).
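
As a rough sketch of that subdomain rule (not the Web connector's actual 
code; the name host_in_scope is hypothetical, and "a subdomain was specified" 
is approximated by a naive label count, which is an assumption that breaks on 
suffixes such as .co.uk), the check might look like this:

```python
from urllib.parse import urlsplit


def host_in_scope(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url's host belongs to the seed's site."""
    seed_host = urlsplit(seed_url).hostname or ""
    cand_host = urlsplit(candidate_url).hostname or ""
    if not seed_host or not cand_host:
        return False
    # The seed host itself is always in scope.
    if cand_host == seed_host:
        return True
    # ASSUMPTION: a seed host with exactly two labels ("example.com") is a
    # bare domain, i.e. the user did not specify a subdomain. A real
    # implementation would need a public-suffix list to decide this.
    seed_is_bare_domain = seed_host.count(".") == 1
    # Bare domain: accept any subdomain. Specific subdomain: accept nothing else.
    return seed_is_bare_domain and cand_host.endswith("." + seed_host)
```

With this sketch, a seed of http://example.com/ would admit 
http://news.example.com/, while a seed of http://news.example.com/ would not 
admit http://shop.example.com/.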

Another nuance is whether a "path" is specified to select a subset of a domain. 
It would be nice to handle that and (optionally) limit the crawl to that path 
(or sub-paths below it). An example would be to crawl the news archive for a 
site.
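
A minimal sketch of the path restriction, assuming the seed URL's path is 
treated as a directory-style prefix (the name path_in_scope is hypothetical 
and this is not the connector's implementation; it compares paths only and 
leaves the host check to the caller):

```python
from urllib.parse import urlsplit


def path_in_scope(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url's path is the seed path or lies below it."""
    seed_path = urlsplit(seed_url).path or "/"
    cand_path = urlsplit(candidate_url).path or "/"
    # Compare against the seed path with a trailing slash so that a seed of
    # "/news" matches "/news/2010" but not "/newsletter".
    prefix = seed_path if seed_path.endswith("/") else seed_path + "/"
    return cand_path == seed_path or cand_path.startswith(prefix)
```

For the news archive example, a seed of http://example.com/news/archive would 
keep http://example.com/news/archive/2010/09.html in the crawl and drop 
http://example.com/products/.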


> Make it easier to limit a web crawl to a single site
> ----------------------------------------------------
>
>                 Key: CONNECTORS-104
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-104
>             Project: Apache Connectors Framework
>          Issue Type: Improvement
>          Components: Web connector
>            Reporter: Jack Krupansky
>            Priority: Minor
>
> Unless the user explicitly enters an include regex carefully, a web crawl can 
> quickly get out of control and start crawling the entire web when all the 
> user may really want is to crawl just a single web site or portion thereof. 
> So, it would be preferable if either by default or with a simple button the 
> crawl could be limited to the seed web site(s).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
