[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

2010-09-08 Thread Karl Wright (JIRA)

[ https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907203#action_12907203 ]

Karl Wright commented on CONNECTORS-104:


For someone who is purportedly trying to make things simpler, you have 
specified a rather complex set of rules, many of which seem of questionable 
utility to me.

Since this is basically just a shortcut, I propose a simple feature that 
limits all URLs to hosts that are explicitly mentioned in the seeds.
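
As an illustration only, the proposal can be sketched in a few lines of Python. This is not the connector's code: the function names `seed_hosts` and `allowed` and the example URLs are made up for the sketch, and it assumes that "host" means the exact host name of each seed URL, compared case-insensitively.

```python
from urllib.parse import urlsplit


def seed_hosts(seeds):
    """Collect the host name of every seed URL (hypothetical helper)."""
    hosts = set()
    for seed in seeds:
        host = urlsplit(seed).hostname  # urlsplit lowercases the host name
        if host:
            hosts.add(host)
    return hosts


def allowed(url, hosts):
    """Return True only if the URL's host is one of the seed hosts."""
    host = urlsplit(url).hostname
    return host is not None and host in hosts


# Example seeds, invented for this sketch.
hosts = seed_hosts(["http://example.com/news/", "https://docs.example.org/"])

print(allowed("http://example.com/about.html", hosts))  # True
print(allowed("http://www.example.com/", hosts))        # False
```

Because the check is a set lookup, its cost does not grow with the number of seeds.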


> Make it easier to limit a web crawl to a single site
> 
>
> Key: CONNECTORS-104
> URL: https://issues.apache.org/jira/browse/CONNECTORS-104
> Project: Apache Connectors Framework
> Issue Type: Improvement
> Components: Web connector
> Reporter: Jack Krupansky
> Priority: Minor
>
> Unless the user explicitly enters an include regex carefully, a web crawl can 
> quickly get out of control and start crawling the entire web when all the 
> user may really want is to crawl just a single web site or portion thereof. 
> So, it would be preferable if either by default or with a simple button the 
> crawl could be limited to the seed web site(s).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

2010-09-08 Thread Jack Krupansky (JIRA)

[ https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907201#action_12907201 ]

Jack Krupansky commented on CONNECTORS-104:

Simple works best. This enhancement is primarily for the simple use case where 
a "novice" user tries to do what they think is obvious ("crawl the web pages at 
this URL"), but without considering all of the potential nuances or how to 
fully specify the details of their goal.

One nuance is whether subdomains are considered part of the domain. I would say 
"no" if a subdomain was specified by the user and "yes" if no subdomain was 
specified.
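
A rough sketch of this subdomain rule, in Python, might look like the following. It rests on a loud assumption: a seed host "has a subdomain" when it has more than two dot-separated labels. That is naive (it misjudges hosts under suffixes such as co.uk), and a real implementation would need a public-suffix list. The names `seed_has_subdomain` and `host_in_scope` are hypothetical.

```python
from urllib.parse import urlsplit


def seed_has_subdomain(host):
    # ASSUMPTION: a bare domain has exactly two labels ("example.com").
    # This is wrong for suffixes like "co.uk"; it only illustrates the rule.
    return len(host.split(".")) > 2


def host_in_scope(url, seed):
    """Apply the rule: subdomains count only if the seed named none."""
    seed_host = urlsplit(seed).hostname
    host = urlsplit(url).hostname
    if not seed_host or not host:
        return False
    if host == seed_host:
        return True
    if seed_has_subdomain(seed_host):
        # The user named a specific subdomain, so stay on exactly that host.
        return False
    # The seed is a bare domain, so accept any host underneath it.
    return host.endswith("." + seed_host)


print(host_in_scope("http://www.example.com/a", "http://example.com/"))      # True
print(host_in_scope("http://blog.example.com/", "http://www.example.com/"))  # False
```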

Another nuance is whether a "path" is specified to select a subset of a domain. 
It would be nice to handle that and (optionally) limit the crawl to that path 
(or sub-paths below it). An example would be to crawl the news archive for a 
site.
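
The path restriction could be sketched like this, again in Python and again only as an illustration. It assumes the seed path is treated as a directory-style prefix, so that a seed of /news admits /news/2010/ but not /newsletter. The function name `path_in_scope` and the URLs are invented for the example.

```python
from urllib.parse import urlsplit


def path_in_scope(url, seed):
    """True if url is on the seed's host and at or below the seed's path."""
    s, u = urlsplit(seed), urlsplit(url)
    if s.hostname is None or u.hostname != s.hostname:
        return False
    # Normalize the seed path to a base without a trailing slash,
    # so "/news" and "/news/" behave the same way.
    base = (s.path or "/").rstrip("/")
    path = u.path or "/"
    return path == base or path.startswith(base + "/")


seed = "http://example.com/news/"
print(path_in_scope("http://example.com/news/2010/09/item.html", seed))  # True
print(path_in_scope("http://example.com/newsletter", seed))              # False
```

Matching on a whole path segment rather than a raw string prefix is a design choice here; a plain prefix test would also admit /newsletter.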





[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

2010-09-08 Thread Karl Wright (JIRA)

[ https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907154#action_12907154 ]

Karl Wright commented on CONNECTORS-104:


Trying to limit to the seed domains automatically would, I think, cause more 
confusion than help.  I can, however, imagine introducing a checkbox on the 
"Inclusions" tab that, if checked, would limit the crawl to just the domains 
represented by the seeds, and even making it checked by default.  The implied 
regular expression would be:

^https?://<seed host>([/?]|$)

for each seed, I believe.  (That's potentially a lot of regular expressions if 
the number of seeds is large, so obviously the logic wouldn't be using regexps 
in practice.)
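
One reading of the per-seed pattern above can be sketched in Python. The sketch assumes the seed's host name belongs between "://" and the trailing alternatives, and that the intent is "http or https, then the host, then a slash, a question mark, or the end of the URL". The helper name `seed_pattern` is hypothetical, and URLs with an explicit port or user info are deliberately not handled.

```python
import re
from urllib.parse import urlsplit


def seed_pattern(seed):
    """Build the implied inclusion regex for one seed URL (illustrative)."""
    host = urlsplit(seed).hostname
    if host is None:
        raise ValueError("seed has no host: %r" % seed)
    # re.escape keeps the dots in the host name from matching any character.
    return re.compile(
        r"^https?://" + re.escape(host) + r"(?:[/?]|$)", re.IGNORECASE
    )


pattern = seed_pattern("http://example.com/news/")
print(bool(pattern.match("https://example.com/index.html")))  # True
print(bool(pattern.match("http://example.com.evil.net/")))    # False
```

With many seeds this means one compiled pattern per seed, which is the cost the comment alludes to when it says the real logic would not use regular expressions.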

