RE: A CF limitation in building a spider?

Scott, Andrew Thu, 23 Nov 2000 22:07:37 -0800
Hmmm, spider logic is always changing... However, if you got a page like
that back, you would already be parsing it for hrefs and the like to find
other links to follow.

Why couldn't you just add another condition to check whether the page is
being refreshed by any of the known methods?
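
As a rough illustration of that extra condition, here is a minimal sketch in
Python rather than CFML. It assumes the "known method" in question is a
<meta http-equiv="refresh"> tag; JavaScript redirects are not covered, and the
names MetaRefreshFinder and find_meta_refresh are made up for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Collects the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=http://example.com/".
        content = attrs.get("content") or ""
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)", content, re.I)
        if match:
            self.target = match.group(1).strip()


def find_meta_refresh(html, base_url):
    """Return the absolute refresh target, or None if the page has none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return urljoin(base_url, finder.target) if finder.target else None
```

A spider could call find_meta_refresh on each fetched page and, when it
returns a URL, queue that URL instead of treating the page as a dead end.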


regards

Andrew Scott
Senior Cold Fusion Application Developer
ANZ eCommerce Centre
* Ph 9273 0693  
* [EMAIL PROTECTED]


-----Original Message-----
From: Michael Thomas [mailto:[EMAIL PROTECTED]]
Sent: 23 November 2000 17:11
To: CF-Talk
Subject: Re: A CF limitation in building a spider?


I've noticed one huge limitation that makes CF a poor choice for building a
spider. CFHTTP doesn't like URL pointers to another domain, for example when
http://www.thisdomain.com redirects to http://www.thisotherdomain.com. When
testing whether a domain is valid, a URL pointer will be counted as a *Dead*
link: CFHTTP fails because it didn't find a site, even though there really
is one. The problem has nothing to do with timing out, either.
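
One way around this, sketched in Python rather than CFML, is to follow the
redirect by hand and judge the link by where it finally lands. This is only a
sketch under assumptions: fetch is a hypothetical callable that returns
(status, headers) without following redirects itself, header names are assumed
to be lower case, and the hop limit of 5 is arbitrary.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_link(url, fetch, max_hops=5):
    """Follow Location headers and report where the link finally lands.

    `fetch` is a hypothetical callable: fetch(url) -> (status, headers),
    with header names in lower case. Returns (final_url, status), or
    (last_url, None) if the redirect chain is longer than max_hops.
    """
    for _ in range(max_hops + 1):
        status, headers = fetch(url)
        location = headers.get("location")
        if status in REDIRECT_CODES and location:
            # Location may be relative, and may point at another domain.
            url = urljoin(url, location)
            continue
        return url, status
    return url, None


def is_dead(url, fetch):
    """A link counts as dead only if its final destination fails."""
    _, status = resolve_link(url, fetch)
    return status is None or status >= 400
```

With this approach, http://www.thisdomain.com would be reported as alive as
long as http://www.thisotherdomain.com answers, and a redirect loop is
reported as dead instead of hanging the spider.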

Timing out is also an issue to be concerned with. It seems some sites just
won't cooperate with CFHTTP. To bring up another matter: have you ever seen a
page fetched with RESOLVEURL="yes" come back with all of its links intact? I
can't say I have.

Those are just a couple of issues I've had to deal with that I thought you
might want to consider beforehand. Good luck.

Sincerely,
Mike

>From: James Sleeman <[EMAIL PROTECTED]>
>Reply-To: [EMAIL PROTECTED]
>To: CF-Talk <[EMAIL PROTECTED]>
>Subject: Re: A CF limitation in building a spider?
>Date: Thu, 23 Nov 2000 18:06:21 +1300
>
>---Reply to mail from Phill Gibson about A CF limitation in building a spider?
>
> > Does anyone know of a better way to do the recursive call to cfhttp? Any
> > way I see it, you are still calling one page, and it eventually times out.
>
>Others have pointed out that CF is not a great tool for the task, but I'll
>point out a way that it could be done.  Using CFSCHEDULE you could
>schedule URLs to be explored, something like this...
>
>     explore.cfm takes a URL to explore
>     (http://my.server/explore.cfm?URL=...)
>         grabs the URL (CFHTTP)
>         indexes the page
>         pulls out the URLs found in the page
>         schedules explore.cfm to run with each URL collected (CFSCHEDULE)
>         finished
>
>Not only is it not directly recursive and unlikely to time out (each
>request only grabs and indexes one URL), but it would conceivably (depending
>on your scheduling algorithm) explore URLs in parallel.  You would need to be
>careful with the scheduling algorithm, though, so as not to bomb your server
>with too many simultaneous scheduled tasks.
>
>
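The non-recursive idea above can be sketched with a plain work queue. This is
Python rather than CFML, and it runs sequentially in one process instead of
using CFSCHEDULE, so it only illustrates the control flow. The names crawl and
fetch_links are hypothetical, and the max_pages cap is an assumed stand-in for
a scheduling limit that keeps the server from being flooded.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=100):
    """Explore URLs one at a time from a queue instead of recursing.

    `fetch_links` is a hypothetical callable: fetch_links(url) -> list of
    URLs found on that page (the "grab, index, pull out URLs" step).
    Returns the URLs in the order they were explored.
    """
    pending = deque([start_url])  # URLs "scheduled" but not yet explored
    seen = {start_url}            # never schedule the same URL twice
    explored = []
    while pending and len(explored) < max_pages:
        url = pending.popleft()
        explored.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return explored
```

Each pass through the loop handles exactly one URL, which mirrors one run of
explore.cfm; the seen set keeps pages that link to each other from being
scheduled forever.
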
>---
>James Sleeman
>[EMAIL PROTECTED] (home)
>[EMAIL PROTECTED] (work)
>
>
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>Structure your ColdFusion code with Fusebox. Get the official book at 
>http://www.fusionauthority.com/bkinfo.cfm
>
>Archives: http://www.mail-archive.com/[email protected]/
>Unsubscribe: http://www.houseoffusion.com/index.cfm?sidebar=lists

