Use regular expressions! Also, there's no reason to download image files and the like and search them for links: their content is binary, so skip them. After I rewrote your function to use regular expressions and ignore images, it ran about 3x faster on average at 3 levels deep. The savings should grow with the depth of the spider, since each extra level adds more files that get pruned.
Attached is a CFC that contains a modified version of your function. To use
it, initialize it and say go!
<cfset spider = createObject("component",
"Spider").init("http://www.yoursite.com:80", 3)>
<cfset results = spider.get()>
<cfdump var="#results#">
This requires CF7. If you don't have CF7, replace
"local.httpResult.fileContent" with "cfhttp.fileContent" and remove
result="local.httpResult" from the cfhttp tag.
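For anyone reading without the attachment, here is a minimal sketch of the two ideas (regex link extraction and skipping binary files). It is not the code in Spider.cfc; the function names extractLinks and isBinaryLink, the extension list, and the URL pattern are all assumptions made up for illustration. Unlike the original function, it also matches https:// links.

```cfml
<!--- Hypothetical helper: true if the link looks like a binary file. --->
<cffunction name="isBinaryLink" returntype="boolean" output="false">
    <cfargument name="link" type="string" required="true">
    <!--- Assumed extension list; adjust it for your site. --->
    <cfreturn reFindNoCase("\.(gif|jpe?g|png|bmp|ico|swf|pdf|zip)(\?.*)?$", arguments.link) gt 0>
</cffunction>

<!--- Hypothetical helper: returns every absolute, non-binary URL found in the content. --->
<cffunction name="extractLinks" returntype="array" output="false">
    <cfargument name="content" type="string" required="true">
    <cfset var local = structNew()>
    <cfscript>
    local.links = arrayNew(1);
    local.pos = 1;
    while (local.pos lte len(arguments.content)) {
        // Match an absolute URL up to the next whitespace, quote, or angle bracket.
        local.match = reFindNoCase("https?://[^[:space:]""'<>]+", arguments.content, local.pos, true);
        if (local.match.pos[1] eq 0) break; // no more links
        local.link = mid(arguments.content, local.match.pos[1], local.match.len[1]);
        // Keep only links worth fetching; binary files are never requested.
        if (not isBinaryLink(local.link)) arrayAppend(local.links, local.link);
        // Continue searching just past the end of this match.
        local.pos = local.match.pos[1] + local.match.len[1];
    }
    </cfscript>
    <cfreturn local.links>
</cffunction>
```

The spider would then loop over extractLinks(local.httpResult.fileContent) and recurse only into links under the base URL that it has not already visited.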
Roland
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
Of [EMAIL PROTECTED]
Sent: Friday, June 03, 2005 5:36 PM
To: [email protected]
Subject: [CFCDev] Spider
I'm trying to write a CFC that will spider a website and create an
inventory of all the pages/files on the website. It's a fairly simple
program but awfully slow. I build the page list in a structure called
request.tree. Here is the function:
<cffunction name="get">
<cfargument name="incomingURL" type="string">
<cfset var local=structNew()>
<cfhttp url="#arguments.incomingURL#" method="get"
resolveurl="yes"/>
<cfscript>
local.fileContent=cfhttp.fileContent;
request.tree[arguments.incomingURL] = structnew();
request.tree[arguments.incomingURL].linksArray=arraynew(1);
request.tree[arguments.incomingURL].hash=hash(local.fileContent);
local.startLink =
findnocase('http://',local.fileContent,1);
while (local.startLink)
{
local.endlink=min(findnocase('>',local.fileContent,local.startLink),findnocase('
',local.fileContent,local.startLink));
local.link=trim(mid(local.fileContent,local.startLink,local.endlink-local.startLink));
local.link=replace(local.link,chr(34),'',"ALL");
local.link=replace(local.link,'>','',"ALL");
local.link=replace(local.link,chr(32),'',"ALL");
arrayappend(request.tree[arguments.incomingURL].linksArray,local.link);
if ( local.link contains request.base and not
structkeyexists(request.tree,local.link) )
{
get(incomingURL=local.link,level=arguments.level+1);
}
local.startLink=findnocase('http://',local.fileContent,local.endlink);
}
</cfscript>
<cfreturn />
</cffunction>
Unfortunately, it's painfully slow even for fairly simple sites. Can
anybody make any suggestions?
Jason Cronk
[EMAIL PROTECTED]
----------------------------------------------------------
You are subscribed to cfcdev. To unsubscribe, send an email to
[email protected] with the words 'unsubscribe cfcdev' as the subject of the
email.
CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting
(www.cfxhosting.com).
CFCDev is supported by New Atlanta, makers of BlueDragon
http://www.newatlanta.com/products/bluedragon/index.cfm
An archive of the CFCDev list is available at
www.mail-archive.com/[email protected]
----------------------------------------------------------
Spider.cfc
Description: application/cfc
