On 6/3/05, Roland Collins <[EMAIL PROTECTED]> wrote:
Use regular expressions! Also, there's no reason to pull down image files and other binary content to look for links in them, so skip them. After rewriting your function to use regular expressions and ignore images, it seems to run about 3x faster on average at 3 levels deep. The savings should grow almost exponentially with the depth of the spider, due to the pruning of the files pulled.
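
A rough sketch of the two ideas above (not the attached CFC itself): pull link targets out of the page with a regular expression, and flag likely binary files by extension so they are never fetched. The function names extractLinks and isBinaryLink, the href/src pattern, and the extension list are all assumptions for illustration. REFindNoCase with subexpressions is used rather than anything newer, so this should also run on CF7.

```cfm
<!--- Hypothetical helper: collect quoted href/src attribute values with a regular expression --->
<cffunction name="extractLinks" returntype="array" output="false">
    <cfargument name="html" type="string" required="true">
    <cfset var links = arrayNew(1)>
    <cfset var pos = 1>
    <cfset var match = "">
    <cfloop condition="true">
        <!--- The 4th argument asks for pos/len arrays; element 3 is the attribute value (2nd group) --->
        <cfset match = reFindNoCase('(href|src)\s*=\s*["'']([^"'']+)["'']', arguments.html, pos, true)>
        <cfif match.pos[1] EQ 0>
            <cfbreak>
        </cfif>
        <cfset arrayAppend(links, mid(arguments.html, match.pos[3], match.len[3]))>
        <!--- Continue searching just past the end of this match --->
        <cfset pos = match.pos[1] + match.len[1]>
    </cfloop>
    <cfreturn links>
</cffunction>

<!--- Hypothetical helper: true when the link looks like a binary file (extension list is an assumption) --->
<cffunction name="isBinaryLink" returntype="boolean" output="false">
    <cfargument name="link" type="string" required="true">
    <cfreturn reFindNoCase("\.(gif|jpe?g|png|bmp|ico|pdf|zip|swf)(\?.*)?$", arguments.link) GT 0>
</cffunction>
```

In the spider loop, every extracted link can still be recorded in linksArray, but the recursive fetch would only happen when isBinaryLink(link) is false and the link is not already a key in request.tree.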
Attached is a CFC that contains a modified version of your function. To use it, initialize it and say go!
<cfset spider = createObject("component", "Spider").init("http://www.yoursite.com:80", 3)>
<cfset results = spider.get()>
<cfdump var="#results#">
This requires CF7. If you don't have CF7, replace "local.httpResult.fileContent" with "cfhttp.fileContent" and remove result="local.httpResult" from the cfhttp tag.
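
For illustration only, the two variants might look roughly like this. The wrapper functions fetchWithResult and fetchLegacy and the pageUrl argument are hypothetical names, not part of the attached CFC; only the result attribute and the two fileContent references come from the note above.

```cfm
<!--- CF7 and later: the result attribute stores the response in a named variable --->
<cffunction name="fetchWithResult" returntype="string" output="false">
    <cfargument name="pageUrl" type="string" required="true">
    <cfset var local = structNew()>
    <cfhttp url="#arguments.pageUrl#" method="get" resolveurl="yes" result="local.httpResult"/>
    <cfreturn local.httpResult.fileContent>
</cffunction>

<!--- Before CF7: no result attribute, so read the implicit cfhttp variable instead --->
<cffunction name="fetchLegacy" returntype="string" output="false">
    <cfargument name="pageUrl" type="string" required="true">
    <cfhttp url="#arguments.pageUrl#" method="get" resolveurl="yes"/>
    <cfreturn cfhttp.fileContent>
</cffunction>
```

The named result keeps the response local to the function, so nested or recursive calls do not overwrite a shared cfhttp variable.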
Roland
-----Original Message-----
From: [EMAIL PROTECTED] [mailto: [EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: Friday, June 03, 2005 5:36 PM
To: [email protected]
Subject: [CFCDev] Spider
I'm trying to write a CFC that will spider a website and create an inventory of all the pages/files on the website. It's a fairly simple program but awfully slow. I build the page list in a structure called request.tree. Here is the function:
<cffunction name="get">
    <cfargument name="incomingURL" type="string">
    <!--- level is used in the recursive call below but was not declared in the post; this declaration and default are assumed --->
    <cfargument name="level" type="numeric" default="0">
    <cfset var local=structNew()>
    <!--- the url value was lost in the archive; arguments.incomingURL is assumed --->
    <cfhttp url="#arguments.incomingURL#" resolveurl="yes"/>
    <cfscript>
        local.fileContent=cfhttp.fileContent;
        // one entry per page: the links found on it and a hash of its content
        request.tree[arguments.incomingURL]=structnew();
        request.tree[arguments.incomingURL].linksArray=arraynew(1);
        request.tree[arguments.incomingURL].hash=hash(local.fileContent);
        local.startLink=findnocase('http://',local.fileContent,1);
        while (local.startLink)
        {
            // a link ends at the next '>' or space, whichever comes first
            local.endlink=min(findnocase('>',local.fileContent,local.startLink),findnocase(' ',local.fileContent,local.startLink));
            local.link=trim(mid(local.fileContent,local.startLink,local.endlink-local.startLink));
            local.link=replace(local.link,chr(34),'',"ALL");
            local.link=replace(local.link,'>','',"ALL");
            local.link=replace(local.link,chr(32),'',"ALL");
            arrayappend(request.tree[arguments.incomingURL].linksArray,local.link);
            // recurse only into pages on the same site that have not been seen yet
            if (local.link contains request.base and not structkeyexists(request.tree,local.link))
            {
                get(incomingURL=local.link,level=arguments.level+1);
            }
            local.startLink=findnocase('http://',local.fileContent,local.endlink);
        }
    </cfscript>
    <cfreturn/>
</cffunction>
Unfortunately, it's painfully slow even for fairly simple sites. Can anybody make any suggestions?
Jason Cronk
[EMAIL PROTECTED]
----------------------------------------------------------
You are subscribed to cfcdev. To unsubscribe, send an email to [email protected] with the words
'unsubscribe cfcdev' as the subject of the email.
CFCDev is run by CFCZone (www.cfczone.org)
and supported by CFXHosting (www.cfxhosting.com).
CFCDev is supported by New Atlanta, makers of BlueDragon
http://www.newatlanta.com/products/bluedragon/index.cfm
An archive of the CFCDev list is available at www.mail-archive.com/[email protected]