The first class below can spider a site, and there is another that can make a copy of any site it can access; just give it the base URL and bam.
http://www.acme.com/java/software/Acme.Spider.html
(here is an implementation of it as an applet: http://www.acme.com/java/software/WebList.html)
http://www.acme.com/java/software/WebCopy.html
Don't reinvent the wheel :O)
Bill
On 6/3/05, Roland Collins <[EMAIL PROTECTED]> wrote:
Use Regular Expressions!!! Also, there's no reason to pull down image
files, etc. and look for links in them since the content is binary, so skip
them! After rewriting your function using RE and ignoring images, it seems
to run on average 3x faster at 3 levels deep. This should be almost an
exponential savings relative to the depth of the spider due to the pruning
of the files pulled.
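The attached CFC is not reproduced in this archive, so what follows is only a rough sketch of the two ideas above (regular-expression link extraction and skipping binary files), not Roland's actual code. The function names extractLinks and isBinaryLink and the list of file extensions are hypothetical, chosen for illustration.

```cfm
<!--- Hypothetical helper: collect the href values found in a page. --->
<cffunction name="extractLinks" returntype="array" output="false">
  <cfargument name="html" type="string" required="true">
  <cfscript>
    var links = arrayNew(1);
    var pos = 1;
    var m = "";
    // Matches href="..." or href='...'; subexpression 1 is the URL itself.
    var pattern = "href\s*=\s*[""']([^""']+)[""']";
    while (true) {
      // returnsubexpressions=true gives back pos/len arrays for each group
      m = reFindNoCase(pattern, arguments.html, pos, true);
      if (m.pos[1] eq 0) break;
      arrayAppend(links, mid(arguments.html, m.pos[2], m.len[2]));
      pos = m.pos[1] + m.len[1];
    }
    return links;
  </cfscript>
</cffunction>

<!--- Hypothetical helper: true when the link looks like a binary file.
      The extension list is an assumption; adjust it to the site. --->
<cffunction name="isBinaryLink" returntype="boolean" output="false">
  <cfargument name="link" type="string" required="true">
  <cfreturn reFindNoCase("\.(gif|jpe?g|png|bmp|ico|swf|pdf|zip)(\?.*)?$", arguments.link) gt 0>
</cffunction>

<!--- Possible usage: only follow links that are not binary files. --->
<cfset links = extractLinks(pageContent)>
<cfloop from="1" to="#arrayLen(links)#" index="i">
  <cfif not isBinaryLink(links[i])>
    <!--- fetch and spider links[i] here --->
  </cfif>
</cfloop>
```

Here pageContent stands in for whatever variable holds the fetched HTML. Skipping a binary link avoids an entire cfhttp round trip, which is presumably where most of the time goes.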
Attached is a CFC that contains a modified version of your function. To use
it, initialize it and say go!
<cfset spider = createObject("component", "Spider").init("http://www.yoursite.com:80", 3)>
<cfset results = spider.get()>
<cfdump var="#results#">
This requires CF7. If you don't have CF7, replace
"local.httpResult.fileContent" with "cfhttp.fileContent" and remove
result="local.httpResult" from the cfhttp tag.
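As a sketch of that substitution, the two variants of the fetch would look roughly like this. The url value is taken from the argument name in the original function below; the rest of the attached CFC is not shown in the archive, so treat this as an assumption about its shape.

```cfm
<!--- CF7: the result attribute stores the response in a named variable --->
<cfhttp url="#arguments.incomingURL#" resolveurl="yes" result="local.httpResult"/>
<cfset local.fileContent = local.httpResult.fileContent>

<!--- Before CF7: no result attribute, so read the shared cfhttp scope --->
<cfhttp url="#arguments.incomingURL#" resolveurl="yes"/>
<cfset local.fileContent = cfhttp.fileContent>
```

The CF7 form keeps each response in a function-local variable, which matters for a recursive function because every nested call would otherwise overwrite the same cfhttp variable.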
Roland
-----Original Message-----
From: [EMAIL PROTECTED] [mailto: [EMAIL PROTECTED]] On Behalf
Of [EMAIL PROTECTED]
Sent: Friday, June 03, 2005 5:36 PM
To: [email protected]
Subject: [CFCDev] Spider
I'm trying to write a CFC that will spider a website and create an
inventory of all the pages/files on the website. It's a fairly simple
program but awfully slow. I build the page list in a structure called
request.tree. Here is the function:
<cffunction name="get">
<cfargument name="incomingURL" type="string">
<cfargument name="level" type="numeric" default="1">
<cfset var local=structNew()>
<cfhttp url="#arguments.incomingURL#" resolveurl="yes"/>
<cfscript>
  local.fileContent=cfhttp.fileContent;
  request.tree[arguments.incomingURL]=structnew();
  request.tree[arguments.incomingURL].linksArray=arraynew(1);
  request.tree[arguments.incomingURL].hash=hash(local.fileContent);
  local.startLink=findnocase('http://',local.fileContent,1);
  while (local.startLink)
  {
    // A link ends at the next '>' or space, whichever comes first
    // (the space needle is assumed; it was lost in line wrapping).
    local.endlink=min(findnocase('>',local.fileContent,local.startLink),findnocase(' ',local.fileContent,local.startLink));
    local.link=trim(mid(local.fileContent,local.startLink,local.endlink-local.startLink));
    local.link=replace(local.link,chr(34),'',"ALL");
    local.link=replace(local.link,'>','',"ALL");
    local.link=replace(local.link,chr(32),'',"ALL");
    arrayappend(request.tree[arguments.incomingURL].linksArray,local.link);
    // Recurse only into same-site links that have not been visited yet
    if (local.link contains request.base and not structkeyexists(request.tree,local.link))
    {
      get(incomingURL=local.link,level=arguments.level+1);
    }
    local.startLink=findnocase('http://',local.fileContent,local.endlink);
  }
</cfscript>
<cfreturn />
</cffunction>
Unfortunately, it's painfully slow even for fairly simple sites. Can
anybody make any suggestions?
Jason Cronk
[EMAIL PROTECTED]
----------------------------------------------------------
You are subscribed to cfcdev. To unsubscribe, send an email to
[email protected] with the words 'unsubscribe cfcdev' as the subject of the
email.
CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting
(www.cfxhosting.com).
CFCDev is supported by New Atlanta, makers of BlueDragon
http://www.newatlanta.com/products/bluedragon/index.cfm
An archive of the CFCDev list is available at
www.mail-archive.com/[email protected]
----------------------------------------------------------
--
[EMAIL PROTECTED]
http://blog.rawlinson.us
If you want Gmail - just ask.
