But CF compiles to native Java bytecode and runs on the same JVM that the Java version would anyway, so I don't know why you think it would be faster just because it's a "native" Java implementation.
Roland
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Bill Rawlinson
Sent: Sunday, June 05, 2005 10:41 AM
To: [email protected]
Subject: Re: [CFCDev] Spider
I think that, depending on your circumstances, any solution will be slow since it has to get each page and parse it, but I'm sure that without a lot of work the Java solution could be made to run much faster.
On 6/4/05, Roland Collins <[EMAIL PROTECTED]> wrote:
It is sloooooow ;)
There is a free Java class library out there that does this. One class can spider, and there is another that can make a copy of any site that it can access; just give it the base URL and bam.
http://www.acme.com/java/software/Acme.Spider.html
(here is an implementation of it as an applet: http://www.acme.com/java/software/WebList.html)
http://www.acme.com/java/software/WebCopy.html
Don't reinvent the wheel :O)
Bill
On 6/3/05, Roland Collins <[EMAIL PROTECTED]> wrote:
Use Regular Expressions!!! Also, there's no reason to pull down image files, etc. and look for links in them since the content is binary, so skip them! After rewriting your function using RE and ignoring images, it seems to run on average 3x faster at 3 levels deep. This should be almost an exponential savings relative to the depth of the spider due to the pruning of the files pulled.
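The two suggestions above (regular expressions for link extraction, and skipping binary files) can be sketched roughly as follows. This is written in Java rather than CFML only so that it is self-contained and runnable; it is not the attached CFC. The class name LinkFilter, the extension list, and the choice to match only quoted href/src attributes are all assumptions made for illustration, and it uses a prefix match on the base URL where the original function used "contains".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the thread's attachment.
final class LinkFilter {
    // Absolute http:// links inside quoted href or src attributes.
    private static final Pattern LINK = Pattern.compile(
        "(?:href|src)\\s*=\\s*[\"'](http://[^\"'\\s>]+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Extensions assumed to be binary; adjust the list to suit the site.
    private static final Pattern BINARY = Pattern.compile(
        "\\.(?:gif|jpe?g|png|bmp|ico|pdf|zip|exe|swf)$",
        Pattern.CASE_INSENSITIVE);

    /** Returns every matching link in the page, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** True when the URL, ignoring query string and fragment, ends in a listed extension. */
    static boolean looksBinary(String url) {
        String path = url.split("[?#]", 2)[0];
        return BINARY.matcher(path).find();
    }

    /** Links worth fetching: they start with the base URL and do not look binary. */
    static List<String> crawlable(String html, String base) {
        String lowerBase = base.toLowerCase(Locale.ROOT);
        List<String> keep = new ArrayList<>();
        for (String link : extractLinks(html)) {
            if (link.toLowerCase(Locale.ROOT).startsWith(lowerBase) && !looksBinary(link)) {
                keep.add(link);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.yoursite.com/about.cfm\">About</a>"
            + "<img src=\"http://www.yoursite.com/logo.gif\">"
            + "<a href='http://elsewhere.example/page.html'>Out</a>";
        System.out.println(crawlable(html, "http://www.yoursite.com"));
    }
}
```

Filtering by extension before fetching is what prunes the crawl: a skipped image costs no HTTP request at all. The extension test is only a heuristic, since a URL without an extension can still return binary content.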
Attached is a CFC that contains a modified version of your function. To use it, initialize it and say go!

<cfset spider = createObject("component", "Spider").init("http://www.yoursite.com:80", 3)>
<cfset results = spider.get()>
<cfdump var="#results#">

This requires CF7. If you don't have CF7, replace "local.httpResult.fileContent" with "cfhttp.fileContent" and remove result="local.httpResult" from the cfhttp tag.
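The second argument to init() above is presumably the depth limit ("3 levels deep"). A depth-limited crawl that never fetches the same page twice might look roughly like the sketch below. It is in Java so that it runs on its own, and it is not the attached CFC: the class name DepthLimitedSpider, the breadth-first order, and the injected fetch function are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a depth-limited spider.
final class DepthLimitedSpider {
    private static final Pattern LINK =
        Pattern.compile("http://[^\"'\\s<>]+", Pattern.CASE_INSENSITIVE);

    /**
     * Crawls breadth-first from base, fetching pages at most maxDepth hops away.
     * fetch maps a URL to its page text (never null); it is injected so the
     * sketch can run without network access.
     * Returns each fetched page mapped to the links found on it.
     */
    static Map<String, List<String>> crawl(String base, int maxDepth,
                                           Function<String, String> fetch) {
        Map<String, List<String>> tree = new LinkedHashMap<>();
        Map<String, Integer> depthOf = new HashMap<>(); // doubles as the visited set
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(base, 0);
        queue.add(base);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            List<String> links = new ArrayList<>();
            Matcher m = LINK.matcher(fetch.apply(url));
            while (m.find()) {
                links.add(m.group());
            }
            tree.put(url, links);
            int next = depthOf.get(url) + 1;
            if (next > maxDepth) {
                continue; // record the links but do not follow them
            }
            for (String link : links) {
                // Follow only same-site links that have not been queued yet.
                if (link.startsWith(base) && !depthOf.containsKey(link)) {
                    depthOf.put(link, next);
                    queue.add(link);
                }
            }
        }
        return tree;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://s.example", "<a href=\"http://s.example/a\">a</a>");
        pages.put("http://s.example/a", "<a href=\"http://s.example/b\">b</a>");
        System.out.println(crawl("http://s.example", 1, u -> pages.getOrDefault(u, "")).keySet());
    }
}
```

Marking a link as visited when it is queued, rather than when it is fetched, keeps a page that is linked from many places from being queued more than once.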
Roland
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: Friday, June 03, 2005 5:36 PM
To: [email protected]
Subject: [CFCDev] Spider
I'm trying to write a CFC that will spider a website and create an inventory of all the pages/files on the website. It's a fairly simple program but awfully slow. I build the page list in a structure called request.tree. Here is the function:
<cffunction name="get">
    <cfargument name="incomingURL" type="string">
    <!--- level is passed by the recursive call below, so it is declared here; the default of 0 is an assumption --->
    <cfargument name="level" type="numeric" default="0">
    <cfset var local=structNew()>
    <cfhttp url="" resolveurl="yes"/>
    <cfscript>
    local.fileContent=cfhttp.fileContent;
    // one entry per page: the links found on it and a hash of its content
    request.tree[arguments.incomingURL]=structnew();
    request.tree[arguments.incomingURL].linksArray=arraynew(1);
    request.tree[arguments.incomingURL].hash=hash(local.fileContent);
    // scan the page for absolute links, one at a time
    local.startLink=findnocase('http://',local.fileContent,1);
    while (local.startLink)
    {
        // a link is taken to end at the next '>' or space, whichever comes first
        local.endlink=min(findnocase('>',local.fileContent,local.startLink),findnocase(' ',local.fileContent,local.startLink));
        local.link=trim(mid(local.fileContent,local.startLink,local.endlink-local.startLink));
        // strip quotes, angle brackets and spaces left around the link
        local.link=replace(local.link,chr(34),'',"ALL");
        local.link=replace(local.link,'>','',"ALL");
        local.link=replace(local.link,chr(32),'',"ALL");
        arrayappend(request.tree[arguments.incomingURL].linksArray,local.link);
        // recurse only into pages on this site that are not in the tree yet
        if (local.link contains request.base and not structkeyexists(request.tree,local.link))
        {
            get(incomingURL=local.link,level=arguments.level+1);
        }
        local.startLink=findnocase('http://',local.fileContent,local.endlink);
    }
    </cfscript>
    <cfreturn />
</cffunction>
Unfortunately, it's painfully slow even for fairly simple sites. Can anybody make any suggestions?
Jason Cronk
[EMAIL PROTECTED]
----------------------------------------------------------
You are subscribed to cfcdev. To unsubscribe, send an email to [email protected] with the words 'unsubscribe cfcdev' as the subject of the email.
CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting (www.cfxhosting.com).
CFCDev is supported by New Atlanta, makers of BlueDragon
http://www.newatlanta.com/products/bluedragon/index.cfm
An archive of the CFCDev list is available at www.mail-archive.com/[email protected]
--
[EMAIL PROTECTED]
http://blog.rawlinson.us
If you want Gmail - just ask.