Instead of hashing the documents, then, just use the HEAD method for images. It returns the Last-Modified date; use that to track file changes instead. HEAD was designed for exactly this sort of thing, so you don't waste other people's bandwidth ;)
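Something along these lines should do it (untested sketch; assumes CFMX 6.1+, where cfhttp supports the result attribute; imageURL and request.inventory are hypothetical names for the image address and a struct saved from the previous crawl):

    <!--- Fetch headers only -- no message body comes back --->
    <cfhttp url="#imageURL#" method="head" result="headCheck">

    <cfif structKeyExists(headCheck.responseHeader, "Last-Modified")>
        <cfset lastMod = headCheck.responseHeader["Last-Modified"]>
        <!--- Re-fetch with GET only when the image is new or the
              server reports a change --->
        <cfif NOT structKeyExists(request.inventory, imageURL)
              OR lastMod NEQ request.inventory[imageURL].lastModified>
            <cfset request.inventory[imageURL] = structNew()>
            <cfset request.inventory[imageURL].lastModified = lastMod>
            <cfset request.inventory[imageURL].changed = true>
        </cfif>
    </cfif>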
"HEAD: identical to the GET method, but the server does not send a message body in the response. Use this method for testing hypertext links for validity and accessibility, determining the type or modification time of a document, or determining the type of server." http://livedocs.macromedia.com/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/h tml/wwhelp.htm?context=ColdFusion_Documentation&file=00000272.htm You will still need to use GET for textual docs so you can spider the links, but HEAD will work just fine for images, etc. Instead of comparing hashes, always compare the modified date. Cheers, Roland -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Monday, June 06, 2005 10:13 AM To: [email protected] Subject: Re: [CFCDev] Spider Thanks for the great replies folks. Nathan wrote: >Well, the first suggestion is don't use ColdFusion for this -- not >really the best tool for the job. Well when the only tool you have is a hammer...... I don't know perl or I'd probably do it in that. >Not sure why you necessarily need to keep a hash of the file contents, >but that is likely slow for big chunks of HTML. I need to monitor the site for changes. If a page changes then the hash will change too and I'll keep a new copy. >You might find a simple regex to find links is faster than all the >string manipulation you are doing. I definately will do this. >Also, I didn't look too carefully, but it doesn't seem like your code >really deals with circular references in a web site -- pages that link >to each other could cause this to just go and go and go, no? Same goes >for links outside the site -- what causes it to stop crawling (I don't >see what you do with the LEVEL argument, for instance)? Yes it does look for this. It keeps everything in a request scope structure so that if the key (website URL) already exists in the structure it doesn't respider that page. Roland wrote: >Also, there's no reason to pull down image files, etc. and look for links in them since the content is binary, so >skip them! Actually this is the main purpose of my app. I need to create an inventory of all the images on a site and show a listing of everyplace that image is shown. By hashing the image files I can hopefully identify the same image that has been renamed and placed on a different part of the site. Again thanks for the suggestions. Cheers Jason Cronk [EMAIL PROTECTED] ---------------------------------------------------------- You are subscribed to cfcdev. To unsubscribe, send an email to [email protected] with the words 'unsubscribe cfcdev' as the subject of the email. CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting (www.cfxhosting.com). CFCDev is supported by New Atlanta, makers of BlueDragon http://www.newatlanta.com/products/bluedragon/index.cfm An archive of the CFCDev list is available at www.mail-archive.com/[email protected] ---------------------------------------------------------- You are subscribed to cfcdev. To unsubscribe, send an email to [email protected] with the words 'unsubscribe cfcdev' as the subject of the email. CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting (www.cfxhosting.com). CFCDev is supported by New Atlanta, makers of BlueDragon http://www.newatlanta.com/products/bluedragon/index.cfm An archive of the CFCDev list is available at www.mail-archive.com/[email protected]
