Instead of hashing the documents, then, just use the HEAD method for images. It returns the Last-Modified date; use that to track file changes instead. HEAD was designed for exactly this sort of thing, so you don't waste other people's bandwidth ;)
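Something along these lines should do it (untested sketch; assumes CFMX 6.1+, where cfhttp supports the result attribute; imageURL and request.inventory are hypothetical names for the image address and a struct saved from the previous crawl):

    <!--- Fetch headers only -- no message body comes back --->
    <cfhttp url="#imageURL#" method="head" result="headCheck">

    <cfif structKeyExists(headCheck.responseHeader, "Last-Modified")>
        <cfset lastMod = headCheck.responseHeader["Last-Modified"]>
        <!--- Re-fetch with GET only when the image is new or the
              server reports a change --->
        <cfif NOT structKeyExists(request.inventory, imageURL)
              OR lastMod NEQ request.inventory[imageURL].lastModified>
            <cfset request.inventory[imageURL] = structNew()>
            <cfset request.inventory[imageURL].lastModified = lastMod>
            <cfset request.inventory[imageURL].changed = true>
        </cfif>
    </cfif>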
"HEAD: identical to the GET method, but the server does not send a message body in the response. Use this method for testing hypertext links for validity and accessibility, determining the type or modification time of a document, or determining the type of server." http://livedocs.macromedia.com/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/h tml/wwhelp.htm?context=ColdFusion_Documentation&file=00000272.htm You will still need to use GET for textual docs so you can spider the links, but HEAD will work just fine for images, etc. Instead of comparing hashes, always compare the modified date. Cheers, Roland -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Monday, June 06, 2005 10:13 AM To: [email protected] Subject: Re: [CFCDev] Spider Thanks for the great replies folks. Nathan wrote: >Well, the first suggestion is don't use ColdFusion for this -- not >really the best tool for the job. Well when the only tool you have is a hammer...... I don't know perl or I'd probably do it in that. >Not sure why you necessarily need to keep a hash of the file contents, >but that is likely slow for big chunks of HTML. I need to monitor the site for changes. If a page changes then the hash will change too and I'll keep a new copy. >You might find a simple regex to find links is faster than all the >string manipulation you are doing. I definately will do this. >Also, I didn't look too carefully, but it doesn't seem like your code >really deals with circular references in a web site -- pages that link >to each other could cause this to just go and go and go, no? Same goes >for links outside the site -- what causes it to stop crawling (I don't >see what you do with the LEVEL argument, for instance)? Yes it does look for this. It keeps everything in a request scope structure so that if the key (website URL) already exists in the structure it doesn't respider that page. Roland wrote: >Also, there's no reason to pull down image files, etc. and look for links in them since the content is binary, so >skip them! Actually this is the main purpose of my app. I need to create an inventory of all the images on a site and show a listing of everyplace that image is shown. By hashing the image files I can hopefully identify the same image that has been renamed and placed on a different part of the site. Again thanks for the suggestions. Cheers Jason Cronk [EMAIL PROTECTED] ---------------------------------------------------------- You are subscribed to cfcdev. To unsubscribe, send an email to [email protected] with the words 'unsubscribe cfcdev' as the subject of the email. CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting (www.cfxhosting.com). CFCDev is supported by New Atlanta, makers of BlueDragon http://www.newatlanta.com/products/bluedragon/index.cfm An archive of the CFCDev list is available at www.mail-archive.com/[email protected] ---------------------------------------------------------- You are subscribed to cfcdev. To unsubscribe, send an email to [email protected] with the words 'unsubscribe cfcdev' as the subject of the email. CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting (www.cfxhosting.com). CFCDev is supported by New Atlanta, makers of BlueDragon http://www.newatlanta.com/products/bluedragon/index.cfm An archive of the CFCDev list is available at www.mail-archive.com/[email protected]
