Sorry if I wasn't clear, but unfortunately that won't work for my needs.  I
need to be able to identify whether two image files are the same.  For
instance, if one has

http://www.mydomain.com/images/5684.jpg

and

http://mydomain.com/testSite/images/1300.jpg

and those images are really the same, I need to identify that, so that when
I cross-reference all occurrences of the image 5684.jpg in the website I
also catch it showing up on pages as
http://mydomain.com/testSite/images/1300.jpg

Now, I haven't actually tested whether CF will hash those two binaries the
same, and I'm looking at a more robust Custom Tag that hashes files.
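
A minimal sketch of the comparison in question, written in Python rather than
CF. Hashing the raw bytes of each file means two byte-identical images get the
same digest regardless of the URL or filename they were fetched from. The
function name file_digest and the sample bytes are made up for illustration,
and SHA-256 is an assumption; any stable hash would do.

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Return a hex digest of the raw file bytes (the name/URL plays no part)."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical stand-ins for the bytes downloaded from the two URLs.
image_5684 = b"\xff\xd8\xff\xe0 fake jpeg payload"
image_1300 = b"\xff\xd8\xff\xe0 fake jpeg payload"
other_image = b"\xff\xd8\xff\xe0 different payload"

same = file_digest(image_5684) == file_digest(image_1300)
different = file_digest(image_5684) != file_digest(other_image)
```

Note that this only catches exact copies: an image that was resized or
re-saved has different bytes and therefore a different digest.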

Jason Cronk
[EMAIL PROTECTED]




"Roland Collins" <[EMAIL PROTECTED].com>
Sent by: [EMAIL PROTECTED]one.org
06/06/2005 02:21 PM
Please respond to CFCDev

To:       [email protected]
cc:
Subject:  RE: [CFCDev] Spider




Instead of hashing the documents then, just use the HEAD method for images.
It will return the last modified date - use this to track file changes
instead.  HEAD was designed to do this so that you don't waste other
people's bandwidth ;)

"HEAD: identical to the GET method, but the server does not send a message
body in the response. Use this method for testing hypertext links for
validity and accessibility, determining the type or modification time of a
document, or determining the type of server."

http://livedocs.macromedia.com/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/html/wwhelp.htm?context=ColdFusion_Documentation&file=00000272.htm

You will still need to use GET for textual docs so you can spider the links,
but HEAD will work just fine for images, etc.  Instead of comparing hashes,
always compare the modified date.
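
As a rough illustration of the HEAD approach (in Python, not CF), the sketch
below builds a HEAD request and parses the Last-Modified header from the
response. The helper names are hypothetical, and it assumes the server
actually sends Last-Modified, which not every server does.

```python
import urllib.request
from email.utils import parsedate_to_datetime

def build_head_request(url: str) -> urllib.request.Request:
    """Build a HEAD request: like GET, but no message body comes back."""
    return urllib.request.Request(url, method="HEAD")

def last_modified(headers):
    """Parse the Last-Modified header, or return None if it is absent."""
    value = headers.get("Last-Modified")
    return parsedate_to_datetime(value) if value else None

def fetch_last_modified(url: str, timeout: float = 10.0):
    """Send the HEAD request and return the parsed date (makes a network call)."""
    with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
        return last_modified(resp.headers)
```

Comparing the returned date against the one stored from the previous crawl
shows whether a given URL has changed, without downloading the file itself.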

Cheers,
Roland


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: Monday, June 06, 2005 10:13 AM
To: [email protected]
Subject: Re: [CFCDev] Spider





Thanks for the great replies folks.

Nathan wrote:
>Well, the first suggestion is don't use ColdFusion for this -- not
>really the best tool for the job.

Well, when the only tool you have is a hammer...  I don't know Perl, or
I'd probably do it in that.

>Not sure why you necessarily need to keep a hash of the file contents,
>but that is likely slow for big chunks of HTML.

I need to monitor the site for changes.  If a page changes then the hash
will change too and I'll keep a new copy.
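
A small sketch of that change check, in Python for illustration: keep the
digest from the last crawl per URL and flag any page whose digest differs or
is new. The function name detect_changes and the dictionary layout are
assumptions, not the structure used in the actual app.

```python
import hashlib

def detect_changes(previous: dict, pages: dict):
    """Compare current page bodies against stored digests.

    previous: URL -> digest recorded on the last crawl
    pages:    URL -> page body (bytes) fetched on this crawl
    Returns (changed_urls, new_digests).
    """
    current = {url: hashlib.sha256(body).hexdigest() for url, body in pages.items()}
    changed = sorted(url for url, digest in current.items() if previous.get(url) != digest)
    return changed, current

# First crawl: nothing stored yet, so every page counts as changed.
first_changed, stored = detect_changes({}, {"/a": b"one", "/b": b"two"})
# Second crawl: only /b has different content.
second_changed, stored = detect_changes(stored, {"/a": b"one", "/b": b"two!"})
```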


>You might find a simple regex to find links is faster than all the
>string manipulation you are doing.

I definitely will do this.
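
For what the regex approach might look like, here is a deliberately simple
Python sketch. The pattern is an assumption made for illustration: it only
handles quoted href/src attributes and will miss unquoted values or links
built by script.

```python
import re

# Capture the value of a quoted href=... or src=... attribute, any letter case.
LINK_RE = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def extract_links(html: str) -> list:
    """Return every href/src value found in the markup, in document order."""
    return LINK_RE.findall(html)

sample = '<a href="/about.cfm">About</a> <IMG SRC=\'images/5684.jpg\'>'
links = extract_links(sample)
```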

>Also, I didn't look too carefully, but it doesn't seem like your code
>really deals with circular references in a web site -- pages that link
>to each other could cause this to just go and go and go, no?  Same goes
>for links outside the site -- what causes it to stop crawling (I don't
>see what you do with the LEVEL argument, for instance)?

Yes, it does look for this.  It keeps everything in a request-scope
structure so that if the key (the page's URL) already exists in the
structure, it doesn't respider that page.
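
The idea can be sketched like this in Python: a set of already-seen URLs
stands in for the request-scope structure, and a level counter limits how far
the crawl goes. The get_links callable and the tiny site dictionary are
hypothetical; a real spider would fetch and parse pages there, and would also
need a rule for links that leave the site.

```python
from collections import deque

def crawl(start: str, get_links, max_level: int) -> list:
    """Breadth-first crawl that never visits the same URL twice.

    get_links: callable returning the URLs linked from a page (hypothetical).
    max_level: how many link hops away from `start` to follow.
    """
    visited = {start}            # plays the role of the request-scope structure
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if level >= max_level:
            continue             # depth limit reached: do not follow further
        for link in get_links(url):
            if link not in visited:   # key already present means skip it
                visited.add(link)
                queue.append((link, level + 1))
    return order

# Made-up site in which /a and /b link to each other.
site = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": ["/a", "/c"], "/c": []}
pages = crawl("/", lambda u: site.get(u, []), max_level=2)
```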

Roland wrote:
>Also, there's no reason to pull down image files, etc. and look for links
>in them since the content is binary, so skip them!

Actually, this is the main purpose of my app.  I need to create an inventory
of all the images on a site and show a listing of every place that image is
shown.  By hashing the image files I can hopefully identify the same image
that has been renamed and placed on a different part of the site.
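
One way such an inventory could be organized, sketched in Python with made-up
data: key each entry by the digest of the image bytes, and collect under it
every image URL and every page where those bytes appeared. The function name
build_inventory and the tuple layout are assumptions for illustration.

```python
import hashlib
from collections import defaultdict

def build_inventory(sightings):
    """Group image sightings by content digest.

    sightings: iterable of (page_url, image_url, image_bytes) tuples.
    Returns {digest: {"urls": set of image URLs, "pages": set of page URLs}}.
    """
    inventory = defaultdict(lambda: {"urls": set(), "pages": set()})
    for page_url, image_url, image_bytes in sightings:
        digest = hashlib.sha256(image_bytes).hexdigest()
        inventory[digest]["urls"].add(image_url)
        inventory[digest]["pages"].add(page_url)
    return dict(inventory)

# Made-up data: the same bytes served under two names on two pages.
logo = b"fake image bytes"
sightings = [
    ("/index.cfm", "/images/5684.jpg", logo),
    ("/testSite/index.cfm", "/testSite/images/1300.jpg", logo),
    ("/index.cfm", "/images/banner.jpg", b"other bytes"),
]
inventory = build_inventory(sightings)
```

Any entry whose "urls" set has more than one member is a renamed or relocated
copy of the same image.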

Again thanks for the suggestions.

Cheers

Jason Cronk
[EMAIL PROTECTED]





----------------------------------------------------------
You are subscribed to cfcdev. To unsubscribe, send an email to
[email protected] with the words 'unsubscribe cfcdev' as the subject of
the email.

CFCDev is run by CFCZone (www.cfczone.org) and supported by CFXHosting
(www.cfxhosting.com).

CFCDev is supported by New Atlanta, makers of BlueDragon
http://www.newatlanta.com/products/bluedragon/index.cfm

An archive of the CFCDev list is available at
www.mail-archive.com/[email protected]






