Oops, sorry. That's what I get for not making you guys wade through a very
large page of crap code.
<CFSCRIPT>
// De-dupe a list: sort it so duplicate items become adjacent, then use a
// backreference regex to collapse each run of repeats down to one item.
function DeDupe(list, type) {
	return REReplaceNoCase(ListSort(list, type), "([^,]+)(,\1)*", "\1", "ALL");
}
// Convert getTickCount() milliseconds to seconds, one decimal place.
function msToSec(tick) {
	return numberFormat(tick / 1000, "9999.9");
}
</CFSCRIPT>
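For example (a quick illustration -- "b,a,c,a,b" is just a made-up sample
list): the sort turns it into "a,a,b,b,c", and the regex then collapses each
run of repeats.

<CFSCRIPT>
// ListSort("b,a,c,a,b", "text") gives "a,a,b,b,c"; the replace
// then reduces each repeated run to a single item: "a,b,c".
WriteOutput(DeDupe("b,a,c,a,b", "text"));
</CFSCRIPT>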
----- Original Message -----
From: "Robert Everland" <[EMAIL PROTECTED]>
To: "CF-Talk" <[EMAIL PROTECTED]>
Sent: Monday, June 17, 2002 3:49 PM
Subject: RE: CFMX Spidering for cache
> I am playing with the code, looks like you are using a function called
> dedupe, do you have this?
>
> Robert Everland III
> Web Developer Extraordinaire
> Dixon Ticonderoga Company
> http://www.dixonusa.com
>
> -----Original Message-----
> From: Pete Ruckelshaus [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 17, 2002 3:13 PM
> To: CF-Talk
> Subject: Re: CFMX Spidering for cache
>
>
> Here's a bit of code I wrote (well, it's half-complete, but does what I
> need it to do, which is spider the site and preload the CF cache). Pardon
> the ugliness; you'll probably have to define a couple of variables and
> create a form interface for this, but it's the result of more than a
> couple of hours of work and should be enough to get you started. You could
> start with this and set up an application variable that, if it isn't
> present, triggers this script...so it gets run whenever the service is
> restarted:
>
> Pete
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> <cfset preloadStart = getTickCount()>
> Finding URL's from <cfoutput>#URL.startURL#</cfoutput> and preloading....<br /><br />
> <cfhttp url="#URL.startURL#" method="get" resolveurl="true"></cfhttp>
> <cfparam name="seed_list" default="">
> <cfparam name="ptr" default="1">
> <cfloop condition="ptr LT len(cfhttp.fileContent) AND ptr GT 0">
>     <cfset hit = REFind("(http://[a-zA-Z0-9\.\/\:\-\+\?\&\_\%\=\-]+)", cfhttp.fileContent, ptr, "true")>
>     <cfif hit.pos[1] GT 0>
>         <cfset link_found = Mid(cfhttp.fileContent, hit.pos[1], hit.len[1])>
>         <cfif findNoCase("#URL.startURL#", link_found) AND findNoCase(".htm", link_found) AND NOT findNoCase("/tools/", link_found)>
>             <cfset seed_list = ListAppend(seed_list, link_found)>
>         </cfif>
>         <cfset ptr = hit.pos[1] + 1>
>     <cfelse>
>         <cfset ptr = 0>
>     </cfif>
> </cfloop>
> <cfset seed_list = listSort(seed_list, "TextNoCase")>
> <cfif len(trim(seed_list)) GT 0>
> <b>Seed list generated...</b>
> <cfoutput><ol>
> <cfloop index="i" list="#seed_list#" delimiters=","><li>#i#</li></cfloop>
> </ol></cfoutput>
> <cfelse>
> NO URL's found, aborting...
> <cfabort>
> </cfif>
> <hr>
> <ol>
>     <li>At this point, we have a list of starting URL's. [done]</li>
>     <li>Set a variable called seed_list [done]</li>
>     <li>If it's the first loop iteration, use seed_list [done]</li>
>     <li>If it's a subsequent iteration, use temp_list</li>
>     <li>At the end of each loop, save 2 variables -- full_list, which is
>     ALL of the URL's that just got spidered, and temp_list, which is
>     url_list with the contents of full_list removed, so each page only
>     gets spidered once.</li>
>     <li>After the loops have run, take the contents of url_list and save
>     it to a text file.</li>
> </ol>
> <hr>
> <cfset loopCount = 1>
> <cfloop index="i" from="1" to="#numLoops#">
>     <cfif loopCount IS 1>
>         <!--- set temp_urls to seed_list and use that to surf --->
>         <cfset processed_urls = "http://localhost/default.cfm">
>         <!--- set good_urls to seed_list and use that to store all good values --->
>         <cfset in_process_urls = seed_list>
>     </cfif>
>     <cfoutput>
>     <h3>Loop #i#, processing #listLen(in_process_urls)# URL's (#loopCount#)</h3>
>     <ol>
>     <cfset to_do_urls = "">
>     <cfloop list="#in_process_urls#" index="url">
>         <li>#url# spidered...</li>
>         <cfset finalCount = deDupe(in_process_urls, "text")>
>         <cfhttp url="#url#" method="get" resolveurl="true"></cfhttp>
>         <cfset ptr = 1>
>         <cfloop condition="ptr LT len(cfhttp.fileContent) AND ptr GT 0">
>             <cfset hit = REFind("(http://[a-zA-Z0-9\.\/\:\-\+\?\&\_\%\=\-]+)", cfhttp.fileContent, ptr, "true")>
>             <cfif hit.pos[1] GT 0>
>                 <cfset link_found = Mid(cfhttp.fileContent, hit.pos[1], hit.len[1])>
>                 <cfif findNoCase("#URL.startURL#", link_found) AND findNoCase(".htm", link_found) AND NOT findNoCase("/tools/", link_found)>
>                     <!--- ListFindNoCase (exact match), not ListContainsNoCase, so URL's that merely contain an already-processed URL don't get skipped --->
>                     <cfif listFindNoCase(processed_urls, link_found) EQ 0>
>                         <cfset to_do_urls = ListAppend(to_do_urls, link_found)>
>                     </cfif>
>                 </cfif>
>                 <cfset ptr = hit.pos[1] + 1>
>             <cfelse>
>                 <cfset ptr = 0>
>             </cfif>
>         </cfloop>
>     </cfloop>
>     </ol>
>     </cfoutput>
>
>     <cfset processed_urls = ListAppend(processed_urls, deDupe(in_process_urls, "text"))>
>     <cfset in_process_urls = deDupe(to_do_urls, "text")>
>     <cfset loopCount = loopCount + 1>
>     <table>
>     <tr valign="top">
>         <td>Processed URL's:<ol><cfoutput><cfloop index="i" list="#processed_urls#" delimiters=","><li>#i#</li></cfloop></cfoutput></ol></td>
>         <td>To Do URL's:<ol><cfoutput><cfloop index="i" list="#in_process_urls#" delimiters=","><li>#i#</li></cfloop></cfoutput></ol></td>
>     </tr>
>     </table>
> </cfloop>
>
> <h3>Spidering complete.</h3>
> <cfset preloadFinish = getTickCount()>
> <cfset preloadTime = preloadFinish - preloadStart>
> <cfset preloadTimeSec = preloadtime / 1000>
> <cfset preLoadTimeMin = preLoadTimeSec / 60>
> <cfset secMod = (preLoadTimeSec mod 60)>
>
> <cfoutput>#listLen(processed_urls)# URL's processed in
> <cfif preloadTimeSec LT 60>#msToSec(preLoadTime)# Seconds
> <cfelse>#numberFormat(preloadTimeMin, "9999")# Minutes and #numberFormat(secMod, "9999.9")# Seconds.</cfif>
> <a href="?">Return</a>.</cfoutput>
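>
> The application-variable trick mentioned above might look something like
> this in Application.cfm (just a sketch -- "application.cachePrimed" and
> the /tools/spider.cfm path are made-up names, so adjust them for your own
> setup):
>
> <cflock scope="APPLICATION" type="exclusive" timeout="10">
>     <cfif NOT isDefined("application.cachePrimed")>
>         <!--- first request after a restart: kick off the spider --->
>         <cfhttp url="http://localhost/tools/spider.cfm?startURL=http://localhost/"
>                 method="get"></cfhttp>
>         <cfset application.cachePrimed = true>
>     </cfif>
> </cflock>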
>
>
>
>
>
>
> ----- Original Message -----
> From: "Robert Everland" <[EMAIL PROTECTED]>
> To: "CF-Talk" <[EMAIL PROTECTED]>
> Sent: Monday, June 17, 2002 2:40 PM
> Subject: RE: CFMX Spidering for cache
>
>
> > Ehhhh, I was hoping for there to be a CF solution, because if the
> > server reboots I now have to rely on something external to make sure
> > no one gets a slow application.
> >
> > Robert Everland III
> > Web Developer Extraordinaire
> > Dixon Ticonderoga Company
> > http://www.dixonusa.com
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, June 17, 2002 2:40 PM
> > To: CF-Talk
> > Subject: RE: CFMX Spidering for cache
> >
> >
> > Robert, while not written or designed for the task, I use a product
> > called Black Widow. It is a site grabber, but it works very well at
> > doing exactly what you want to do, and if I remember right the price
> > was right around $30 when I bought my copy of it...
> >
> > HTH,
> > John
> >
> > -----Original Message-----
> > From: Robert Everland [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, June 17, 2002 2:29 PM
> > To: CF-Talk
> > Subject: CFMX Spidering for cache
> >
> >
> > Does anyone know if there is a version of CFMX that offers a spider or
> > a way to compile the webpages so that there isn't a huge latency when
> > someone goes to the site for the first time?
> >
> > Robert Everland III
> > Web Developer Extraordinaire
> > Dixon Ticonderoga Company
> > http://www.dixonusa.com
> >
> >
> >
>
>
______________________________________________________________________
This list and all House of Fusion resources hosted by CFHosting.com. The place for
dependable ColdFusion Hosting.
FAQ: http://www.thenetprofits.co.uk/coldfusion/faq
Archives: http://www.mail-archive.com/[email protected]/
Unsubscribe: http://www.houseoffusion.com/index.cfm?sidebar=lists