Here's what I posted about a month ago:

I built a setup based on an idea first suggested by Eron Cohen on this list. Basically, 
it queries a simple database table with ID, URL, and Title columns (plus the Keywords 
column that the code below also reads). I then loop over the URLs, fetch each page with 
<cfhttp>, and strip out all the HTML and JavaScript. Each stripped page goes into a row 
of a query object that I build by hand with QueryNew(). Once I've looped over every page 
in the database, I index that query into a Verity collection with <cfindex>. A full run 
takes a while (34966 ms, roughly 35 seconds), so I run it as a scheduled task late at 
night. It currently works very well on our intranet; I don't know how well it would 
scale. Here's what I use:

<cfinclude template="/cuonline/app_globals.cfm">

<cfquery name="qry_GetFuseactions" datasource="searchindex">
        SELECT  *
        FROM            SearchIndex
</cfquery>

<cfset qry_Faker=QueryNew("id,url,title,content,keywords")>

<cfloop query="qry_GetFuseactions">
        <cfset str_URL=request.webroot&url>
        <!--- Word or Excel file: don't fetch it; index only its title and keywords --->
        <cfif FindNoCase(".doc",url) OR FindNoCase(".xls",url)>
                <cfscript>
                        tmp_foo=QueryAddRow(qry_Faker);
                        tmp_foo=QuerySetCell(qry_Faker,"id",id);
                        tmp_foo=QuerySetCell(qry_Faker,"url",str_URL);
                        tmp_foo=QuerySetCell(qry_Faker,"title",title);
                        tmp_foo=QuerySetCell(qry_Faker,"content","");
                        tmp_foo=QuerySetCell(qry_Faker,"keywords",keywords);
                </cfscript>
        <cfelse>
                <cfhttp method="get" url="#str_URL#">
        
                <cftry>
                        <cfscript>
                                
                                // Find the end of the third </table>; the page content starts after it
                                int_FirstTable=FindNoCase("</table>",CFHTTP.FileContent);
                                int_FirstTable=int_FirstTable+8;
                                int_NextTable=FindNoCase("</table>",CFHTTP.FileContent,int_FirstTable);
                                int_NextTable=int_NextTable+8;
                                int_LastTable=FindNoCase("</table>",CFHTTP.FileContent,int_NextTable);
                                int_LastTable=int_LastTable+8;
                                // Footer position (computed here but not used further down)
                                int_StartFooter=FindNoCase("<OBJECT class",CFHTTP.FileContent);
                                int_EndFooter=FindNoCase("</html>",CFHTTP.FileContent);
                                int_FirstPart=int_LastTable+8;
                                int_LastPart=(int_EndFooter+7)-int_StartFooter;

                                // Drop everything from the start of the page through int_FirstPart
                                str_FileContent=RemoveChars(CFHTTP.FileContent,1,int_FirstPart);
                                // Remove script blocks first, then any remaining tags
                                str_FileContent=REReplaceNoCase(str_FileContent,"<script[^>]+>(.)*</script>","","ALL");
                                str_FileContent=REReplaceNoCase(str_FileContent,"<[^>]+>","","ALL");
                                tmp_foo=QueryAddRow(qry_Faker);
                                tmp_foo=QuerySetCell(qry_Faker,"id",id);
                                tmp_foo=QuerySetCell(qry_Faker,"url",str_URL);
                                tmp_foo=QuerySetCell(qry_Faker,"title",title);
                                
                                tmp_foo=QuerySetCell(qry_Faker,"content",str_FileContent);
                                tmp_foo=QuerySetCell(qry_Faker,"keywords",keywords);
                        </cfscript>
                        <cfcatch type="Any">
                                <cfmail to="[EMAIL PROTECTED]" from="[EMAIL PROTECTED]"
                                subject="Indexer Error">Type: #cfcatch.type#
                                Message: #cfcatch.message#
                                Detail: #cfcatch.detail#</cfmail>
                        </cfcatch>
                </cftry>
        </cfif>
</cfloop>

<cfindex collection="cuonlineindex" action="purge">
<cfindex collection="cuonlineindex" action="refresh" type="custom"
        body="content,keywords" key="id" title="title" custom1="url" query="qry_Faker">
<cfcollection action="optimize" collection="cuonlineindex">

