Here's what I posted about a month ago:
I built a setup based on an idea first put forward by Eron Cohen on this list.
Basically, it queries a simple database table containing three columns: ID, URL, and
Title. I then loop over the URLs using <cfhttp> and strip out all the HTML and
JavaScript. I build a query with QueryNew() and add one row per page, holding the
stripped page content. Once I've looped over all of the pages in the database, I have
Verity index that query into the collection. It takes a while (34966ms), so I run it
as a scheduled process late at night. It currently works very well on our intranet. I
don't know how well it would scale. Here's what I use:
<cfinclude template="/cuonline/app_globals.cfm">
<cfquery name="qry_GetFuseactions" datasource="searchindex">
SELECT *
FROM SearchIndex
</cfquery>
<cfset qry_Faker=QueryNew("id,url,title,content,keywords")>
<cfloop query="qry_GetFuseactions">
<cfset str_URL=request.webroot&url>
<!--- Word or Excel file: don't fetch it, index only the title and keywords --->
<cfif FindNoCase(".doc",url) OR FindNoCase(".xls",url)>
<cfscript>
tmp_foo=QueryAddRow(qry_Faker);
tmp_foo=QuerySetCell(qry_Faker,"id",id);
tmp_foo=QuerySetCell(qry_Faker,"url",str_URL);
tmp_foo=QuerySetCell(qry_Faker,"title",title);
tmp_foo=QuerySetCell(qry_Faker,"content","");
tmp_foo=QuerySetCell(qry_Faker,"keywords",keywords);
</cfscript>
<cfelse>
<cfhttp method="get" url="#str_URL#">
<cftry>
<cfscript>
int_FirstTable=FindNoCase("</table>",CFHTTP.FileContent);
int_FirstTable=int_FirstTable+8;
int_NextTable=FindNoCase("</table>",CFHTTP.FileContent,int_FirstTable);
int_NextTable=int_NextTable+8;
int_LastTable=FindNoCase("</table>",CFHTTP.FileContent,int_NextTable);
int_LastTable=int_LastTable+8;
int_StartFooter=FindNoCase("<OBJECT class",CFHTTP.FileContent);
int_EndFooter=FindNoCase("</html>",CFHTTP.FileContent);
int_FirstPart=int_LastTable+8;
int_LastPart=(int_EndFooter+7)-int_StartFooter;
str_FileContent=RemoveChars(CFHTTP.FileContent,1,int_FirstPart);
str_FileContent=REReplaceNoCase(str_FileContent,"<script[^>]*>(.)*</script>","","ALL");
str_FileContent=REReplaceNoCase(str_FileContent,"<[^>]+>","","ALL");
tmp_foo=QueryAddRow(qry_Faker);
tmp_foo=QuerySetCell(qry_Faker,"id",id);
tmp_foo=QuerySetCell(qry_Faker,"url",str_URL);
tmp_foo=QuerySetCell(qry_Faker,"title",title);
tmp_foo=QuerySetCell(qry_Faker,"content",str_FileContent);
tmp_foo=QuerySetCell(qry_Faker,"keywords",keywords);
</cfscript>
<cfcatch type="Any">
<cfmail to="[EMAIL PROTECTED]"
from="[EMAIL PROTECTED]"
subject="Indexer Error">Type: #cfcatch.type#
Message: #cfcatch.message#
Detail: #cfcatch.detail#</cfmail>
</cfcatch>
</cftry>
</cfif>
</cfloop>
<cfindex collection="cuonlineindex" action="purge">
<cfindex collection="cuonlineindex" action="refresh" type="custom"
body="content,keywords" key="id" title="title" custom1="url" query="qry_Faker">
<cfcollection action="optimize" collection="cuonlineindex">
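For what it's worth, here is a rough sketch of how a search page could read the
collection back. It is not part of the indexer above, just an illustration: the form
field form.criteria and the query name qry_Results are made-up names. Since the
indexer stores the page URL in custom1, that is the column to link to:
<!--- Hypothetical search page; form.criteria and qry_Results are assumed names --->
<cfparam name="form.criteria" default="">
<cfif Len(Trim(form.criteria))>
<cfsearch collection="cuonlineindex" name="qry_Results" criteria="#form.criteria#">
<cfoutput query="qry_Results">
<!--- custom1 holds the URL, key holds the ID from the SearchIndex table --->
<a href="#custom1#">#HTMLEditFormat(title)#</a> (score: #score#)<br>
</cfoutput>
</cfif>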