There's a pre-built function at CFLib.org that strips HTML tags:
http://cflib.org/udf/stripHTML

You could use a quick regex to grab just the contents of the body tag, then
run that string through the StripHTML function. That leaves you with the
text that was inside HTML tags like <p>, <div>, etc.
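
The two steps above (isolate the body, then strip the tags) can be sketched
roughly as follows. This is only an illustration in Python, not the CFLib
UDF itself: the function name extract_body_text is made up for the example,
and dropping <script>/<style> blocks and decoding entities are extra
assumptions on my part, not something the StripHTML UDF is confirmed to do.
Regex-based stripping is best effort and can be fooled by malformed markup.

```python
import html
import re


def extract_body_text(page: str) -> str:
    """Return the visible text of an HTML page (regex-based, best effort).

    Hypothetical helper for illustration; not part of any library.
    """
    # 1. Keep only what sits between <body ...> and </body>, if present.
    match = re.search(r"<body\b[^>]*>(.*?)</body\s*>", page,
                      re.IGNORECASE | re.DOTALL)
    body = match.group(1) if match else page

    # 2. Assumption: drop <script> and <style> blocks with their contents,
    #    since what they contain is code rather than page text.
    body = re.sub(r"<(script|style)\b[^>]*>.*?</\1\s*>", " ", body,
                  flags=re.IGNORECASE | re.DOTALL)

    # 3. Drop HTML comments, then every remaining tag.
    body = re.sub(r"<!--.*?-->", " ", body, flags=re.DOTALL)
    body = re.sub(r"<[^>]+>", " ", body)

    # 4. Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(html.unescape(body).split())
```

In ColdFusion the input would be cfhttp.filecontent, and the returned string
is what you would hand to the indexer or store in the database.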

At that point you could do whatever you liked with the result.


andy

-----Original Message-----
From: Anthony Webb [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 09, 2008 12:04 PM
To: CF-Talk
Subject: Extract text from webpage content using cfhttp

I need to index web page contents for Verity (or similar) searching.
I'd like to insert just the text that a web page returns, not any of the
other stuff (HTML, JS, CSS, images, etc.).

I noticed that cfhttp.filecontent returns the entire contents of the page.
Does anyone have a good way to get at just the text?

Also, I am storing the results in a MySQL database and was anticipating
using the "text" data type. I assume that is the best way to go?



Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:308821