Thanks very much, by the way! :)
----- Original Message -----
From: Rob Rohan
To: CF-Talk
Sent: Thursday, February 19, 2004 4:02 PM
Subject: Re: Extracting Text From web page
On Thu, 2004-02-19 at 13:48, brobborb wrote:
> Just the text. no HTML stuff :)
I have been working on a project to do just that. I have made some
progress (but it's not perfect yet).
What I have been doing is using cfhttp to get the HTML and save it to a
file, then sending the HTML through JTidy to turn it into XML. I have
been using XSLT to pull out the information, but you can just load it
as a CF XML object instead.
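For the fetch-and-save step, something along these lines should work. It
is only a rough sketch: the URL is a made-up placeholder, and the cache
path is just an assumed location under request.physicalroot, so adjust
both for your own setup.

```cfml
<!--- hypothetical page to fetch; replace with the real URL --->
<cfset pageurl = "http://www.example.com/">
<!--- assumed cache location for the raw HTML --->
<cfset testfile = "#request.physicalroot#\engine\cache\test.html">

<!--- fetch the page; the response body ends up in cfhttp.fileContent --->
<cfhttp url="#pageurl#" method="get" resolveurl="yes">

<!--- save the raw HTML so it can be fed to JTidy as a file --->
<cffile action="write" file="#testfile#" output="#cfhttp.fileContent#">
```

Saving to a file first is mostly a convenience: it lets you look at the
exact HTML that was handed to JTidy when a page fails to convert.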
The only weak point is the HTML->XML step. It works about 70 percent of
the time, depending on how messy the site's HTML is.
If you want to play along:
download JTidy from http://jtidy.sf.net and put the jar in your CF
classpath, then you can use it like so:
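<cfscript>
// path to the HTML file that was saved to disk earlier
testfile ="#request.physicalroot#\engine\cache\test.html";
objtidy = createObject("java","org.w3c.tidy.Tidy");
// ask Tidy to write well-formed XML rather than HTML
objtidy.setXmlOut(true);
objtidy.setWrapSection(true);
objtidy.setWrapScriptlets(true);
//objtidy.setWrapJste(true);
objtidy.setWord2000(true);
objtidy.setTidyMark(true);
objtidy.setQuoteMarks(true);
objtidy.setQuoteNbsp(true);
objtidy.setMakeClean(true);
// numeric entities are safer for xmlParse than named HTML entities
objtidy.setNumEntities(true);
objtidy.setDropFontTags(true);
objtidy.setDropEmptyParas(true);
// note: setXmlTags tells Tidy to treat the input as XML;
// if messy HTML pages fail to parse, try setting this to false
objtidy.setXmlTags(true);
fileis = createObject("java","java.io.FileInputStream");
byteos = createObject("java","java.io.ByteArrayOutputStream");
byteos.init();
fileis.init("#testfile#");
// parseDOM writes the tidied markup to byteos and returns a DOM document
tidyDOM = objtidy.parseDOM(fileis, byteos);
// release the file handle once Tidy is done with it
fileis.close();
</cfscript>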
<br>
Errors: <cfoutput>#objtidy.getParseErrors()#</cfoutput>
<br>
<!--- this should be a ColdFusion XML document object --->
<cfdump var="#xmlparse(byteos.toString())#">
The XSLT is not 100% yet, but if you select //p you'll get the
important text about 90% of the time. I'm still working on it, but this
works more generically than regexes.
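If you skip the XSLT and go the CF XML object route, the //p idea could
look roughly like this. It is a sketch that assumes byteos still holds
the tidied XML from the example above; the variable names xmldoc, paras
and pagetext are just made up for illustration.

```cfml
<cfscript>
// parse Tidy's XML output into a CF XML document object
xmldoc = xmlParse(byteos.toString());
// select every <p> element anywhere in the document
paras = xmlSearch(xmldoc, "//p");
pagetext = "";
for (i = 1; i lte arrayLen(paras); i = i + 1) {
    // xmlText is the element's own text; text inside nested tags
    // such as <b> or <a> would need to be collected separately
    pagetext = pagetext & trim(paras[i].xmlText) & chr(10);
}
</cfscript>
<cfoutput><pre>#htmlEditFormat(pagetext)#</pre></cfoutput>
```

This only works when the HTML->XML step succeeded, so it is worth
checking the parse error count before calling xmlParse.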
--
Vale,
Rob
Luxuria immodica insaniam creat.
Sanam formam viatae conservate!
http://www.rohanclan.com
http://treebeard.sourceforge.net
http://ashpool.sourceforge.net

