Re: Extracting Text From web page

Rob Rohan Thu, 19 Feb 2004 14:09:25 -0800

On Thu, 2004-02-19 at 13:48, brobborb wrote:
> Just the text. no HTML stuff :)

I have been working on a project to do just that. I have made some
progress (but it's not perfect yet).

What I have been doing is using cfhttp to get the html and save it to a
file, then sending the html through jtidy to turn it into xml. Then I
have been using xslt to get the information, though you can also just
load it as a cf xml object.
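
As a rough sketch of the first step (fetch the page and save it to
disk), something like this should do. The URL is only a placeholder, and
the path mirrors the one used in the jtidy example further down, so
point it wherever you keep your cache files:

```cfml
<!--- placeholder URL, swap in the page you want to scrape --->
<cfset pageUrl = "http://www.example.com/">
<!--- assumed location; same file the jtidy example reads --->
<cfset testfile = "#request.physicalroot#\engine\cache\test.html">

<!--- fetch the page; the body ends up in cfhttp.fileContent --->
<cfhttp url="#pageUrl#" method="get" resolveurl="yes">

<!--- save the raw html so jtidy can read it from disk --->
<cffile action="write" file="#testfile#" output="#cfhttp.fileContent#">
```

You could also skip the file and hand jtidy a stream built from the
string, but writing it out makes it easier to look at what the site
actually sent back.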

The only weak point is the html->xml step. It works about 70 percent of
the time, depending on how wack the site's html is.

If you want to play along, download jtidy:
http://jtidy.sf.net and put it in your cf classpath

then you can use it like so:
<cfscript>
testfile ="#request.physicalroot#\engine\cache\test.html";

objtidy = createObject("java","org.w3c.tidy.Tidy");

objtidy.setXmlOut(true);
objtidy.setWrapSection(true);
objtidy.setWrapScriptlets(true);
//objtidy.setWrapJste(true);
objtidy.setWord2000(true);
objtidy.setTidyMark(true);
objtidy.setQuoteMarks(true);
objtidy.setQuoteNbsp(true);
objtidy.setMakeClean(true);
objtidy.setNumEntities(true);
objtidy.setDropFontTags(true);
objtidy.setDropEmptyParas(true);

objtidy.setXmlTags(true);

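// open the saved html file
fileis = createObject("java","java.io.FileInputStream");
fileis.init(testfile);

// buffer that receives the tidied xml
byteos = createObject("java","java.io.ByteArrayOutputStream");
byteos.init();

// parseDOM writes the cleaned-up markup to byteos and returns a DOM
tidyDOM = objtidy.parseDOM(fileis, byteos);
fileis.close();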
</cfscript>

<br>
Errors: <cfoutput>#objtidy.getParseErrors()#</cfoutput>
<br>


<cfdump var="#xmlparse(byteos.toString())#">
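
If you go the cf xml object route instead of xslt, a minimal sketch of
pulling the paragraph text out could look like this. It assumes byteos
is the stream filled by parseDOM above and that the tidied output has no
default namespace on the root element (if it does, a plain //p will not
match). The names xmlDoc, paras and pageText are just made up for the
example:

```cfml
<cfscript>
// parse the tidied markup into a cf xml object
xmlDoc = xmlParse(byteos.toString());

// xmlSearch returns an array of the elements matching the xpath
paras = xmlSearch(xmlDoc, "//p");

// collect the text of each paragraph, one per line
pageText = "";
for (i = 1; i lte arrayLen(paras); i = i + 1) {
    pageText = pageText & paras[i].xmlText & chr(10);
}
</cfscript>

<pre><cfoutput>#htmlEditFormat(pageText)#</cfoutput></pre>
```

Note that xmlText only holds the text directly inside each p, so
anything wrapped in a nested tag (b, a, and so on) gets left out here.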

The xslt is not 100% yet, but if you get //p then you'll get the
important text 90% of the time... still working on it, but this works
more generically than regexes.
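
For the //p idea, here is a bare-bones stylesheet run through
xmlTransform. Treat it as a sketch, not the stylesheet from my project:
it assumes byteos holds the tidied xml from the code above and that the
p elements are not in a namespace. The variable names are made up for
the example:

```cfml
<!--- stylesheet that prints the text of every p, one per line --->
<cfsavecontent variable="xsl">
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//p">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>
</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
</cfsavecontent>

<!--- xmlTransform takes the xml and the xsl and returns a string --->
<cfset pageText = xmlTransform(byteos.toString(), trim(xsl))>

<pre><cfoutput>#htmlEditFormat(pageText)#</cfoutput></pre>
```

Using normalize-space(.) on each p pulls in the text of nested tags as
well and collapses the stray whitespace jtidy tends to leave behind.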

--
Vale,
Rob

Luxuria immodica insaniam creat.
Sanam formam viatae conservate!

http://www.rohanclan.com
http://treebeard.sourceforge.net
http://ashpool.sourceforge.net
