Dude, whatever happened to simplicity! lol, I was only expecting like 4 or 5 lines of code.

Thanks very much, by the way! :)
  ----- Original Message -----
  From: Rob Rohan
  To: CF-Talk
  Sent: Thursday, February 19, 2004 4:02 PM
  Subject: Re: Extracting Text From web page

  On Thu, 2004-02-19 at 13:48, brobborb wrote:
  > Just the text.  no HTML stuff :)

  I have been working on a project to do just that. I have made some
  progress (but it's not perfect yet).

  What I have been doing is using cfhttp to get the html and save it to a
  file, then sending the html through jtidy to turn it into xml. Then I have
  been using xslt to get the information, though you can just load it as a
  cf xml object instead.
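
  A minimal sketch of the fetch-and-save step, assuming a placeholder url
  and the same cache path used in the jtidy example further down (pageUrl
  is a made-up name, not something from the project):

  ```cfm
  <!--- hypothetical url and cache path; adjust both for your own setup --->
  <cfset pageUrl = "http://www.example.com/">
  <cfset testfile = "#request.physicalroot#\engine\cache\test.html">

  <!--- fetch the page; resolveurl rewrites relative links to absolute ones --->
  <cfhttp url="#pageUrl#" method="get" resolveurl="yes" timeout="30">

  <!--- only cache the body when the server answered 200 --->
  <cfif left(cfhttp.statusCode, 3) eq "200">
    <!--- cfhttp.fileContent holds the response body as a string --->
    <cffile action="write" file="#testfile#" output="#cfhttp.fileContent#">
  <cfelse>
    <cfoutput>Fetch failed: #cfhttp.statusCode#</cfoutput>
  </cfif>
  ```

  Writing to a file first is just to match the FileInputStream used with
  jtidy; anything that hands jtidy an InputStream should work as well.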

  The only weak point is the html->xml step. It works about 70 percent of the
  time - depending on how wack the site's html is.

  If you want to play along:
  download jtidy:
  http://jtidy.sf.net and put it in your cf classpath

  then you can use it like so:
  <cfscript>
  testfile ="#request.physicalroot#\engine\cache\test.html";

  objtidy = createObject("java","org.w3c.tidy.Tidy");

  objtidy.setXmlOut(true);
  objtidy.setWrapSection(true);
  objtidy.setWrapScriptlets(true);
  //objtidy.setWrapJste(true);
  objtidy.setWord2000(true);
  objtidy.setTidyMark(true);
  objtidy.setQuoteMarks(true);
  objtidy.setQuoteNbsp(true);
  objtidy.setMakeClean(true);
  objtidy.setNumEntities(true);
  objtidy.setDropFontTags(true);
  objtidy.setDropEmptyParas(true);

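  // note: setXmlTags(true) tells tidy to treat the *input* as xml, which may
  // be part of why messy html fails; try false if a page will not parse
  objtidy.setXmlTags(true);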

  fileis = createObject("java","java.io.FileInputStream");

  byteos = createObject("java","java.io.ByteArrayOutputStream");
  byteos.init();

  fileis.init("#testfile#");

  tidyDOM = objtidy.parseDOM(fileis, byteos);
  fileis.close(); // release the file handle once tidy is done
  </cfscript>

  <br>
  Errors: <cfoutput>#objtidy.getParseErrors()#</cfoutput>
  <br>

  <!--- this should be a ColdFusion xml structure --->
  <cfdump var="#xmlparse(byteos.toString())#">

  the xslt is not 100% yet, but if you get //p then you'll get the
  important text 90% of the time... still working on it, but this works
  more generically than regexes
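
  A rough sketch of the //p idea without xslt, assuming the jtidy code
  above has already run so byteos holds the tidied xml (xmlDoc and paras
  are illustrative names only):

  ```cfm
  <!--- parse the tidied output into a cf xml object --->
  <cfset xmlDoc = xmlParse(byteos.toString())>

  <!--- xmlSearch returns an array of the element nodes matching the xpath --->
  <cfset paras = xmlSearch(xmlDoc, "//p")>

  <cfoutput>
  <cfloop index="i" from="1" to="#arrayLen(paras)#">
    <!--- xmlText is only the text directly inside the <p>, not text in child tags --->
    #htmlEditFormat(paras[i].xmlText)#<br>
  </cfloop>
  </cfoutput>
  ```

  If the tidied document ends up with a default namespace on the root
  element, plain //p will probably match nothing; //*[local-name()='p']
  is one way around that.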

  --
  Vale,
  Rob

  Luxuria immodica insaniam creat.
  Sanam formam vitae conservate!

  http://www.rohanclan.com
  http://treebeard.sourceforge.net
  http://ashpool.sourceforge.net