Re: CFMX - best way to strip content from html page

Tyler Clendenin Mon, 09 Feb 2004 10:07:17 -0800

My only recommendation would be a difficult one.  You would have to build your own application that compares the code of the collected pages and strips out everything that is similar between them (you would have to decide on the rules).  This is of course no easy task, and I would probably not tackle something like that in ColdFusion.  As far as web languages are concerned, Perl is probably the best bet, but in reality I would write a separate application to do this sort of thing.
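
To make the comparison idea a bit more concrete, here is a rough sketch in Python (chosen only for illustration, not something from this thread). It assumes you already have several pages collected from the same site, and it uses one possible rule: any line that shows up in most of those pages is template (navigation, footers) and gets dropped. The function name strip_shared_lines and the threshold value are made up for the example, and line-by-line matching is only one of many rules you could pick.

```python
from collections import Counter


def strip_shared_lines(pages, threshold=0.6):
    """Return each page with the lines shared by most pages removed.

    pages: list of HTML strings collected from the same site.
    threshold: hypothetical rule; a line seen in more than this
    fraction of the pages is treated as template, not content.
    """
    # Break each page into stripped, non-empty lines.
    split_pages = [
        [line.strip() for line in page.splitlines() if line.strip()]
        for page in pages
    ]

    # Count how many pages each distinct line occurs in (once per page).
    seen_in = Counter()
    for lines in split_pages:
        seen_in.update(set(lines))

    limit = threshold * len(pages)

    # Keep only the lines that are rare across the collected pages.
    return [
        "\n".join(line for line in lines if seen_in[line] <= limit)
        for lines in split_pages
    ]
```

Given three pages that share a navigation line and a footer line but each carry one unique paragraph, this would return just the unique paragraph for each page. Real sites will need looser matching than exact line equality (for example, running the pages through a tidy step first so the markup lines up), which is where deciding on the rules gets hard.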

Tyler Clendenin
GSL Solutions
  ----- Original Message -----
  From: Rob Rohan
  To: CF-Talk
  Sent: Monday, February 09, 2004 11:17 AM
  Subject: CFMX - best way to strip content from html page

  Hey there hi there ho there,

  I was wondering what others have used to strip the content out of web
  pages? I am working on a system that collects pages and archives them;
  however, only the content needs to be stored (i.e. not the navigation,
  images, extra page fodder).

  The sites it is archiving are vast, so it would have to be a rather generic
  solution. I have seen this kind of thing before, but only for single
  specific sites. Does anyone know a good method to do it generically?

  I was leaning toward one of these but I am open to whatever

  * run the collected html through tidy (or jtidy) then (somehow) use xslt
  * (somehow) use a regular expression on the collected html

  if anyone has done this before please let me know of pitfalls or
  recommendations - BTW I have time not money so any pay solutions are
  right out.

  Thanks

  --
  Vale,
  Rob

  Luxuria immodica insaniam creat.
  Sanam formam viatae conservate!

  http://www.rohanclan.com
  http://treebeard.sourceforge.net
  http://ashpool.sourceforge.net
