This is a bit radical and probably pretty friggin' slow, but I thought I saw
a website that would take a picture of your site. Then maybe you could OCR
the image dynamically?


DRE

-----Original Message-----
From: Rob Rohan [mailto:[EMAIL PROTECTED]
Sent: Monday, February 09, 2004 9:18 AM
To: CF-Talk
Subject: CFMX - best way to strip content from html page

Hey there hi there ho there,

I was wondering what others have used to strip the content out of web
pages? I am working on a system that collects pages and archives them;
however, only the content needs to be stored (i.e. not the navigation,
images, extra page fodder).

The sites it is archiving are vast, so it would have to be a rather generic
solution. I have seen this kind of thing before, but only for single
specific sites. Does anyone know a good method to do it generically?

I was leaning toward one of these but I am open to whatever

* run the collected html through tidy (or jtidy) then (somehow) use xslt
* (somehow) use a regular expression on the collected html
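
For the regular expression option, here is a rough sketch of what that could look like. It is in Python rather than CFMX, and the function name `strip_html` and the list of tags to drop are assumptions made for illustration, not anything from an existing tool. Regexes are fragile on malformed markup, and this cannot tell navigation from content unless the page marks it up (e.g. with a `<nav>` element), so treat it as a starting point only.

```python
import html
import re

# Assumption: these elements never hold archivable content, so their whole
# body is dropped. Adjust the list for the sites being collected.
_DROP_BLOCKS = re.compile(
    r"<(script|style|head|nav|noscript)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
_TAGS = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def strip_html(page: str) -> str:
    """Return the visible text of an HTML page (hypothetical helper)."""
    text = _COMMENTS.sub(" ", page)      # remove HTML comments first
    text = _DROP_BLOCKS.sub(" ", text)   # remove script/style/head/nav blocks
    text = _TAGS.sub(" ", text)          # remove any remaining tags
    text = html.unescape(text)           # turn &amp; etc. back into characters
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = (
        "<html><head><title>T</title></head><body>"
        "<script>var x = 1;</script>"
        "<p>Hello &amp; welcome</p></body></html>"
    )
    print(strip_html(sample))
```

Running the page through tidy (or jtidy) first would give the patterns better-formed input to work with.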

If anyone has done this before, please let me know of pitfalls or
recommendations - BTW I have time not money so any pay solutions are
right out.

Thanks

--
Vale,
Rob

Luxuria immodica insaniam creat.
Sanam formam viatae conservate!

http://www.rohanclan.com
http://treebeard.sourceforge.net
http://ashpool.sourceforge.net