If yes on the first question: enclose the relevant "main body" of each
page in div tags with a suitable ID. Then, if the pages are XHTML, you
can parse them as XML and pull out that div. If they are not, it might
still be possible to apply some regexps to locate the start and end of
the main body and extract the content.
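
A rough sketch of both routes, in Python only for illustration (the same
idea carries over to CFMX). The div ID "content" and the function names
are assumptions made up for this example, not anything from the pages in
question. The regex fallback stops at the first closing div, so it only
works when the main body contains no nested divs.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def extract_via_xml(page, div_id="content"):
    """Parse well-formed XHTML and return the text of <div id=div_id>.

    Raises ET.ParseError if the page is not well-formed XML (this also
    happens for undeclared entities such as &nbsp;).
    """
    root = ET.fromstring(page)
    for el in root.iter():
        # Accept the div with or without the XHTML namespace.
        if el.tag in ("div", XHTML_NS + "div") and el.get("id") == div_id:
            return " ".join("".join(el.itertext()).split())
    return None

def extract_via_regex(page, div_id="content"):
    """Fallback for pages that are not XML: locate the start and end of
    the div with a regexp. Assumes no nested divs inside the main body."""
    pattern = re.compile(
        r'<div\b[^>]*\bid=["\']' + re.escape(div_id) + r'["\'][^>]*>(.*?)</div>',
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(page)
    if match is None:
        return None
    # Strip the remaining tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(text.split())

def extract_main_body(page, div_id="content"):
    """Try the XML route first, fall back to the regexp route."""
    try:
        return extract_via_xml(page, div_id)
    except ET.ParseError:
        return extract_via_regex(page, div_id)
```

Calling extract_main_body(page) returns the text of the main body, or
None if no div with that ID is found.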
/Hugo
-------------------------------------------------------------
Hugo Ahlenius E-Mail: [EMAIL PROTECTED]
Project Officer Phone: +46 8 230460
UNEP GRID-Arendal Fax: +46 8 230441
Stockholm Office Mobile: +46 733 467111
WWW: http://www.grida.no
-------------------------------------------------------------
| -----Original Message-----
| From: Rob Rohan [mailto:[EMAIL PROTECTED]
| Sent: Monday, February 09, 2004 17:18
| To: CF-Talk
| Subject: CFMX - best way to strip content from html page
|
| Hey there hi there ho there,
|
| I was wondering what others have used to strip the content
| out of web pages? I am working on a system that collects
| pages and archives them; however, only the content needs to
| be stored (i.e. not the navigation, images, extra page fodder).
|
| The sites it is archiving are vast so it would have to be a rather
| generic solution. I have seen this kind of thing before, but
| only for single specific sites. Does anyone know a good
| method to do it generically?
|
| I was leaning toward one of these but I am open to whatever
|
| * run the collected html through tidy (or jtidy) then
| (somehow) use xslt
| * (somehow) use a regular expression on the collected html
|
| if anyone has done this before please let me know of pitfalls
| or recommendations - BTW I have time not money so any pay
| solutions are right out.
|
| Thanks
|
| --
| Vale,
| Rob
|
| Luxuria immodica insaniam creat.
| Sanam formam viatae conservate!
|
| http://www.rohanclan.com
| http://treebeard.sourceforge.net
| http://ashpool.sourceforge.net
|
|

