RE: CFMX - best way to strip content from html page

Hugo Ahlenius Wed, 11 Feb 2004 10:44:17 -0800

Do you have any control over the sites archived? Are they XHTML?

If yes to the first question: enclose the relevant "main body" in div
tags with a meaningful ID. Then, if the pages are XHTML, you can parse
them as XML. If not, it might still be possible to apply some regexps to
get the content (locate the start and end of the main body).
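
A rough sketch of both routes, in Python rather than CFML just to keep it
short and self-contained. The ID "main-body" and the function names are
made up for illustration; use whatever ID you actually put on the div.
The XML route only works on well-formed XHTML (an undefined entity such
as &nbsp; will also make the parser give up), so the regexp route is the
fallback, and it is only as reliable as the markup it meets.

```python
import re
import xml.etree.ElementTree as ET

# Matches any opening or closing div tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def extract_by_xml(page, div_id="main-body"):
    """Parse the page as XML and return the text inside the marked div.

    Returns None if the page is not well-formed or the div is missing.
    """
    try:
        root = ET.fromstring(page)
    except ET.ParseError:
        return None
    for elem in root.iter():
        # XHTML elements carry a namespace, e.g. "{http://...}div".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "div" and elem.get("id") == div_id:
            # Join all text nodes and collapse runs of whitespace.
            return " ".join("".join(elem.itertext()).split())
    return None


def extract_by_regex(page, div_id="main-body"):
    """Locate the start and end of the marked div with regexps.

    Returns the raw inner markup, or None if the div is not found.
    Nested divs are handled by counting opening and closing tags.
    """
    start = re.search(
        r"<div\b[^>]*\bid\s*=\s*[\"']%s[\"'][^>]*>" % re.escape(div_id),
        page,
        re.IGNORECASE,
    )
    if start is None:
        return None
    depth = 1
    for tag in DIV_TAG.finditer(page, start.end()):
        if tag.group(1):                      # closing </div>
            depth -= 1
            if depth == 0:
                return page[start.end():tag.start()].strip()
        elif not tag.group(0).endswith("/>"):  # opening <div ...>
            depth += 1
    return None  # no matching close tag found


def extract_main_body(page, div_id="main-body"):
    """Try the XML route first, then fall back to the regexp route."""
    text = extract_by_xml(page, div_id)
    if text is not None:
        return text
    return extract_by_regex(page, div_id)
```

Note that the XML route gives you plain text while the regexp route gives
you the inner markup, so the fallback result would still need its tags
stripped before archiving.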

/Hugo

-------------------------------------------------------------
Hugo Ahlenius                  E-Mail: [EMAIL PROTECTED]
Project Officer                Phone:            +46 8 230460
UNEP GRID-Arendal              Fax:              +46 8 230441
Stockholm Office               Mobile:         +46 733 467111
                               WWW:       http://www.grida.no
-------------------------------------------------------------

| -----Original Message-----
| From: Rob Rohan [mailto:[EMAIL PROTECTED]
| Sent: Monday, February 09, 2004 17:18
| To: CF-Talk
| Subject: CFMX - best way to strip content from html page
|
| Hey there hi there ho there,
|
| I was wondering what others have used to strip the content
| out of web pages? I am working on a system that collects
| pages and archives them; however, only the content needs to
| be stored (i.e. not the navigation, images, extra page fodder).
|
| The sites it is archiving are vast, so it would have to be a rather
| generic solution. I have seen this kind of thing before, but
| only for single specific sites. Does anyone know a good
| method to do it generically?
|
| I was leaning toward one of these but I am open to whatever
|
| * run the collected html through tidy (or jtidy) then
| (somehow) use xslt
| * (somehow) use a regular expression on the collected html
|
| if anyone has done this before please let me know of pitfalls
| or recommendations - BTW I have time not money so any pay
| solutions are right out.
|
| Thanks
|
| --
| Vale,
| Rob
|
| Luxuria immodica insaniam creat.
| Sanam formam viatae conservate!
|
| http://www.rohanclan.com
| http://treebeard.sourceforge.net
| http://ashpool.sourceforge.net
|
|
