No. Bottom line here is just no. There is no generic way of getting content
off a web page without also getting the non-content (nav, etc.). You can
strip the HTML tags from a page, but that dies against a page that uses CSS
for its navigation, since the nav links are just plain text once the markup
is gone. All you can do is build a specific spider for a specific site type.
There's really no way around it.
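
To see why, here's a minimal sketch (the regex strip is my own
illustration, not code from any real spider):

    // NaiveStrip.java -- strips every tag from an HTML string.
    // With a CSS-navigated page the nav bar is just <a> link text,
    // so "Home About News Contact" survives the strip and gets
    // mixed into the "content".
    public class NaiveStrip {
        public static String stripTags(String html) {
            return html.replaceAll("<[^>]*>", " ")
                       .replaceAll("\\s+", " ")
                       .trim();
        }
    }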
I've got dozens of spiders, and the only thing generic about them is the way
they fetch the content, not how they parse it (the actual regexes).
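
For what it's worth, the per-site part usually boils down to something like
this (the "story" div marker is a made-up example of one site's layout, not
a generic rule):

    import java.util.regex.*;

    // One extractor per site type. The pattern encodes what I know
    // about this one site's markup; it's useless anywhere else.
    public class ExampleSiteSpider {
        private static final Pattern STORY = Pattern.compile(
            "<div id=\"story\">(.*?)</div>", Pattern.DOTALL);

        public static String extractContent(String html) {
            Matcher m = STORY.matcher(html);
            return m.find() ? m.group(1) : null;
        }
    }

Swap the pattern out per site; the fetching and archiving code stays the
same, and that's the only generic bit.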

> Hey there hi there ho there,
>
> I was wondering what others have used to strip the content out of web
> pages? I am working on a system that collects pages and archives them;
> however, only the content needs to be stored (i.e. not the navigation,
> images, extra page fodder).
>
> The sites it is archiving are vast, so it would have to be a rather
> generic solution. I have seen this kind of thing before, but only for
> single, specific sites. Does anyone know a good method to do it
> generically?
>
> I was leaning toward one of these but I am open to whatever
>
> * run the collected html through tidy (or jtidy) then (somehow) use xslt
> * (somehow) use a regular expression on the collected html
>
> if anyone has done this before please let me know of pitfalls or
> recommendations - BTW I have time not money so any pay solutions are
> right out.
>
> Thanks
>
> --
> Vale,
> Rob
>
> Immoderate luxury breeds madness.
> Preserve a sound way of life!
>
> http://www.rohanclan.com
> http://treebeard.sourceforge.net
> http://ashpool.sourceforge.net
>
>