Rob,


I just went through this situation while building my new webmail app, and I
have to agree with the others: it's not an easy thing to strip HTML and
leave just the content. I had to use several different functions to get the
job done:


1) Replace all BR and P tags with CR/LFs (thanks to Ben Forta for his
ParagraphFormat2 UDF!)
2) Strip all scripts, applets and objects.
3) Strip all HTML tags.
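
For anyone who wants to see the three steps in one place, here is a rough
sketch. It is in Python rather than CFML, it is regex-based (so it will
miss malformed markup), and the names strip_html, BREAK_TAGS, BLOCK_TAGS
and ANY_TAG are made up for illustration. It is not the ParagraphFormat2
UDF, just the same idea.

```python
import re

# Hypothetical patterns for illustration; a regex sketch, not a real HTML parser.
BREAK_TAGS = re.compile(r"<br\b[^>]*>|</?p\b[^>]*>", re.IGNORECASE)
BLOCK_TAGS = re.compile(
    r"<(script|applet|object)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
ANY_TAG = re.compile(r"<[^>]+>")


def strip_html(html):
    """Apply the three steps in order and return the remaining text."""
    # 1) Replace BR and P tags with CR/LF.
    text = BREAK_TAGS.sub("\r\n", html)
    # 2) Strip scripts, applets and objects, including their contents.
    text = BLOCK_TAGS.sub("", text)
    # 3) Strip whatever tags are left.
    text = ANY_TAG.sub("", text)
    return text
```

The order matters: if step 3 ran before step 2, the script tags would be
gone but the script source would be left behind in the text.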


The most important thing I learned from all of this is that most situations
are unique. Depending on your application, you will need to strip specific
portions of a page. You said that you need a generic solution, but I don't
think that is possible.


The best solution for you would probably be to develop maps for each site.


1) Examine a site and create a set of regexes that, when run in order,
will give you the result you want.
2) Store the maps in a file or db table.
3) Run a query to get the map for the site.
4) Do a CFHTTP to get the initial content.
5) Loop through the map query and do REReplace calls on the content.
6) Save the result.
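
The map idea above might look something like this. Again a Python sketch
rather than CFML: re.sub stands in for REReplace and urlopen for CFHTTP.
The SITE_MAPS dictionary, the example.com rules, and the function names
are all invented for illustration; in practice the rules would come from
your file or db table.

```python
import re
from urllib.request import urlopen

# Hypothetical map store: site name -> ordered list of (pattern, replacement).
SITE_MAPS = {
    "example.com": [
        (r'^.*?<div id="content">', ""),  # drop everything before the content div
        (r"</div>.*$", ""),               # drop everything from the first closing div on
        (r"<[^>]+>", ""),                 # strip the remaining tags
    ],
}

FLAGS = re.IGNORECASE | re.DOTALL


def apply_map(content, rules):
    """Run each rule in order; every rule sees the previous rule's output."""
    for pattern, replacement in rules:
        content = re.sub(pattern, replacement, content, flags=FLAGS)
    return content


def fetch_and_clean(site, url):
    """Look up the map for a site, fetch the page, and apply the map."""
    rules = SITE_MAPS[site]
    with urlopen(url) as response:
        content = response.read().decode("utf-8", errors="replace")
    return apply_map(content, rules)
```

Keeping the rules as data means a site redesign only costs you a change
to that site's map, not to the code that runs it.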


HTH

--

Michael Wolfe
[EMAIL PROTECTED]


  _____  

From: Rob Rohan [mailto:[EMAIL PROTECTED]
Sent: Monday, February 09, 2004 9:21 AM
To: CF-Talk
Subject: Re: CFMX - best way to strip content from html page

On Mon, 2004-02-09 at 09:06, Tyler Clendenin wrote:
>My only recommendation would be difficult.
>You would have to build your own application for comparison of code and
>strip out everything that is similar (you would have to decide on the
>rules).  

Comparison of code? Meaning look at what a typical anchor tag looks
like, typical JavaScript, etc? Ah, so you are suggesting *removing* what
is bad not getting what is good... interesting... I'll muddle that one
over - thanks Tyler

--
Vale,
Rob

Luxuria immodica insaniam creat.
Sanam formam viatae conservate!

http://www.rohanclan.com
http://treebeard.sourceforge.net
http://ashpool.sourceforge.net