Yes, you can do this. I'm working on a similar project
right now. It's easy to do for any given page, but it
is difficult to write generic code. My approach is to
parse a webpage and then run each result through a
series of simple functions, each of which tests for
something desired or not desired. After I have taken
what I want, I run it through an HTML preprocessing
function, and the result can then be used to generate
a web page.
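
Here is a rough sketch of that idea, not my actual scripts. The names
is-link?, is-ad? and keep-wanted are made up for illustration, and the
tests themselves are deliberately naive.

```rebol
rebol []

; Hypothetical tests: each one checks a single item for one thing
; and returns true or none.
is-link?: func [item] [all [tag? item  find item "href="  true]]
is-ad?:   func [item] [all [tag? item  find item "banner"  true]]

; Keep the items that pass every "wanted" test and no "unwanted" test.
; Both test lists are blocks of function values.
keep-wanted: func [markup [block!] wanted [block!] unwanted [block!] /local result ok] [
    result: copy []
    foreach item markup [
        ok: true
        foreach test wanted   [if not test item [ok: false]]
        foreach test unwanted [if test item [ok: false]]
        if ok [append result item]
    ]
    result
]

; Usage: parse a small page, keep link tags, drop anything that looks like an ad.
page: load/markup {<a href="/news/1.html">One</a><a href="/ads/banner.html">Ad</a>}
links: keep-wanted page reduce [:is-link?] reduce [:is-ad?]
```

Adding a new test then means writing one more small function and
listing it in one of the two blocks.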

To make dead links live I generally use split-path on
the main URL, but there are some exceptions. Some sites
require a function just to find the correct URL for
the day.
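
For the common case, a sketch along these lines might work. The helpers
site-root and make-live are hypothetical names, and the code assumes the
page URL includes a path (for example
http://www.news.com/news/index.html) so that split-path has a directory
to return.

```rebol
rebol []

; Hypothetical helper: the part of a URL up to the first "/" after "://".
site-root: func [url [url!] /local str mark] [
    str: form url
    either parse/all str [thru "://" to "/" mark: to end] [
        to-url copy/part str mark
    ][
        url    ; no path part at all, so the URL is already the root
    ]
]

; Hypothetical helper: turn one link found on a page into a complete URL.
make-live: func [page-url [url!] link [string!]] [
    ; Already complete: leave it alone.
    if find link "://" [return to-url link]
    ; Leading "/": relative to the site root.
    if find/match link "/" [return join site-root page-url link]
    ; Otherwise: relative to the directory of the page itself.
    join first split-path page-url link
]

; Usage
page-url: http://www.news.com/news/index.html
rooted:   make-live page-url "/news/0-1006-200-5079991.html?tag=tp_pr"
relative: make-live page-url "story.html"
```

Links buried inside javascript are one of the exceptions this does not
try to handle.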

I tie all these functions together with a master
function, so I only need one function call to
process any given web page.
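
One way such a master function could look, as a sketch only. The names
only-tags, only-links and process-page are invented for this example;
each step takes a markup block and returns a new one.

```rebol
rebol []

; Hypothetical step: keep only the tags, dropping the text between them.
only-tags: func [markup [block!] /local out] [
    out: copy []
    foreach item markup [if tag? item [append out item]]
    out
]

; Hypothetical step: keep only the tags that carry an href attribute.
only-links: func [markup [block!] /local out] [
    out: copy []
    foreach item markup [if all [tag? item  find item "href="] [append out item]]
    out
]

; The master function runs each step in order on the result of the previous one.
process-page: func [markup [block!] steps [block!]] [
    foreach step steps [markup: step markup]
    markup
]

; Usage: one call, with the steps passed in as parameters.
result: process-page load/markup {<p>Hi</p><a href="/x">X</a>} reduce [:only-tags :only-links]
```

The block of steps is the part that changes from site to site, which is
why building it can get complicated.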

Because this function call can be complicated, I'm
working on a tool that makes it easy to examine any
web page and then create the parameters needed to
build the function call.

A good example of this kind of extraction is the website
http://moreover.com. They extract headlines from news
pages and they produce excellent results. To cover
thousands of sites they must have good generic code.

My first website doing this is at
http://www.geocities.com/tamarind_climb
It has a couple of bugs but it has been working fairly
well. The problem is that I have to write a new
page of code for each webpage I want to extract
headlines from, and that's time-consuming. That's why
I'm in the process of rewriting the code to be more
generic.

I think the key idea is to write simple functions that
each do only one thing but, when combined, have the
power to extract and reformat just about anything from
a web page. When and if I get further along with this, I
will make my scripts available.

--- Terry Brownell <[EMAIL PROTECTED]> wrote:
>
> Goal - Reconstruct a previously read webpage prior
> to saving so that all tags are complete URLs
>
> Here's a project some may be interested in
> collaborating on for the good of Reboldom.
>
> The Problem.
>
> When an HTML document is read and then saved, many
> of the tags (src, a href etc) become "dead" due to
> the original page referencing a path to the local
> server directly, like so...
>
> <a href="/news/0-1006-200-5079991.html?tag=tp_pr">
>
> as opposed to the complete URL, thus...
>
> <a href="http://www.news.com/news/0-1006-200-5079991.html?tag=tp_pr">
>
> When the page is then "delivered" outside of its
> domain, the resulting html is marred. This hinders
> webpage manipulation and must not be allowed to
> continue.
>
> The Solution
>
> Now let's say we could replace the "dead" (for lack
> of a proper definition) URLs with "well-formed"
> URLs, what would be some of the advantages?
>
> A few that come to mind include:
>
> - Reading a webpage, removing the javascript that
> "breaks" the page out of frames, then delivering it
> to a frame (sneaky huh?)
> - Removing/Replacing banner ads.
> - Marking up the page with XML on the fly
> - Annotating the page
> - Highlighting key points
> etc.
>
> Now this seems like an easy task, but it's
> deceiving. One may say, "Just insert the domain
> part of the URL into the tags" (see my "been using
> rebol for months, but still green" script below)
> This works for basic sites, but as the HTML gets
> more and more complex, so must the sophistication of
> the function.
>
> For example, some of these "dead" tags get pretty
> wiry... some have a leading "/" and some don't,
> some are embedded into javascript, and many other
> styles.
>
> Is this idea too far-fetched? Am I not seeing the
> forest for the trees? Is there already a solution?
>
> Your thoughts and input are much appreciated.
>
> Terry Brownell
> www.LFReD.com
>
> Below is the "It's Mine Now 1.0"
> (Note: I know this could be written much better, and
> as a minimum made into a function, but it's a start
> from a starter. Feel free to improve. Also I find
> laying the code out into long lines easier to follow
> and debug. Don't ask me why, maybe cuz I'm
> Canadian.)
>
> rebol []
>
> the-domain: to-url ask "What domain?"
> the-markup: load/markup the-domain
>
> ; Check each tag for a "dead" SRC (no "://" in the tag) and, if found, add the domain.
>
> forall the-markup [if all [(type? first the-markup) = tag! found? find first the-markup {src="} not found? find first the-markup "://"] [insert find/tail first the-markup {src="} the-domain]]
>
> ; Do the same for "dead" HREFs. The tag! test keeps plain text from being altered.
>
> the-markup: head the-markup
>
> forall the-markup [if all [(type? first the-markup) = tag! found? find first the-markup {HREF="} not found? find first the-markup "://"] [insert find/tail first the-markup {HREF="} the-domain]]
>
> the-markup: head the-markup
> print the-markup
>
> --
> To unsubscribe from this list, please send an email
> to
> [EMAIL PROTECTED] with "unsubscribe" in the
> subject, without the quotes.
>