[REBOL] Re: The "It's Mine Now and I'll Do What I Want With It" Project Proposal

Scott Israel Sat, 10 Mar 2001 11:17:20 -0800
Yes, you can do this. I'm working on a similar project
right now. It's easy to do for any given page, but it
is difficult to write generic code. My approach is to
parse a webpage and then run each result through a
series of simple functions, each of which tests for
something desired or not desired. After I have taken
what I want, I run it through an HTML preprocessing
function, and the result can then be used to generate
a web page.
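
A minimal sketch of that "series of simple functions" idea
follows. It is not the author's code: every function name
here (is-tag?, has-href?, not-ad?, keep-wanted) and the
"doubleclick" ad marker are hypothetical, chosen only to
show how small tests might be chained over the block that
load/markup returns.

```rebol
rebol []

; Hypothetical one-job test functions. Each takes one item
; from a load/markup block and returns true or false.
is-tag?:   func [item] [tag? item]
has-href?: func [item] [found? find item "href="]
not-ad?:   func [item] [not found? find item "doubleclick"]  ; assumed ad marker

keep-wanted: func [
    "Collect the items of a markup block that pass every test."
    markup [block!] "block of strings and tags, as from load/markup"
    tests  [block!] "block of function values, tried in order"
    /local result ok
][
    result: copy []
    foreach item markup [
        ok: true
        foreach test tests [
            ; stop at the first test the item fails
            if not test item [ok: false break]
        ]
        if ok [append/only result item]
    ]
    result
]

; Usage on a small in-memory page (made-up content):
page: load/markup {<a href="/news/a.html">Story</a><img src="/ads/doubleclick.gif">}
probe keep-wanted page reduce [:is-tag? :has-href? :not-ad?]
```

The order of the tests matters in this sketch: is-tag? goes
first so that the later tests only ever see tags.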

To make dead links live, I generally use split-path on
the main URL, but there are some exceptions. Some sites
require a function just to find the correct URL for
the day.
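
As a rough illustration of the split-path approach (a
sketch under assumptions, not the author's script): the
helper name make-live is hypothetical, it assumes an
http-style page URL containing "://", and it does not
handle "../" segments or links buried in javascript.

```rebol
rebol []

; Hypothetical helper, not from the original post.
make-live: func [
    "Turn a relative link into a full URL, given the URL of its page."
    page-url [url!]    "URL the page was read from"
    link     [string!] "value of an href or src attribute"
    /local base root
][
    ; already a complete URL: nothing to do
    if find link "://" [return to-url link]
    either #"/" = pick link 1 [
        ; leading slash: relative to the site root, so keep
        ; only the scheme and host part of the page URL
        root: copy/part page-url any [
            find find/tail page-url "://" "/"
            tail page-url
        ]
        join root link
    ][
        ; no leading slash: relative to the page's directory,
        ; which is the first element split-path returns
        base: first split-path page-url
        join base link
    ]
]

; Usage (the second link is a made-up example):
page: http://www.news.com/news/index.html
print make-live page "/news/0-1006-200-5079991.html?tag=tp_pr"
print make-live page "story2.html"
```

The two branches correspond to the two common kinds of dead
link: those with a leading "/" and those without.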

I tie all these functions together with a master
function so I only need to create one function call to
process any given web page.

Because this function call can be complicated I'm
working on a tool that makes it easy to examine any
web page and then create the parameters needed to 
build the function call.

A good example of this is the website
http://moreover.com. They extract headlines from news
pages and they produce excellent results. To cover
thousands of sites they must have good generic code.

My first website doing this is at
http://www.geocities.com/tamarind_climb
It has a couple of bugs but it has been working fairly
well. The problem with this is I have to write a new
page of code for each webpage I want to extract
headlines from and that's time consuming. That's why
I'm in the process of rewriting the code to be more
generic.

I think the key idea is to write simple functions that
only do one thing but when combined together have the
power to extract and reformat just about anything from
a web page. When and if I get further along in this I
will make my scripts available.


--- Terry Brownell <[EMAIL PROTECTED]> wrote:
> 
> Goal - Reconstruct a previously read webpage prior
> to saving so that all tags are complete URLs 
> 
> Here's a project some may be interested in
> collaborating on for the good of Reboldom.
> 
> The Problem.
> 
> When an HTML document is read and then saved, many
> of the tags (src, a href etc) become "dead" due to
> the original page referencing a path to the local
> server directly, like so...
> 
> <a href="/news/0-1006-200-5079991.html?tag=tp_pr">
> 
> as opposed to the complete URL, thus...
> 
> <a href="http://www.news.com/news/0-1006-200-5079991.html?tag=tp_pr">
> 
> When the page is then "delivered" outside of its
> domain, the resulting html is marred.  This hinders
> webpage manipulation and must not be allowed to
> continue.
> 
> The Solution
> 
> Now let's say we could replace the "dead" (for lack
> of a proper definition) URLs with "well-formed"
> URLs, what would be some of the advantages?
> 
> A few that come to mind include:
> 
> - Reading a webpage, removing the javascript that
> "breaks" the page out of frames, then delivering it
> to a frame (sneaky huh?)
> - Removing/Replacing banner ads.
> - Marking up the page with XML on the fly
> - Annotating the page
> - Highlighting key points
> etc.
> 
> Now this seems like an easy task, but it's
> deceiving.  One may say, "Just insert the domain
> part of the URL into the tags"  (see my "been using
> rebol for months, but still green" script below).
> This works for basic sites, but as the HTML gets
> more and more complex, so must the sophistication
> of the function.
> 
> For example, some of these "dead" tags get pretty
> wirey... some have a leading "/" and some don't,
> some are embedded into javascript,  and many other
> styles.
> 
> Is this idea too far-fetched? Am I not seeing the
> forest for the trees? Is there already a solution?
> 
> Your thoughts and input are much appreciated.
> 
> Terry Brownell
> www.LFReD.com
> 
> Below is the "It's Mine Now 1.0"
> (Note: I know this could be written much better, and
> as a minimum made into a function, but it's a start
> from a starter. Feel free to improve. Also I find
> laying the code out into long lines easier to follow
> and debug. Don't ask me why, maybe cuz I'm
> Canadian.)
> 
> rebol [] 
> 
> the-domain: to-url ask "What domain?" 
> the-markup: load/markup the-domain
> 
> ;The following will check for "dead" SRCs, if true then add the domain
> 
> forall the-markup [if all [(type? first the-markup) = tag! found? find first the-markup {src="} not found? find first the-markup "://"][insert find/tail first the-markup {src="} the-domain]]
> 
> ;The following will check for "dead" HREFs and add the domain if necessary
> ;(tags only, so plain text that mentions HREF=" is left alone)
> 
> the-markup: head the-markup
> 
> forall the-markup [if all [(type? first the-markup) = tag! found? find first the-markup {HREF="} not found? find first the-markup "://"][insert find/tail first the-markup {HREF="} the-domain]]
> 
> the-markup: head the-markup
> print the-markup
> 
> -- 
> To unsubscribe from this list, please send an email
> to
> [EMAIL PROTECTED] with "unsubscribe" in the 
> subject, without the quotes.
> 
