Re: Web aggregation/scraping

sbelt Fri, 10 Nov 2000 10:49:44 -0800
Actually, this is what I have been trying to accomplish for my company.

* A few months ago, I posted a Portlet that would extract everything
between the <BODY> and </BODY> tags and update all the href=, src=, action=
attributes to make the links work (converting relative links to absolute
links). I'll see if I can find it in the archives.

* Just yesterday I posted code to send the logged-in userid/password to
a URL to retrieve user-specific data. But here is the wrinkle: I log into
Jetspeed as sbelt / password. I know, however, that sbelt is already "taken"
at Yahoo. There are other messages - in fact just in the last couple of days -
where people are discussing the ability to set parameters for a portlet.
That would be the code which would allow you to specify the id/password you
will use to access the URL.
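
For the link rewriting mentioned in the first item, here is a minimal sketch
of one way it could be done. It is not the Portlet code from the archives; the
class name LinkRewriter and the method absolutize are made up for
illustration, and it assumes attribute values are double-quoted. It resolves
each href=, src= and action= value against the page's own URL with
java.net.URI.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jetspeed.
final class LinkRewriter {
    // href=, src= or action= followed by a double-quoted value (assumption:
    // unquoted or single-quoted values are left alone).
    private static final Pattern LINK_ATTR = Pattern.compile(
            "\\b(href|src|action)\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static String absolutize(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Matcher m = LINK_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(2);
            String resolved;
            try {
                // Relative values are resolved against the base;
                // absolute values come back unchanged.
                resolved = base.resolve(value).toString();
            } catch (IllegalArgumentException e) {
                resolved = value; // not a parsable URI, keep the original text
            }
            String attr = m.group(1) + "=\"" + resolved + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(attr));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A regular expression is used here instead of an HTML parser for the same
reason given further down: most real pages are not "clean" enough to parse.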

In the short term, my plan is to add to my portlet a parameter value which
will be appended to the Turbine userid before sending it to the target URL,
i.e., sbelt->sbelt19660129 (which is very likely to be unique).
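
As a rough sketch of that short-term idea (the class and method names are
hypothetical, and how the suffix parameter reaches the portlet is left open):

```java
// Hypothetical helper, not part of Jetspeed or Turbine.
final class RemoteUserId {
    // Builds the id sent to the target URL from the portal userid plus an
    // optional per-portlet suffix parameter.
    static String forTarget(String portalUserId, String suffix) {
        if (suffix == null || suffix.isEmpty()) {
            return portalUserId; // no parameter configured: send the id as-is
        }
        return portalUserId + suffix;
    }
}
```

With the values above, forTarget("sbelt", "19660129") gives "sbelt19660129".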

What I have not done is code anything to pull a subset of the page
delivered by the target URL. The XML standard under the subject of XLink
defines the ability to do this. Unfortunately, if the target URL is not
"clean" HTML, it will not work. (In fact, my code which extracted the
content between <BODY> tags used string searches, as most of the pages I
found were not "clean".)
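
To illustrate the string-search approach, here is a minimal sketch (not the
code I posted; BodyExtractor and its methods are names invented for this
example). It looks for the tags case-insensitively and falls back to the
start or end of the page when a tag is missing, which is common on pages that
are not "clean".

```java
// Hypothetical helper, not part of Jetspeed.
final class BodyExtractor {
    // First case-insensitive occurrence of needle at or after from, or -1.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    // Last case-insensitive occurrence of needle, or -1.
    static int lastIndexOfIgnoreCase(String text, String needle) {
        for (int i = text.length() - needle.length(); i >= 0; i--) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    // Returns the text between the opening <BODY ...> tag and the last
    // </BODY tag. A missing opening tag means "start of page"; a missing
    // closing tag means "end of page".
    static String extractBody(String html) {
        int start = 0;
        int open = indexOfIgnoreCase(html, "<body", 0);
        if (open >= 0) {
            int gt = html.indexOf('>', open); // skip attributes such as bgcolor
            start = (gt >= 0) ? gt + 1 : html.length();
        }
        int end = lastIndexOfIgnoreCase(html, "</body");
        if (end < start) end = html.length();
        return html.substring(start, end);
    }
}
```

The search for "<body" is deliberately crude: it would also match any other
tag whose name starts with "body", which is a trade-off of not parsing.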

I'll be interested to see where you go with this. Let me know if I can help.

Steve B.

----- Original Message -----
From: "Diethelm Guallar, Gonzalo" <[EMAIL PROTECTED]>
To: "'JetSpeed'" <[EMAIL PROTECTED]>
Sent: Friday, November 10, 2000 8:11 AM
Subject: Web aggregation/scraping


> Hello,
>
> I have been looking into ways of doing web page scraping.
> If there is partial or complete overlap with previous
> discussions, please excuse me, it is due to my poor and
> partial understanding of this subject.
>
> Basically, page scraping means integrating information from
> different web pages into one page (sounds like Jetspeed?).
> The canonical example is, say you have several different
> web mail accounts (yahoo, hotmail, mail, etc.). Using
> web page scraping, you could create a single consolidated
> page that presents to you all the messages from all the
> mail accounts. This implies several things:
>
> * You transparently log onto each mail service, with
>   a potentially different log on protocol.
> * You programmatically navigate to the page with the
>   mail messages for each service, process ("scrape")
>   that page looking for messages, and integrate them
>   into your consolidated page, eventually changing the
>   content, look and feel and general formatting of the
>   original page.
> * You translate any URLs or references on the fly, so that
>   the links from your consolidated page still work.
> * Eventually, you interact with your consolidated page
>   (say, you reply to a message) and that in turn triggers
>   a new programmatic interaction with the mail service
>   that achieves the intended purpose (i.e. it sends
>   the reply).
>
> I have the impression that there is at least a level
> of overlap between these requirements and what Jetspeed
> provides (or will provide); is this correct? Is this
> one of the directions Jetspeed would (eventually) move?
>
> I think there is one piece in the page scraping thing
> that is not present today in Jetspeed, which is the tools
> or model you would use to do the actual scraping: how
> do you specify things like:
>
> * On a page with mail messages from yahoo, the From
>   line is contained on the second table in the page,
>   column 3.
> * Strip any content belonging to a form named "foo".
> * etc.
>
> I'm not even sure about all the things you would want
> to do, but these certainly look like a possibility.
> This kind of functionality is provided today by services
> such as yodel-e, and I think it would make an interesting
> addition to Jetspeed. What would be a good model to
> achieve this?
>
> I have been reading a paper about IBM WebEntree, a Java
> component that does this kind of thing. The paper is at
>
>   http://www.research.ibm.com/journal/sj/374/zhao.html
>
> and is dated 1998. Does anybody know anything more about this?
> I was unable to find any other references to it. Does anybody
> know of other (free, open source or commercial) tools
> to do this, especially Java-based?
>
> Thanks for any input, comments and flames (which would
> prove, in the end, my lack of knowledge in the area).
>
>
> --
> Gonzalo A. Diethelm
> [EMAIL PROTECTED]
>
>
> --
> --------------------------------------------------------------
> Please read the FAQ! <http://java.apache.org/faq/>
> To subscribe:        [EMAIL PROTECTED]
> To unsubscribe:      [EMAIL PROTECTED]
> Archives and Other:  <http://marc.theaimsgroup.com/?l=jetspeed>
> Problems?:           [EMAIL PROTECTED]


