Hello,

I have been looking into ways of doing web page scraping.
If this overlaps, partially or completely, with previous
discussions, please excuse me; my understanding of the
subject is still limited.

Basically, page scraping means integrating information from
different web pages into one page (sounds like Jetspeed?).
The canonical example: say you have several different
web mail accounts (Yahoo, Hotmail, mail, etc.). Using
web page scraping, you could create a single consolidated
page that shows you all the messages from all the
mail accounts. This implies several things:

* You transparently log on to each mail service, each with
  a potentially different logon protocol.
* You programmatically navigate to the page with the
  mail messages for each service, process ("scrape")
  that page looking for messages, and integrate them
  into your consolidated page, possibly changing the
  content, look and feel, and general formatting of the
  original page.
* You translate any URLs or references on the fly, so that
  the links from your consolidated page still work.
* Optionally, you interact with your consolidated page
  (say, you reply to a message), and that in turn triggers
  a new programmatic interaction with the mail service
  that achieves the intended purpose (i.e. it sends
  the reply).
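
To make the URL translation step more concrete, here is a minimal
sketch of how I imagine it, using only the standard Java library.
The class name LinkRewriter, the method absolutize and the
example.com addresses are all made up for illustration. It only
turns relative links into absolute ones against the original page's
URL; routing the links back through the consolidated page would be
an extra step not shown here. It also assumes double-quoted
attribute values, which real pages will not always have.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jetspeed or any existing tool.
final class LinkRewriter {
    // Simplifying assumption: href/src values are always double-quoted.
    private static final Pattern LINK =
        Pattern.compile("(href|src)=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Resolves every href/src value in html against baseUrl. */
    static String absolutize(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Matcher m = LINK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String target = m.group(2);
            String resolved;
            try {
                resolved = base.resolve(target).toString();
            } catch (IllegalArgumentException e) {
                resolved = target; // not a parsable URI: leave it untouched
            }
            String attr = m.group(1) + "=\"" + resolved + "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(attr));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, with a base of http://mail.example.com/inbox/list.html,
a link to "msg?id=7" would become
http://mail.example.com/inbox/msg?id=7, while links that are
already absolute are left as they are.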

I have the impression that there is at least some
overlap between these requirements and what Jetspeed
provides (or will provide); is this correct? Is this
one of the directions Jetspeed might (eventually) take?

I think there is one piece of page scraping
that is not present in Jetspeed today: the tools
or model you would use to do the actual scraping. How
do you specify things like:

* On a page with mail messages from Yahoo, the From
  line is in the second table on the page,
  column 3.
* Strip any content belonging to a form named "foo".
* etc.
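
As a rough illustration of what such rules might look like in code,
here is a sketch in plain Java. Everything in it (the class
ScrapeRules and the methods column and stripForm) is hypothetical
and not taken from Jetspeed, WebEntree or any other tool. It uses
regular expressions for brevity, so it assumes simple, well-formed
HTML with no nested tables; a real solution would probably want a
proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule helpers; regex-based, so only suitable for simple HTML.
final class ScrapeRules {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern TABLE =
        Pattern.compile("<table\\b[^>]*>(.*?)</table>", FLAGS);
    private static final Pattern ROW =
        Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL =
        Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", FLAGS);

    /** Text of the given column (1-based) in each row of the given table (1-based). */
    static List<String> column(String html, int table, int column) {
        List<String> values = new ArrayList<>();
        Matcher t = TABLE.matcher(html);
        for (int i = 1; t.find(); i++) {
            if (i != table) continue;
            Matcher r = ROW.matcher(t.group(1));
            while (r.find()) {
                Matcher c = CELL.matcher(r.group(1));
                for (int j = 1; c.find(); j++) {
                    if (j == column) {
                        // Drop any inline markup inside the cell.
                        values.add(c.group(1).replaceAll("<[^>]+>", "").trim());
                        break;
                    }
                }
            }
            break; // the requested table has been processed
        }
        return values;
    }

    /** Removes every form whose name attribute equals the given name. */
    static String stripForm(String html, String name) {
        Pattern form = Pattern.compile(
            "<form\\b[^>]*\\bname=\"" + Pattern.quote(name) + "\"[^>]*>.*?</form>",
            FLAGS);
        return form.matcher(html).replaceAll("");
    }
}
```

With this, the first rule above would be something like
ScrapeRules.column(page, 2, 3) and the second
ScrapeRules.stripForm(page, "foo"). The open question is still
whether rules like these should live in code at all, or in some
declarative description per site.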

I'm not even sure about all the things you would want
to do, but these certainly look like possibilities.
This kind of functionality is provided today by services
such as yodel-e, and I think it would make an interesting
addition to Jetspeed. What would be a good model to
achieve this?

I have been reading a paper about IBM WebEntree, a Java
component that does this kind of thing. The paper is at

  http://www.research.ibm.com/journal/sj/374/zhao.html

and is dated 1998. Does anybody know anything more about this?
I was unable to find any other references to it. Does anybody
know of other tools (free, open source or commercial)
that do this, especially Java-based ones?

Thanks for any input, comments and flames (which would
prove, in the end, my lack of knowledge in the area).


-- 
Gonzalo A. Diethelm
[EMAIL PROTECTED]

