Hi Tim,
Actually, I've also worked on a web clipping servlet/portlet, and I came to many of the same conclusions that you and David did.
Basically, the trickiest part was finding a way to make an SSO service available to the portlet, one that would let the user choose from connector "configurations".
Another important part is remembering the FORM options. Often when screen scraping you'll want to "remember" some parameters (such as a stock quote identifier) and save that configuration. You can then look up configurations for the scraping portlet.
I wish I could contribute my code here, but it is something we plan on selling for the moment, so I'm not sure if this will be possible.
One last thing: for the HTML parsing I used the Java port of Tidy (http://www.sf.net/projects/jtidy). Although it has been abandoned for quite some time (the last release was in 2001), it is quite good at building a DOM from even very bad HTML. Unfortunately, the DOM it creates is not very standard and has problems once you try to modify it. I tried to get into the code to fix this, but it's quite a complicated parser. Still, I think the license is freer than the LGPL, so it might still be interesting.
Regards, Serge Huber.
At 08:34 AM 8/8/2003 -0400, you wrote:
[Sorry, another long email]
Here is a url for the attachment http://66.40.167.80/portal/CollaborationWPS.pdf
>what are the differences between WebPagePortlet and WebPagePortlet2?
Perhaps David has more on the vision (David initiated the development of WebPagePortlet2), but here is the requirement that got me interested in this. Our business requirement at the time was to use a portlet as a proxy to another web application that was only available from within our firewall. To retrieve this content and handle successive requests (links/pages in the content, images, etc.), we thought about creating a portlet and servlet combination that would retrieve the protected content on behalf of the portal user. [We're now re-architecting to use Tivoli Access Manager and Edge Server (if our APARs are resolved, this initial requirement will go away; but from a personal view I am still interested in continuing with the portlet)]
So I saw David's WebPagePortlet2 while I was searching to see if anyone had done this already. He and I talked about using commons-httpclient because it provides so many important features over java.net.URL (basic auth, NTLM, proxy configuration, proxy authorization (per user/request, not just for the JVM), following redirects, and a host of other important HTTP library functionality).
Here is part of an email that should help with what WebPagePortlet2 is:
[snip'd]
>> I've compiled what I think are the basic requirements for the
>> WebPageService by looking at what you've done, as well as thinking
>> about how I plan to use this with WPS (WebSphere Portal Server).
>> Here's what I've come up with:
>>
>> Purpose: This servlet will proxy web requests on behalf of the user to
>> provide portal integration of external web resources from within the
>> portal framework. The benefit is to maintain continuity of the user
>> experience, such that the user does not have to leave the portal site
>> in order to view external pages/resources.
>>
>> Technical Requirements:
>> 1) System must provide the capability to pass along the client
>> browser's accept header
>> 2) System must provide the capability to pass along the client
>> browser's user-agent
>> 3) System must provide the capability to pass along the client
>> browser's accept-language
>
>Hopefully the webpage service is doing that now... if not it must do it
>
>> 4) System must provide the ability to proxy subsequent requests
>> requiring the manipulation of links, and external uri's back to the
>> intercepting servlet to continue the session by proxy.
>> 5) System must provide the capability to proxy and maintain
>> http-referer header on behalf of the user.
>> 6) System must provide the capability to proxy and maintain the cookie
>> header on behalf of the user.
>> 7) System must provide the capability to make requests via an http
>> proxy server
>> 8) System must provide the capability to support http GET and http POST
>> requests
>> 9) System must provide the ability to make connections via SSL as
>> needed.
>
>currently not supported but very much needed
>
>> 10) System must provide a means to sign-on to secured websites via
>> Basic Auth.
>
>The WebPagePortlet supports it in a very simple solution
>I'd like to add that capability to WebPagePortlet2
>Btw - I prefer to merge all features in WebPagePortlet into
>WebPagePortlet2, if that's alright...
>
>> 11) System must provide a means to sign-on to secured websites via
>> NTLM.
>
>Have you seen the recent commit from Mark on a new AccessController for
>NTLM?
>
>> 12) System must provide a means to sign-on to secured websites via
>> Form/Cookie based authentication.
>
>Should be supported
>
>> 13) System must provide the ability to cache-remote web resources for
>> subsequent access
>
>It is supported, but needs some work (
>
>> Let me know how this looks (and if I've got the right idea, missing
>> anything, etc.)
>
>Sounds great
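Requirement 4 in the list above (sending rewritten links back through the intercepting servlet) could be sketched roughly as follows. This is only an illustration using the JDK: the class name ProxyLinkRewriter, the "url" query parameter, and the /portal/proxy path are assumptions made for the example, not anything from WebPagePortlet2, and a regular expression over double-quoted href/src attributes is a stand-in for a real HTML parser.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites href/src values so that follow-up requests
// go back through a proxy servlet instead of directly to the remote host.
final class ProxyLinkRewriter {
    // Only handles double-quoted href/src attributes (a deliberate simplification).
    private static final Pattern LINK =
            Pattern.compile("(href|src)=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static String rewrite(String html, String pageUrl, String proxyPath) {
        URI base = URI.create(pageUrl);
        Matcher m = LINK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                // Resolve relative links against the page they came from.
                String absolute = base.resolve(m.group(2)).toString();
                replacement = m.group(1) + "=\"" + proxyPath + "?url=" + encode(absolute) + "\"";
            } catch (IllegalArgumentException e) {
                // Malformed link: leave the attribute exactly as it was.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<a href=\"/news/a.html\">x</a>",
                "http://intranet.example/home/index.html", "/portal/proxy"));
    }
}
```

The servlet on the receiving end would decode the url parameter and fetch that address on the user's behalf; that half is not shown here.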
A few things became apparent as I was working on this:
a) The "service" to retrieve the content would need to be decoupled from any rewriter and from the portlet. David suggested that as well, based on some early work I'd sent.
b) The possible combinations of how the proxying is used and configured are too many for a single portlet to cover. For example, the remote site credentials could be one set for all portal users, or the login username/password, or some user-specified values; double that, since the same applies to proxy credentials. Then add in the fact that, since there is overhead, someone may want to follow all links through the proxying action, or a certain level of links (sort of wget style), or only the initial page, or only links to that host. The complexity of options seemed to grow exponentially as I thought about the scenarios to cover, and it is too much to capture every possibility in a single portlet and portlet configuration. (This reinforces the need to decouple the service.)
c) A credential bank in Jetspeed would be really nice.
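To make the link-following options in point b) concrete, here is a minimal sketch of how a decoupled service might take the policy as configuration rather than hard-coding it in one portlet. The names FollowPolicy, Mode, and shouldProxy are hypothetical and chosen only for this example; credentials are a separate axis and are not modeled.

```java
import java.net.URI;

// Hypothetical policy object: decides whether a link found at a given
// depth should be fetched through the proxy or left alone.
final class FollowPolicy {
    enum Mode { INITIAL_PAGE_ONLY, SAME_HOST_ONLY, MAX_DEPTH, ALL_LINKS }

    private final Mode mode;
    private final int maxDepth; // only consulted when mode == MAX_DEPTH

    FollowPolicy(Mode mode, int maxDepth) {
        this.mode = mode;
        this.maxDepth = maxDepth;
    }

    // depth 0 is the initial page; links found on it are depth 1, and so on.
    boolean shouldProxy(URI startPage, URI link, int depth) {
        switch (mode) {
            case INITIAL_PAGE_ONLY:
                return depth == 0;
            case SAME_HOST_ONLY:
                String host = startPage.getHost();
                return host != null && host.equalsIgnoreCase(link.getHost());
            case MAX_DEPTH:
                return depth <= maxDepth;
            default:
                return true; // ALL_LINKS
        }
    }

    public static void main(String[] args) {
        FollowPolicy policy = new FollowPolicy(Mode.MAX_DEPTH, 2);
        URI start = URI.create("http://intranet.example/home");
        System.out.println(policy.shouldProxy(start, URI.create("http://other.example/a"), 2));
        System.out.println(policy.shouldProxy(start, URI.create("http://other.example/b"), 3));
    }
}
```

Each portlet could then pass in whichever policy it needs, which keeps the option explosion out of the service itself.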
>For TransformerService we probably need a good HTML parser, what do you
>think?
>I've found an HTML parser at http://htmlparser.sourceforge.net/
It looks great, but unfortunately it's licensed under the LGPL. I had wanted to see about using OpenCMS (also LGPL) some time ago, but it turns out that the LGPL is not compatible with the Apache Foundation License (it has to do with the fact that derivative works of LGPL code must also be LGPL or GPL, and the ambiguity around Java import and the "link" terminology in the LGPL. See: http://www.mail-archive.com/[EMAIL PROTECTED]/msg06912.html). I wish the ASF had a better policy to get through this, like asking for an amended license from all source works providers that you want to use. There is some great LGPL software that ASF projects are 'deprived of' because of this ambiguity.
[Almost done]
When I was thinking about what I needed to do with the WPS WebPageClipping Portlet, I wanted to clip part of a page, then rewrite some of the JavaScript, and remove another part of the script. So I felt some sort of expression syntax using regular expressions would have been ideal. I had thoughts of a simple XML syntax like:
    <replace findregex="(top)\.\.+" value="parent" backtrace="1">
    <remove regexp="#theregex#">
I think a good HTML parser is probably a must, and working with the HTML DOM is perhaps needed in certain cases? Alternatively, I've done "screen-scraping" in ColdFusion with regular expressions alone. I think there are many ways to do this. In the end, based on your thesis timeline you'll have to choose the scope and features (and then lead into the next iteration ;)
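As a rough illustration of the replace/remove idea above, the rules could be applied in order with java.util.regex. This is a sketch under assumptions: the class ClipRules and its methods are hypothetical, the XML parsing step is left out, and the backtrace attribute is not modeled because its meaning isn't spelled out in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule list: each rule is a regex plus a literal replacement,
// and "remove" is simply a replace with the empty string.
final class ClipRules {
    private static final class Rule {
        final Pattern pattern;
        final String value;

        Rule(String regex, String value) {
            // DOTALL so a rule can span line breaks, e.g. a whole script block.
            this.pattern = Pattern.compile(regex, Pattern.DOTALL);
            this.value = value;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    ClipRules replace(String regex, String value) {
        rules.add(new Rule(regex, value));
        return this;
    }

    ClipRules remove(String regex) {
        return replace(regex, "");
    }

    // Applies every rule in the order it was added.
    String apply(String html) {
        String out = html;
        for (Rule r : rules) {
            // quoteReplacement keeps '$' and '\' in the value literal.
            out = r.pattern.matcher(out).replaceAll(Matcher.quoteReplacement(r.value));
        }
        return out;
    }

    public static void main(String[] args) {
        ClipRules rules = new ClipRules()
                .replace("\\btop\\.", "parent.")
                .remove("<!--.*?-->");
        System.out.println(rules.apply("<a onclick=\"top.location='x'\">go</a><!-- ad -->"));
    }
}
```

Regex-only rewriting like this is brittle on messy markup, which is where the HTML parser and DOM question comes back in.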
[Whew]
-----Original Message-----
From: Marco Mari [mailto:[EMAIL PROTECTED]
Sent: Friday, August 08, 2003 4:40 AM
To: Jetspeed Developers List
Subject: RE: contribute to Jetspeed
Hi Tim,
Thank you for your help and explanations! At the moment my (very simple) Web clipping portlet extends WebPagePortlet. What are the differences between WebPagePortlet and WebPagePortlet2?
> Services
>   WebPageService / RemoteService
>   TransformerService (the web clipping and transform for another device
>   service)
> Portlets
>   WebClip Portlet
>   WebPagePortlet2
I think it's a good vision, and I want to contribute, but I am "new to Java" too. For TransformerService we probably need a good HTML parser, what do you think? I've found an HTML parser at http://htmlparser.sourceforge.net/ Can you send me your PDF design? I didn't receive it (mail server problems?). Thanks,
Marco
>-- Original Message --
>Reply-To: "Jetspeed Developers List" <[EMAIL PROTECTED]>
>From: "Tim Reilly" <[EMAIL PROTECTED]>
>To: "Jetspeed Developers List" <[EMAIL PROTECTED]>
>Subject: RE: contribute to Jetspeed
>Date: Thu, 7 Aug 2003 21:02:01 -0400
>
>
>Hi Marco,
I say go for it! I've personally had bad luck with WPS WebClipping, and I actually considered writing one because at the time I needed greater control than what I had or knew. I suppose I could have dug into the transcoder stuff... but by that time the requirements had changed. I'm not knocking the WPS portlet, just pointing out the value in having the source: an open source version.
I was working with David Taylor and Josh Hsieh on the concept of a RemoteService interface, and a WebPageService, RemoteConfiguration, and RemoteSession (using commons-httpclient 2.?). Basically, the service would provide "Remote" content services, e.g. http, https (this is the WebPageService); other RemoteService implementations could be WebDAV, FTP, sFTP, etc. The use for the service was/is for WebPagePortlet2. Basically, the WebPageService uses httpclient to manage URL retrieval as well as authentication, http proxy support, session "proxying", and holding a host configuration for the portlet once it has requested and set a configuration, holding the cookies and credentials for a user per host config. I can send you what I had done (anyone else as well). The main reason I contacted David directly about this work, as opposed to on the list, is that I still consider myself "new to java" and want to have David's advice, code review, and whatnot.
When I last left off with it, I had sent Josh and David my .war. I think I scared them off with my code? :[ I didn't hear back. David, I realize how busy you've been (hopefully it wasn't entirely my code), so perhaps David, Josh, Marco, myself, and anyone else can work on this. I'd see this as:
Services
  WebPageService / RemoteService
  TransformerService (the web clipping and transform for another device service)
Portlets
  WebClip Portlet
  WebPagePortlet2
  (many other portlets could use these services)
I've got to imagine David is really tied up with J2? So perhaps others can also provide that code review and guidance if David is tied up.
Send your thoughts. Thanks
Attached is the PDF design I started with... it's out of date, but roughly the same, more or less. I've not gotten into the form-based authentication or the cache, global configuration is no longer there, and the rewriter is up in the portlet (not functioning). At a basic level I've got the manager, service, sessions, session, remotehosts, remotehost, and a portlet that can pull down via NTLM, BasicAuth, SSL, or plain http. The goal with WebPagePortlet2 is to "proxy the user thru" multiple requests, and add several features for authentication etc.
-----Original Message-----
From: Marco Mari [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 07, 2003 8:13 AM
To: Jetspeed Developers List
Subject: contribute to Jetspeed
Hi!
For my master's thesis I'd like to contribute to Jetspeed (I've already posted my cHTML and PDA project). If you are interested, I'm building a Web Clipping portlet (like the IBM or Oracle Web clipping portlets). Or I can build a portlet that adapts the content of a Web page to mobile devices (cHTML, WML, PDAs...). Alternatively, I can contribute to Jetspeed 2, but in that case I have no idea what to do, and I need a job that can give concrete results in a couple of months, because I have to write something in the thesis :)...
Thanks for any suggestion,
Marco Mari
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
- -- --- -----=[ shuber2 at jahia dot com ]=---- --- -- - www.jahia.org : A collaborative source CMS and Portal Server
