On Friday, August 8, 2003, at 06:24 AM, Serge Huber wrote:
We are trying to address all of the issues above
Hi Tim,
Actually I've also worked on a web clipping servlet/portlet, and I came to a lot of the same conclusions that you and David did.
Basically the trickiest part was to find a way to have an SSO service available to the portlet that would allow the user to choose from connector "configurations".
Also another important part is remembering the FORM options. Often when screen scraping you'll want to "remember" some parameters (such as a stock quote identifier) and save that configuration. You can then lookup configurations for the scraping portlet.
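As a rough illustration of "remembering" FORM options, here is a minimal sketch of a saved configuration that can be looked up by name. All names (ClipConfig, remember, lookup, the example.com URL) are hypothetical and not part of Jetspeed or any existing portlet. It assumes a GET form and a target URL with no existing query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a named scraping configuration with remembered FORM fields.
public class ClipConfig {
    // In-memory store of saved configurations, keyed by name (a real portlet
    // would presumably persist these per user).
    private static final Map<String, ClipConfig> SAVED = new HashMap<>();

    private final String name;
    private final String targetUrl;
    // Remembered FORM fields, kept in insertion order.
    private final Map<String, String> formParams = new LinkedHashMap<>();

    public ClipConfig(String name, String targetUrl) {
        this.name = name;
        this.targetUrl = targetUrl;
    }

    // Remember one FORM field, e.g. a stock quote identifier.
    public ClipConfig remember(String field, String value) {
        formParams.put(field, value);
        return this;
    }

    public void save() {
        SAVED.put(name, this);
    }

    // Returns the saved configuration, or null if none has that name.
    public static ClipConfig lookup(String name) {
        return SAVED.get(name);
    }

    // Build the GET request URL from the remembered fields.
    public String toRequestUrl() throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(targetUrl);
        char sep = '?';
        for (Map.Entry<String, String> e : formParams.entrySet()) {
            url.append(sep)
               .append(URLEncoder.encode(e.getKey(), "UTF-8"))
               .append('=')
               .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            sep = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        new ClipConfig("acme-quote", "http://example.com/quote")
                .remember("symbol", "ACME")
                .save();
        System.out.println(ClipConfig.lookup("acme-quote").toRequestUrl());
    }
}
```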
I wish I could contribute my code here, but it is something we plan on selling for the moment, so I'm not sure if this will be possible.
That's alright. Someone with an open source philosophy will implement it eventually.
One last thing: for the HTML parsing I used the Java port of Tidy (http://www.sf.net/projects/jtidy). Although it's been abandoned for quite some time (2001 was the last release), it is quite good at building a DOM of even some very bad HTML. Unfortunately the DOM created is not very standard and has problems once you try to modify it. I tried to get into the code to fix this, but it's quite a complicated parser. But I think the license is freer than the LGPL so it might still be interesting.
I don't recommend using a DOM-based rewriter for performance reasons.
The rewriter in jetspeed is event-driven and best fits over SAX-based parsers.
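To illustrate the event-driven style in general terms, here is a minimal sketch of a link rewriter built on the standard JAXP SAX API. This is not the Jetspeed rewriter. The class name SaxLinkRewriter and the example.com URLs are made up for the example, and it assumes the input is already well-formed XHTML (a plain SAX parser will reject tag-soup HTML). Output is written as each event arrives, so no tree is ever built in memory.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: rewrites <a href="..."> to absolute URLs as SAX events stream by.
public class SaxLinkRewriter extends DefaultHandler {
    private final URI base;
    private final StringBuilder out = new StringBuilder();

    public SaxLinkRewriter(String baseUrl) {
        this.base = URI.create(baseUrl);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            String value = atts.getValue(i);
            // Resolve anchor targets against the base URL of the scraped page.
            if ("a".equals(qName) && "href".equals(name)) {
                value = base.resolve(value).toString();
            }
            out.append(' ').append(name).append("=\"").append(escape(value)).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Note: empty elements such as <br/> come out as <br></br>.
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    // Re-escape characters that the parser has already decoded.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static String rewrite(String xhtml, String baseUrl) throws Exception {
        SaxLinkRewriter handler = new SaxLinkRewriter(baseUrl);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rewrite("<p><a href=\"/quote?id=1\">q</a></p>",
                "http://example.com/app/"));
    }
}
```

A DOM-based version would have to build the whole document tree before touching the first link, which is the overhead I'm referring to.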
-- David Sean Taylor Bluesunrise Software [EMAIL PROTECTED] +01 707 773-4646 +01 707 529 9194
