On Wed, 23 Mar 2005 09:12:50 -0800, John X <[EMAIL PROTECTED]> wrote:
[...]

> We already have text.jsp (view as text). It can be made into "view as html".
> However, to do so, we need to introduce new ParseHTML (better ParseXML)
> or convert ParseText into one?

I think this is the best approach -- in fact, I think some sort of
easily parsed intermediate format (XHTML?) would be perfect.
An XML-based format should preserve formatting and be straightforward
to generate from PDFs. Because it would be known to be well-formed,
further transformation could occur without the messiness of dealing
with wild HTML.

It might also be good to trim out <script> and <object> elements to
prevent malicious content, and possibly to clip style information as
well.  Given this "reduced HTML", it would be easy to display results
on small/limited browsers like those in Web-enabled phones.
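
A minimal sketch of that trimming step, assuming the intermediate
format is already well-formed XHTML without a DOCTYPE. The class name
ReducedHtml and the method reduce are hypothetical and are not existing
Nutch code; only the standard JDK XML APIs are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Nutch): strips active content and
// style information from a well-formed XHTML string.
class ReducedHtml {
    // Elements removed together with everything inside them.
    private static final String[] DROPPED = {"script", "object", "style"};

    static String reduce(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: input carries no DOCTYPE, so reject one outright
            // rather than let the parser fetch external DTDs.
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

            for (String tag : DROPPED) {
                // The NodeList is live: removing item(0) shrinks the list.
                NodeList nodes = doc.getElementsByTagName(tag);
                while (nodes.getLength() > 0) {
                    Node node = nodes.item(0);
                    node.getParentNode().removeChild(node);
                }
            }

            // Drop inline style attributes from every remaining element.
            NodeList all = doc.getElementsByTagName("*");
            for (int i = 0; i < all.getLength(); i++) {
                ((Element) all.item(i)).removeAttribute("style");
            }

            // Serialize as XML; without this, a root <html> element would
            // select the "html" output method by default.
            Transformer out = TransformerFactory.newInstance().newTransformer();
            out.setOutputProperty(OutputKeys.METHOD, "xml");
            out.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            out.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }
}
```

Because the input is known to be parseable, the filter is a plain DOM
walk; no tolerant HTML parser is needed. Event-handler attributes such
as onclick are not handled here and would need the same treatment as
style.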

I don't think we want to get into general-purpose transcoding, but the
ability to add scripts to transform content from specific sites would
also be cool -- similar to a server-side version of Greasemonkey.
[http://greasemonkey.mozdev.org/]

Ken


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
