On Wed, 23 Mar 2005 09:12:50 -0800, John X <[EMAIL PROTECTED]> wrote: [...]
> We already have text.jsp (view as text). It can be made into "view as html".
> However, to do so, we need to introduce new ParseHTML (better ParseXML)
> or convert ParseText into one?

I think this is the best approach -- in fact, I think some sort of intermediate format (XHTML?) that's easily parsed would be perfect. Any sort of XML format should preserve formatting and be straightforward to generate from PDFs, and would be known to be parseable, so further transformation could occur without the messiness of dealing with wild HTML.

It might also be good to trim out <script> and <object> elements to prevent malicious content, and possibly to clip style information as well. Given this "reduced HTML", it would be easy to display results on small/limited browsers like those in Web-enabled phones.

I don't think we want to get into general-purpose transcoding, but the ability to add scripts to transform content from specific sites would also be cool -- similar to a server-side version of Greasemonkey. [http://greasemonkey.mozdev.org/]

Ken

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
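
For illustration, the "reduced HTML" step discussed in the message above might be sketched in Java roughly as follows. This is only a sketch under assumptions: the page has already been converted to well-formed XHTML without a DOCTYPE, and the names ReducedHtml, reduce, and DROPPED are hypothetical rather than part of Nutch. It uses only the JDK's built-in XML APIs and does not address other risky content such as event-handler attributes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class.
final class ReducedHtml {
    // Elements removed together with their content (assumed policy from the discussion).
    private static final Set<String> DROPPED = Set.of("script", "object", "style");

    // Parses well-formed XHTML and returns it with the dropped elements
    // and all inline style attributes removed.
    static String reduce(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations so the parser never fetches external DTDs.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        strip(doc.getDocumentElement());

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        // Force XML output; a root element named "html" may otherwise select HTML output.
        serializer.setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    private static void strip(Element element) {
        element.removeAttribute("style");
        // Snapshot the children: a NodeList is live, so removing while iterating skips nodes.
        NodeList children = element.getChildNodes();
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            snapshot.add(children.item(i));
        }
        for (Node child : snapshot) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element childElement = (Element) child;
            if (DROPPED.contains(childElement.getTagName().toLowerCase(Locale.ROOT))) {
                element.removeChild(childElement);
            } else {
                strip(childElement);
            }
        }
    }
}
```

Because the input is known to be parseable XML, the filtering reduces to a plain DOM walk; the same walk could be swapped for an XSLT stylesheet if per-site transformations are wanted later.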
