At 09:01 PM 5/3/2006, Stuart Sherwood wrote:
I'm wondering what is the best way to convert a large text file to XHTML? Preferably, I'd like the conversion to ignore styles, so the output is clean, semantic markup. I'd rather add my own styling later.
I think it's impossible to say how challenging this would be without knowing anything about the content of the text file. How organized and consistent are the content and styling? What is there for a parser to grab onto? What verbal and stylistic patterns can it orient itself by? And what's the file format?
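To make "patterns a parser can grab onto" concrete, here is a minimal Python sketch. It assumes two conventions that your file may or may not follow: blank lines separate blocks, and a short single-line block in all caps is a heading. The function name text_to_xhtml and the 60-character limit are made up for illustration, so treat this as a starting point rather than a finished converter.

```python
import re
from xml.sax.saxutils import escape


def text_to_xhtml(text):
    """Turn plain text into XHTML <h2>/<p> fragments (hypothetical helper).

    Assumed conventions, not facts about any particular file:
      - blocks are separated by one or more blank lines
      - a single-line, all-caps block of at most 60 characters is a heading
    """
    fragments = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Join wrapped lines and collapse runs of whitespace.
        content = " ".join(block.split())
        if not content:
            continue
        is_heading = (
            "\n" not in block.strip()
            and content.isupper()
            and len(content) <= 60
        )
        tag = "h2" if is_heading else "p"
        # escape() handles &, < and > so the output stays well-formed.
        fragments.append(f"<{tag}>{escape(content)}</{tag}>")
    return "\n".join(fragments)


if __name__ == "__main__":
    sample = "INTRODUCTION\n\nFirst line\nsecond line & more.\n\nAnother paragraph."
    print(text_to_xhtml(sample))
```

If the real file is consistent, each extra rule (lists, subheadings, block quotes) is another small test like is_heading. If it is inconsistent, the exceptions pile up quickly.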
I love writing software that parses human language; it's the most fun of any programming I've done (which probably says something about my geek quotient). Writing a parser for your document is probably going to be practical (cost-effective) only if it will be run repeatedly, say on a document that comes to you with fresh content each month, or on a single document that is truly huge. If this is a small one-off job, it would probably be cheaper to do it by hand -- with the aid of macros perhaps, but not scripting the whole thing.
Any chance you can work with the originators of the document to change the way in which it's put together? That could have an enormous effect on the parsing job, including potentially eliminating it altogether.
Paul
****************************************************** The discussion list for http://webstandardsgroup.org/ See http://webstandardsgroup.org/mail/guidelines.cfm for some hints on posting to the list & getting help ******************************************************
