I do the same in my CMS, except that I first run the MS Word input through wvWare, which converts Word documents to HTML. I then take -that- HTML and throw it through Tidy, which saves a lot of cleaning.
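A rough sketch of that two-step conversion, written in Python purely for illustration (the CMS itself is not shown here). It assumes the `wvHtml` script from wvWare and the `tidy` command-line tool are installed and on the PATH. The function names are hypothetical, and the Tidy options are a small subset borrowed from the settings quoted further down in this message.

```python
import subprocess
from pathlib import Path


def wv_command(doc_path, html_name, target_dir):
    """Build the wvHtml call that converts a Word file to HTML."""
    return ["wvHtml", "--charset=utf-8", f"--targetdir={target_dir}",
            doc_path, html_name]


def tidy_command():
    """Build a tidy call that reads HTML on stdin and writes to stdout."""
    return ["tidy", "-q",
            "--char-encoding", "utf8",
            "--output-xml", "yes",
            "--show-body-only", "yes",
            "--tidy-mark", "no"]


def word_to_clean_html(doc_path, work_dir):
    """Convert a .doc file with wvHtml, then clean the result with tidy."""
    html_name = Path(doc_path).stem + ".html"
    subprocess.run(wv_command(doc_path, html_name, work_dir), check=True)
    raw_html = (Path(work_dir) / html_name).read_bytes()
    # tidy exits with 1 when it only issued warnings, so the exit code is
    # checked by hand instead of using check=True.
    result = subprocess.run(tidy_command(), input=raw_html,
                            capture_output=True)
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

Calling `word_to_clean_html("report.doc", "/tmp/work")` would return the tidied body markup as a string, ready for whatever checks come next.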
After that, I run an XML parser over the result and rebuild the entire content from start to finish, making sure the system strips out everything that's not allowed or wanted. XML is great for that: just parse by tag, by attribute, and by text node (content), and each tiny little bit can be checked against any and all checks you may want to perform. :)

> I use Tidy to do this sort of cleaning. My clients copy and paste their
> texts from Word into TinyMCE, and then the CMS runs Tidy as a PHP module.
>
> http://tidy.sourceforge.net/
>
> I have experimented with different settings for Tidy, and this was what
> I found out to work best in my CMS environment:
>
> 'wrap' => 0,
> 'char-encoding' => 'utf8',
> 'input-encoding' => 'utf8',
> 'output-encoding' => 'utf8',
> 'newline' => 'LF',
> 'doctype' => 'omit',
> 'write-back' => TRUE,
> 'quiet' => TRUE,
> 'indent' => 'auto',
> 'output-xml' => TRUE,
> 'bare' => TRUE,
> 'clean' => TRUE,
> 'logical-emphasis' => TRUE,
> 'drop-proprietary-attributes' => TRUE,
> 'drop-font-tags' => TRUE,
> 'break-before-br' => TRUE,
> 'numeric-entities' => TRUE,
> 'quote-nbsp' => FALSE,
> 'quote-marks' => TRUE,
> 'indent-attributes' => TRUE,
> 'enclose-text' => TRUE,
> 'enclose-block-text' => TRUE,
> 'word-2000' => TRUE,
> 'tidy-mark' => FALSE,
> 'literal-attributes' => TRUE,
> 'show-body-only' => TRUE,
> 'force-output' => TRUE,
> 'ascii-chars' => FALSE,
> 'output-bom' => FALSE
>
> /AndersN
> *********************************************************
> The CMS discussion list for http://webstandardsgroup.org/
> *********************************************************

--
Faruk Ates
Web consultant, designer, developer and project manager
www.kurafire.net - www.mediadesign.nl - www.happyclog.com
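The rebuild step described in the reply above (walk the parsed tree and copy across only what is allowed) could look roughly like the following. This is a minimal Python sketch, not the actual CMS code: the `ALLOWED` and `DROP_WITH_CONTENT` tables and the function names are made up for illustration, and it assumes the input is already well-formed XML, for example Tidy output with `output-xml` and `show-body-only` switched on.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical policy: tag name -> attributes that may stay on that tag.
ALLOWED = {
    "p": set(), "em": set(), "strong": set(), "br": set(),
    "ul": set(), "ol": set(), "li": set(),
    "a": {"href", "title"},
}
# Tags that are thrown away together with everything inside them.
DROP_WITH_CONTENT = {"script", "style"}


def _append_text(parent, text):
    """Add text at the current end of parent's content."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _rebuild(src, dest):
    """Copy the permitted parts of src's content into dest."""
    _append_text(dest, src.text)
    for child in src:
        if child.tag in ALLOWED:
            kept = {k: v for k, v in child.attrib.items()
                    if k in ALLOWED[child.tag]}
            _rebuild(child, ET.SubElement(dest, child.tag, kept))
        elif child.tag not in DROP_WITH_CONTENT:
            # Unknown tag: drop the tag itself but keep its content.
            _rebuild(child, dest)
        _append_text(dest, child.tail)


def sanitize(fragment):
    """Rebuild an XML fragment, keeping only whitelisted tags and attributes."""
    # A throwaway root lets a fragment with several top-level nodes parse.
    src_root = ET.fromstring("<root>" + fragment + "</root>")
    dest_root = ET.Element("root")
    _rebuild(src_root, dest_root)
    parts = [escape(dest_root.text or "")]
    parts += [ET.tostring(child, encoding="unicode") for child in dest_root]
    return "".join(parts)
```

For instance, `sanitize('<p class="MsoNormal">Hello <font face="Arial">big</font> <em>world</em></p>')` gives back the paragraph without the `class` attribute or the `font` wrapper. Attribute values are not inspected here, so a real filter would also want to check things like the URL scheme in `href`.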
