I do the same in my CMS, except that I first toss the MS Word input
through wvWare, which converts Word documents to HTML. I then take -that-
HTML and throw it through Tidy, which saves a lot of cleaning.
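
To make the two steps concrete, here is a rough sketch of that chain in
Python (my CMS language isn't the point here, so Python is just for
illustration). It assumes the wvHtml wrapper from wvWare and the tidy
command-line tool are installed and on the PATH; the function names and
the raw.html / clean.html file names are made up for the example.

```python
import subprocess
from pathlib import Path


def wvhtml_command(doc: Path, html: Path) -> list[str]:
    # wvHtml is the HTML wrapper script shipped with wvWare. It writes the
    # output file into --targetdir, so pass the directory and name apart.
    return ["wvHtml", f"--targetdir={html.parent}", str(doc), html.name]


def tidy_command(raw_html: Path, clean_xhtml: Path) -> list[str]:
    # -asxml makes Tidy emit well-formed XHTML, which the later XML parse
    # needs. The two "--option value" pairs mirror options from the quoted
    # settings list; the rest of that list is left out for brevity.
    return [
        "tidy", "-quiet", "-asxml", "-utf8",
        "--word-2000", "yes",
        "--show-body-only", "yes",
        "-output", str(clean_xhtml),
        str(raw_html),
    ]


def word_to_clean_xhtml(doc: Path, workdir: Path) -> Path:
    """Convert a Word file to tidied XHTML and return the output path."""
    raw = workdir / "raw.html"      # hypothetical intermediate file name
    clean = workdir / "clean.html"  # hypothetical output file name
    subprocess.run(wvhtml_command(doc, raw), check=True)
    # Tidy exits with 1 when it only had warnings and 2 on real errors,
    # so check=True would be too strict here.
    result = subprocess.run(tidy_command(raw, clean))
    if result.returncode >= 2:
        raise RuntimeError(f"tidy failed on {raw}")
    return clean
```

Calling word_to_clean_xhtml(Path("report.doc"), Path("/tmp/work")) would
leave the tidied markup in /tmp/work/clean.html, ready for the next step.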

After that, I parse the result as XML and rebuild the entire content from
start to finish, making sure the system strips out everything that's not
allowed or wanted. XML is great for that: you walk it by tag, by
attribute, and by text node (content), and each tiny little bit can be
run through whatever checks you want to perform. :)
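
A minimal sketch of that rebuild step, again in Python and not the actual
code from my CMS. The whitelist, the href check and all names below are
assumptions for illustration. It also assumes the input is already
well-formed XML without namespaces or named entities such as &nbsp;
(Tidy's output-xml and numeric-entities options take care of that).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
ALLOWED = {
    "p": set(), "em": set(), "strong": set(), "br": set(),
    "ul": set(), "ol": set(), "li": set(),
    "a": {"href"},
}
# Tags that are thrown away together with everything inside them.
DROP_WITH_CONTENT = {"script", "style"}


def attr_ok(tag: str, name: str, value: str) -> bool:
    """Check one attribute; any extra per-attribute rules go here."""
    if name not in ALLOWED.get(tag, set()):
        return False
    if name == "href":
        # Example check: only plain links, no javascript: URLs.
        return value.lower().startswith(("http://", "https://", "mailto:", "/"))
    return True


def clean(node: ET.Element) -> str:
    """Rebuild one element from scratch, keeping only what is allowed."""
    if node.tag in DROP_WITH_CONTENT:
        return ""
    inner = escape(node.text or "")
    for child in node:
        inner += clean(child)
        inner += escape(child.tail or "")  # text that follows the child
    if node.tag not in ALLOWED:
        return inner  # unknown tag: drop the tag, keep its content
    if node.tag == "br":
        return "<br />"
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in node.attrib.items()
        if attr_ok(node.tag, name, value)
    )
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def sanitize(fragment: str) -> str:
    # Wrap the fragment so the parser sees a single root element; "root"
    # is not whitelisted, so only its cleaned content comes back.
    return clean(ET.fromstring(f"<root>{fragment}</root>"))
```

Because the output is built up from nothing rather than edited in place,
anything the checks don't explicitly let through simply never makes it
into the result.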


> I use Tidy to do this sort of cleaning. My clients copy and paste their
> texts from Word into TinyMCE, and then the CMS runs Tidy as a PHP module.
>
> http://tidy.sourceforge.net/
>
> I have experimented with different settings for Tidy, and these are the
> ones I found to work best in my CMS environment:
>
>             'wrap' => 0,
>             'char-encoding' => 'utf8',
>             'input-encoding' => 'utf8',
>             'output-encoding' => 'utf8',
>             'newline' => 'LF',
>             'doctype' => 'omit',
>             'write-back' => TRUE,
>             'quiet' => TRUE,
>             'indent' => 'auto',
>             'output-xml' => TRUE,
>             'bare' => TRUE,
>             'clean' => TRUE,
>             'logical-emphasis' => TRUE,
>             'drop-proprietary-attributes' => TRUE,
>             'drop-font-tags' => TRUE,
>             'break-before-br' => TRUE,
>             'numeric-entities' => TRUE,
>             'quote-nbsp' => FALSE,
>             'quote-marks' => TRUE,
>             'indent-attributes' => TRUE,
>             'enclose-text' => TRUE,
>             'enclose-block-text' => TRUE,
>             'word-2000' => TRUE,
>             'tidy-mark' => FALSE,
>             'literal-attributes' => TRUE,
>             'show-body-only' => TRUE,
>             'force-output' => TRUE,
>             'ascii-chars' => FALSE,
>             'output-bom' => FALSE
>
> /AndersN
> *********************************************************
> The CMS discussion list for http://webstandardsgroup.org/
> *********************************************************

-- 
Faruk Ates
Web consultant, designer, developer and project manager
www.kurafire.net - www.mediadesign.nl - www.happyclog.com