Nick Lo wrote:

With regard to "standards", I did look into processing the XML format that newer Word documents use, but since clients were running a variety of Word versions on different platforms, this was not an option. It is certainly something I should note down to investigate further.

I use Tidy for this sort of cleaning. My clients copy and paste their text from Word into TinyMCE, and the CMS then runs Tidy as a PHP module.

http://tidy.sourceforge.net/

I have experimented with different settings for Tidy, and these are the ones I found to work best in my CMS environment:

           'wrap' => 0,
           'char-encoding' => 'utf8',
           'input-encoding' => 'utf8',
           'output-encoding' => 'utf8',
           'newline' => 'LF',
           'doctype' => 'omit',
           'write-back' => TRUE,
           'quiet' => TRUE,
           'indent' => 'auto',
           'output-xml' => TRUE,
           'bare' => TRUE,
           'clean' => TRUE,
           'logical-emphasis' => TRUE,
           'drop-proprietary-attributes' => TRUE,
           'drop-font-tags' => TRUE,
           'break-before-br' => TRUE,
           'numeric-entities' => TRUE,
           'quote-nbsp' => FALSE,
           'quote-marks' => TRUE,
           'indent-attributes' => TRUE,
           'enclose-text' => TRUE,
           'enclose-block-text' => TRUE,
           'word-2000' => TRUE,
           'tidy-mark' => FALSE,
           'literal-attributes' => TRUE,
           'show-body-only' => TRUE,
           'force-output' => TRUE,
           'ascii-chars' => FALSE,
           'output-bom' => FALSE
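
For anyone wanting to try this, here is a minimal sketch of how such an array could be handed to the PHP tidy extension. The function name clean_word_html() is made up for illustration, and only a subset of the settings above is used; which options are accepted depends on the libtidy version installed, so treat the option list as an assumption to adjust rather than a fixed recipe.

```php
<?php
// Hypothetical helper (not from the CMS described above): runs pasted
// Word markup through the PHP tidy extension with a reduced option set.
function clean_word_html(string $html): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The PHP tidy extension is not loaded.');
    }

    // Subset of the settings listed above; extend it as your Tidy version allows.
    $config = [
        'wrap'                        => 0,
        'output-xml'                  => true,
        'bare'                        => true,
        'clean'                       => true,
        'word-2000'                   => true,
        'drop-proprietary-attributes' => true,
        'show-body-only'              => true,
        'tidy-mark'                   => false,
        'force-output'                => true,
    ];

    // The third argument sets the input and output encoding in one go.
    $result = tidy_repair_string($html, $config, 'utf8');

    if ($result === false) {
        throw new RuntimeException('Tidy could not process the markup.');
    }

    return $result;
}

// Example call with a fragment similar to what Word pastes into TinyMCE.
$pasted = '<p class=MsoNormal><o:p></o:p>Hello <b>world</b></p>';
$cleaned = clean_word_html($pasted);
```

The encoding is passed as a function argument here instead of through the three encoding settings; either approach should work, but mixing them is probably best avoided.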

/AndersN
*********************************************************
The CMS discussion list for http://webstandardsgroup.org/
*********************************************************
