Nick Lo wrote:
With regard to "standards", I did look into processing the XML format
that newer Word documents use, but since clients were running a variety
of Word versions on a variety of platforms, this was not an option. It
is certainly something I must note down to investigate further.
I use Tidy to do this sort of cleaning. My clients copy and paste their
texts from Word into TinyMCE, and then the CMS runs Tidy as a PHP module.
http://tidy.sourceforge.net/
I have experimented with different settings for Tidy, and these are the
ones I found to work best in my CMS environment:
'wrap' => 0,
'char-encoding' => 'utf8',
'input-encoding' => 'utf8',
'output-encoding' => 'utf8',
'newline' => 'LF',
'doctype' => 'omit',
'write-back' => TRUE,
'quiet' => TRUE,
'indent' => 'auto',
'output-xml' => TRUE,
'bare' => TRUE,
'clean' => TRUE,
'logical-emphasis' => TRUE,
'drop-proprietary-attributes' => TRUE,
'drop-font-tags' => TRUE,
'break-before-br' => TRUE,
'numeric-entities' => TRUE,
'quote-nbsp' => FALSE,
'quote-marks' => TRUE,
'indent-attributes' => TRUE,
'enclose-text' => TRUE,
'enclose-block-text' => TRUE,
'word-2000' => TRUE,
'tidy-mark' => FALSE,
'literal-attributes' => TRUE,
'show-body-only' => TRUE,
'force-output' => TRUE,
'ascii-chars' => FALSE,
'output-bom' => FALSE
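
For anyone who wants to try this, here is a minimal sketch of how such a
settings array might be handed to the PHP tidy extension. It assumes the
extension is loaded; clean_pasted_html() is a hypothetical helper name, not
something from my CMS, and only a subset of the settings above is shown
because some of them may not be recognised by every Tidy version.

```php
<?php
// A subset of the settings listed above; the rest are passed the same way.
$config = [
    'wrap'           => 0,
    'output-xml'     => true,
    'bare'           => true,
    'clean'          => true,
    'word-2000'      => true,
    'show-body-only' => true,
    'tidy-mark'      => false,
];

// Hypothetical helper: cleans markup pasted from Word via the editor.
function clean_pasted_html(string $html, array $config): string
{
    if (!function_exists('tidy_repair_string')) {
        throw new RuntimeException('The PHP tidy extension is not loaded.');
    }
    // tidy_repair_string() parses and repairs the markup in one call;
    // the third argument sets the encoding for both input and output.
    $result = tidy_repair_string($html, $config, 'utf8');

    // Fall back to the original markup if Tidy reports a failure.
    return $result === false ? $html : $result;
}

// Example run, only attempted when the extension is available.
if (function_exists('tidy_repair_string')) {
    $pasted = '<p class="MsoNormal"><font face="Arial">Hello from Word</font></p>';
    echo clean_pasted_html($pasted, $config), "\n";
}
```

The exact output depends on the Tidy version installed, so it is worth
comparing a few real pasted documents before and after cleaning.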
/AndersN
*********************************************************
The CMS discussion list for http://webstandardsgroup.org/
*********************************************************