I tried out antiword-0.32 from
http://www.winfield.demon.nl/
however, I always get the error message
I can't open your mapping file (UTF-8.txt)
or, if I switch to the C locale,
I can't open your mapping file (8859-1.txt)
I couldn't find these files in the distribution. I also don't understand
the need for these mapping files, because for the conversion from Unicode
to ISO 8859-1 and to UTF-8 there exist obvious algorithms, and no mapping
tables should be needed.
What did I misunderstand?
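
Just to illustrate what I mean by obvious algorithms, here is a quick
sketch (the function names are made up for this mail, and it only covers
the Basic Multilingual Plane): ISO 8859-1 is simply the first 256 Unicode
code points, and UTF-8 is pure bit shifting; no table is needed in either
case.

#include <stddef.h>

/* Unicode -> ISO 8859-1: Latin-1 is just the first 256 code points */
static int
iUnicodeToLatin1(unsigned long ulChar)
{
	return ulChar < 0x100 ? (int)ulChar : -1;	/* -1 = not representable */
}

/* Unicode -> UTF-8: pure bit shifting, no tables (BMP only, for brevity) */
static size_t
tUnicodeToUtf8(unsigned long ulChar, char *szOut)
{
	if (ulChar < 0x80) {
		szOut[0] = (char)ulChar;
		return 1;
	}
	if (ulChar < 0x800) {
		szOut[0] = (char)(0xc0 | (ulChar >> 6));
		szOut[1] = (char)(0x80 | (ulChar & 0x3f));
		return 2;
	}
	szOut[0] = (char)(0xe0 | (ulChar >> 12));
	szOut[1] = (char)(0x80 | ((ulChar >> 6) & 0x3f));
	szOut[2] = (char)(0x80 | (ulChar & 0x3f));
	return 3;
}
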
I have also attached a patch that extends Unicode case conversion to all
scripts on modern systems (such as any Linux with glibc 2.2 or newer).
If you are unfamiliar with the new wide character functions of ISO C99
that are already fully implemented under Linux, have a look at
http://www.cl.cam.ac.uk/~mgk25/volatile-kd67h/ISO-C-FDIS.1999-04.txt
Suggestion:
I think it would be much better if the conversion from Unicode to 8-bit
legacy encodings [currently in chartrans.c:ulTranslateCharacters()] and
the conversion from Unicode to UTF-8 [currently in
word2text.c:vStoreCharacter()] happened in the same function.
Reason:
This would make it possible to replace antiword's entire Unicode to local
encoding conversion, in one single place, with a call to the ISO C
wcrtomb() function, which converts a wide character into its
locale-dependent multi-byte representation.
On modern systems (for example any Linux with glibc 2.2 or newer), wchar_t
wide characters are always encoded in ISO 10646/Unicode, a guarantee that
an ISO C99 compiler signals by defining the __STDC_ISO_10646__ macro.
Glibc 2.2 already has conversion tables for practically every existing
character encoding built in, it can also convert to UTF-8, and it does all
of this in a locale-dependent way, so you do not even have to worry about
selecting a character encoding.
It would be nice if, on systems that define __STDC_ISO_10646__, the entire
Unicode to local encoding conversion could be left to the C library, so
that users wouldn't have to worry about installing separate mapping tables
for every single tool (including antiword). The built-in UTF-8 encoder
remains useful on older C compilers that don't define __STDC_ISO_10646__,
though, so the call to wcrtomb() should certainly be wrapped in
#ifdef __STDC_ISO_10646__.
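
To make this concrete, here is a rough sketch of what such a combined
output routine might look like (vEmitChar is a name I made up for this
mail; antiword's existing ulTranslateCharacters()/vStoreCharacter() logic
would take the place of the fallback branch):

#include <limits.h>
#include <stdio.h>
#ifdef __STDC_ISO_10646__
#include <wchar.h>
#endif

static void
vEmitChar(unsigned long ulChar, FILE *pOutFile)
{
#ifdef __STDC_ISO_10646__
	char	szBuf[MB_LEN_MAX];
	size_t	tLen;

	/* wchar_t holds ISO 10646 code points here, so the C library
	 * does the whole locale-dependent conversion for us */
	tLen = wcrtomb(szBuf, (wchar_t)ulChar, NULL);
	if (tLen != (size_t)-1) {
		(void)fwrite(szBuf, 1, tLen, pOutFile);
		return;
	}
	/* not representable in the current locale: fall through */
#endif
	/* this is where the existing table-driven legacy/UTF-8 code
	 * (or a smarter ASCII approximation) would go */
	(void)putc(ulChar < 0x80 ? (int)ulChar : '?', pOutFile);
}

Passing NULL as the mbstate_t pointer makes wcrtomb() use an internal
state object; in a real patch one would probably pass an explicit,
zero-initialized mbstate_t instead.
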
Somewhere near the beginning, the following code should also be added (it
needs #include <locale.h>), so that the C library is told about the
selected locale:

#ifdef __STDC_ISO_10646__
	if (!setlocale(LC_CTYPE, "")) {
		fprintf(stderr, "Can't set the specified locale! "
		    "Check LANG, LC_CTYPE, LC_ALL.\n");
		exit(1);
	}
#endif

Then you can print, for example, the Euro sign (U+20AC) as in

#ifdef __STDC_ISO_10646__
	if (printf("%lc", 0x20ac) < 0)	/* try C's multibyte output mechanism */
#endif
		printf("EUR");		/* if that fails: ASCII fallback */

without having to worry about conversion tables or locale settings.
Cheers,
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
--- antiword.0.32/chartrans.c Tue Aug 21 19:06:09 2001
+++ antiword/chartrans.c Sat Oct 13 21:14:59 2001
@@ -9,6 +9,9 @@
 #include <stdlib.h>
 #include <string.h>
 #include <ctype.h>
+#ifdef __STDC_ISO_10646__
+#include <wctype.h>
+#endif
 #include "antiword.h"
 
 static const unsigned short usCp1252[] = {
@@ -373,6 +376,11 @@
 unsigned long
 ulToUpper(unsigned long ulChar)
 {
+#ifdef __STDC_ISO_10646__
+	/* If this is ISO C99 and all locales have wchar_t = ISO 10646
+	 * (e.g., glibc 2.2 or newer), then use standard function */
+	return towupper(ulChar);
+#else
 	if (ulChar < 0x80) {
 		/* US ASCII: use standard function */
 		return (unsigned long)toupper((int)ulChar);
@@ -386,4 +394,5 @@
 		return ulChar & ~0x20;
 	}
 	return ulChar;
+#endif
 } /* end of ulToUpper */