Re: API for reading M$ word documents on Linux???

Beni Cherniavsky Sun, 06 Jul 2003 11:59:58 -0700

Shachar Shemesh wrote on 2003-07-06:

> Oleg Goldshmidt wrote:
>
> >Shachar Shemesh <[EMAIL PROTECTED]> writes:
> >
> >>Isn't Word Unicode? Doesn't that mean strings will be, pretty much,
> >>useless with it?
> >>
On old versions of Word / Windows (e.g. on Win95) the doc is saved with
single-byte encodings (e.g. cp1255); newer ones save it as Unicode.  I'm
pretty sure it never uses UTF-8 but always UTF-16 (little-endian, as is
usual on Windows).
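
To make the difference concrete, here is a small Python sketch of how
the same text looks in a single-byte code page versus UTF-16.  The sample
string is made up for illustration, and little-endian byte order is
assumed; this says nothing about where Word actually puts the text inside
the file.

```python
# Hypothetical sample text: an ASCII word followed by a Hebrew word.
text = "Word \u05e9\u05dc\u05d5\u05dd"

legacy = text.encode("cp1255")    # single-byte code page: one byte per character
wide = text.encode("utf-16-le")   # UTF-16, low byte first: two bytes per character

print(legacy)
print(wide)

# A plain byte search for the ASCII word only matches the single-byte form,
# because in UTF-16 every ASCII letter is followed by a zero byte.
print(b"Word" in legacy)
print(b"Word" in wide)
print("Word".encode("utf-16-le") in wide)
```

This is also why a byte-oriented grep on the raw file finds ASCII words
in the old format but not in the new one.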


> >To tell you the truth, last time I did that was a few years ago, but
> >then it was a great way to read Word documents that did not have too
> >much graphics in them. Maybe things changed since then, in which case
> >I'll apologize humbly. I also have no idea if the -e option will be of
> >any use.
> >
> >I must admit I am mostly ignorant about Unicode and other stuff like
> >that. Will it affect grep (which is, IIRC, the OP's ultimate goal)?
> >
> Depends.
>
> If you are using Unicode encoded with UTF-8, all ASCII characters remain
> the same, and it follows trivially that grep's operation does not change.
>
> If you are using UTF-16 (as most Windows apps do. More precisely, most
> Windows apps that use Unicode do, which makes it an insignificant
> minority of Windows apps, but that's a different story), then ASCII
> characters will appear as one ASCII character, one NUL byte. Under those
> conditions, I can imagine both grep and strings will have a difficult
> time parsing the file.
>
Grep - yes, probably.  Using iconv before the grep would help, except
that iconv doesn't seem to have a mode where errors are not fatal.
``recode -f`` from stdin seems to work, but doesn't catch all strings.
The main problem is probably that UTF-16 is not self-synchronizing --
I doubt UTF-16 strings in doc files are aligned on even bytes :-).
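
One way around the alignment problem is to skip decoding the whole file
and instead scan the raw bytes for runs that look like UTF-16 text.  A
minimal Python sketch, assuming little-endian UTF-16 and only printable
ASCII characters; the function name ``utf16le_ascii_strings`` and the
minimum length of 4 are my own choices, not part of any tool:

```python
import re


def utf16le_ascii_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters that are
    stored as UTF-16LE (each character followed by a zero byte).

    The regex engine tries every byte offset, so runs starting on odd
    offsets are found as well as runs starting on even ones.
    """
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]


# Made-up test buffer: three junk bytes push the first string to an odd offset.
blob = (
    b"\x01\x02\x03"
    + "hello world".encode("utf-16-le")
    + b"\xff\xfe\x00"
    + "ab".encode("utf-16-le")      # too short, should be skipped
    + b"\x07"
    + "second".encode("utf-16-le")
)

for s in utf16le_ascii_strings(blob):
    print(s)
```

The results can then be piped to grep as ordinary text.  Catching Hebrew
letters as well would need the character class widened, e.g. with an
alternative like ``[\xd0-\xea]\x05`` for U+05D0..U+05EA.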

`man strings` contains:
    -e encoding
    --encoding=encoding
        Select the character encoding of the strings that are to be
        found. Possible values for encoding are: s = single-7-bit-byte
        characters (ASCII, ISO 8859, etc., default), S =
        single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit
        littleendian, B = 32-bit bigendian, L = 32-bit littleendian.
        Useful for finding wide character strings.

However, it doesn't seem to catch anything for me, with either ``l``
or ``b``.

Best way is to use ``antiword -m UTF-8.txt`` to support all cases ;-).

-- 
Beni Cherniavsky <[EMAIL PROTECTED]>

Israel is moving to 7-digit cellphone numbers since the current
6-digit scheme, although prolonged for some time by supernetting,
"comes to the end of its useful life, once again due to address space
exhaustion" [RFC 1606 on IPv9 :-].  Why won't they just use DNS?

