Re: any shortcuts to doc to ascii?

2010-05-28 Thread Gary Kline
On Fri, May 28, 2010 at 10:45:38AM -0400, Bob Hall wrote:
> On Thu, May 27, 2010 at 10:53:39PM -0700, Gary Kline wrote:
> > On Thursday 27 May 2010 05:18:07 pm Bob Hall wrote:
> > > On Thu, May 27, 2010 at 04:36:08PM -0700, Gary Kline wrote:
> > > > ps: antiword same as catdoc.  back to my perl substitutions.
> > > > that works, along with vi's built-in subs.
> > > 
> > > Have you considered using whatever replaces the most special characters,
> > > and fixing the few characters that remain with sed?
> > 
> > exactly!!!
> 
> Another possibility, if you haven't considered it, is using sed to
> convert everything. If you know all the characters that need to be
> swapped out, you can write a sed script that will do it for you in one
> pass. If you don't know sed, creating the script may be a PITA, but
> you'll only have to do it once, and then you can reuse the script
> whenever needed.
> 
> As I recall, the hard part is figuring out how to represent the special
> characters in sed. It's been a few years since I used sed on doc files,
> but I recall that the character codes that displayed on my screen were
> not the codes that I needed to use in sed scripts.


the DOC file i was trying to convert is only around 250 lines
[ ascii ] and i finished it, kwik-and-dirty, with perl, sed,
and vi's regex.  it prob'ly isn't worth merely complaining
about.  doing it one time will, as you point out, let me
reuse the script hundreds of times.

(i bot a sed and awk book a few years ago.  time to get
serious!)

tx much

gary


> ___
> freebsd-questions@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"

-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
The 7.83a release of Jottings: http://jottings.thought.org/index.php
   http://journey.thought.org  99 44/100% Guaranteed Novel



Re: any shortcuts to doc to ascii?

2010-05-28 Thread Nikos Vassiliadis

Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline wrote:
> > i don't see any ascii suffix [for OOo].  i saved as .txt.
>
> This should be right. The .txt extension refers to ASCII text,
> at least in standard-compliant operating systems.
>
> > same krap.  the \x94, \x9d, \x9c...  same with catdoc.  i'll
> > try antiword.  [forgot about that.  ]
>
> This makes me believe that the original DOC file has been created
> with a wrong character set or language setting. "Windows" - as far
> as I know - does not use standard locales as all other systems
> do, but uses an arbitrary setting.



Those bytes are valid UTF-8 encoded text:
[...@moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text
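
For readers on Python 3, where the print statement above no longer works as written, a minimal sketch of the same check follows. The variable names are illustrative only; it decodes the bytes as UTF-8 and asks the standard unicodedata module which character the three-byte sequence represents.

```python
import unicodedata

# The bytes from the original post: "Don", the three-byte sequence, then "t".
raw = b"Don\xe2\x80\x99t"

# Strict decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

ch = text[3]
print(len(text))             # 5 characters, although raw holds 7 bytes
print("U+%04X" % ord(ch))    # U+2019
print(unicodedata.name(ch))  # RIGHT SINGLE QUOTATION MARK
```

So the three bytes are a single character, not three separate ones.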

You'll be able to see the character if you fire up a UTF-8 capable 
terminal with proper locale settings.

[...@moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8

After that, just print the char:
python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'
and use copy & paste to pass it to tr to translate it to something else, 
for example:

tr '’' "'" < "$file" > "$output"


> Another idea may be that the character that you think should be
> an apostrophe isn't an apostrophe. I often see this in German
> texts with misplaced apostrophes that are in fact grave or acute
> accents, or a Unicode character that just looks like
> an apostrophe. For example, if the original document contains
>
> We don`t
>
> and this ` is not a real ', then conversion tools will of course
> use the "escape notation" for this unknown character.


Indeed, the standard tool for encoding translations, iconv, chokes on
this. Yet, it worked when I tried to convert from UTF-8 to a Greek
encoding ('iconv -f utf-8 -t iso-8859-7'). Some info on the char:

http://www.fileformat.info/info/unicode/char/2019/index.htm
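
A probable explanation, sketched with Python's built-in codecs rather than iconv: U+2019 has no slot in ASCII, so a strict conversion has nothing to map it to, while ISO-8859-7 does include the character. The sample string is the one from this thread; the sketch does not reproduce iconv's own behaviour or messages.

```python
text = "Don\u2019t"

# ASCII has no code for U+2019, so a strict encode fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    print("ascii:", err.reason)

# ISO-8859-7 (Greek) contains the character, so the conversion succeeds.
greek = text.encode("iso-8859-7")
print("iso-8859-7:", greek)

# Substituting a plain apostrophe first makes the ASCII conversion possible.
plain = text.replace("\u2019", "'").encode("ascii")
print("ascii after substitution:", plain)
```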

HTH, Nikos


Re: any shortcuts to doc to ascii?

2010-05-28 Thread Bob Hall
On Fri, May 28, 2010 at 10:45:38AM -0400, Bob Hall wrote:
> Another possibility, if you haven't considered it, is using sed to
> convert everything. If you know all the characters that need to be

Never mind. I just remembered about the garbage at the beginning of doc
files. I had forgotten that I was using both sed and awk to deal with that
when I was working with doc files.


Re: any shortcuts to doc to ascii?

2010-05-28 Thread Bob Hall
On Thu, May 27, 2010 at 10:53:39PM -0700, Gary Kline wrote:
> On Thursday 27 May 2010 05:18:07 pm Bob Hall wrote:
> > On Thu, May 27, 2010 at 04:36:08PM -0700, Gary Kline wrote:
> > >   ps: antiword same as catdoc.  back to my perl substitutions.
> > >   that works, along with vi's built-in subs.
> > 
> > Have you considered using whatever replaces the most special characters,
> > and fixing the few characters that remain with sed?
> 
> exactly!!!

Another possibility, if you haven't considered it, is using sed to
convert everything. If you know all the characters that need to be
swapped out, you can write a sed script that will do it for you in one
pass. If you don't know sed, creating the script may be a PITA, but
you'll only have to do it once, and then you can reuse the script
whenever needed.

As I recall, the hard part is figuring out how to represent the special
characters in sed. It's been a few years since I used sed on doc files,
but I recall that the character codes that displayed on my screen were
not the codes that I needed to use in sed scripts.
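
One likely reason for that mismatch is that the same character has different byte codes in different encodings, and a terminal may display yet another escaped form. The small Python sketch below (not sed; the sample characters are an assumption about what a typical DOC export contains) prints the bytes that a byte-oriented script would have to match.

```python
# Typographic characters that word processors commonly substitute for ASCII ones.
samples = {
    "right single quote": "\u2019",
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "em dash": "\u2014",
}

for label, ch in samples.items():
    utf8 = ch.encode("utf-8")      # bytes found in a UTF-8 text export
    cp1252 = ch.encode("cp1252")   # byte found in a Windows-1252 export
    print("%-20s utf-8=%s cp1252=%s" % (label, utf8.hex(), cp1252.hex()))
```

The UTF-8 forms end in 99, 9c, 9d and 94, which is consistent with the \x94, \x9d and \x9c reported elsewhere in this thread and suggests the file holds UTF-8 curly quotes and an em dash.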


Re: any shortcuts to doc to ascii?

2010-05-28 Thread Polytropon
On Thu, 27 May 2010 16:36:08 -0700, Gary Kline  wrote:
>   i don't see any ascii suffix [for OOo].  i saved as .txt.

This should be right. The .txt extension refers to ASCII text,
at least in standard-compliant operating systems.



>   same krap.  the \x94, \x9d, \x9c...  same with catdoc.  i'll
>   try antiword.  [forgot about that.  ]

This makes me believe that the original DOC file has been created
with a wrong character set or language setting. "Windows" - as far
as I know - does not use standard locales as all other systems
do, but uses an arbitrary setting.

Another idea may be that the character that you think should be
an apostrophe isn't an apostrophe. I often see this in German
texts with misplaced apostrophes that are in fact grave or acute
accents, or a Unicode character that just looks like
an apostrophe. For example, if the original document contains

We don`t

and this ` is not a real ', then conversion tools will of course
use the "escape notation" for this unknown character. Other
characters that may lead to such "escape notation" replacements
can be quotation marks (usually typographical ones), ellipsis
and hyphens.
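
A minimal sketch of a replacement table for such characters, in Python rather than sed. The particular characters and their ASCII stand-ins are assumptions chosen for illustration, not a complete list, and the function name to_ascii is hypothetical.

```python
# Map typographic characters to ASCII stand-ins (illustrative, not exhaustive).
REPLACEMENTS = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (used as an apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
})


def to_ascii(text):
    """Apply the table in one pass; anything still non-ASCII becomes '?'."""
    translated = text.translate(REPLACEMENTS)
    return translated.encode("ascii", errors="replace").decode("ascii")


sample = "We Don\u2019t \u201cquote\u201d \u2014 ever\u2026"
print(to_ascii(sample))  # We Don't "quote" -- ever...
```

Characters missing from the table show up as '?', which makes it easy to spot what still needs an entry.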

I know I'm saying this too often, but you wouldn't have such
problems with LaTeX. :-)



> > I'm not sure how far conflicting codepages may be involved.
> > It is known that "Windows" does have problems supporting standards,
> > and this applies to character sets and language variations, too.
> > 
> 
>   your words could be emblazoned in 24k gold on some Monument
>   of Truth. 

It's my job - I'm working for the Ministry of Truth. :-)



> i've been fighting going from mac to OOo and back...

Keep on fighting - I've got a new idea. It's much more complicated
than using OpenOffice for conversion - but it MIGHT work.

1. Open the DOC file in OpenOffice.

2. Mark all content you want to convert, e.g. with Ctrl+A.

3. Copy it into the edit buffer (clipboard) with Ctrl+C.

4. Open KDE's text editor (or any other text editor you have
   installed) and paste the edit buffer with Ctrl+V.

5. Save the file you now have in the editor. It should be all in
   ASCII and with correct interpretation of "special characters".

I don't have a test setup here, so I cannot promise that this
will compensate for malformed encodings, but if OpenOffice shows a
character as an apostrophe, it should be transferred exactly as
that through the edit buffer.



>   ps: antiword same as catdoc.  back to my perl substitutions.
>   that works, along with vi's built-in subs.

The joy of modern programs: You start to do everything manually
again. :-)




-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


Re: any shortcuts to doc to ascii?

2010-05-27 Thread Gary Kline
On Thursday 27 May 2010 05:18:07 pm Bob Hall wrote:
> On Thu, May 27, 2010 at 04:36:08PM -0700, Gary Kline wrote:
> > ps: antiword same as catdoc.  back to my perl substitutions.
> > that works, along with vi's built-in subs.
> 
> Have you considered using whatever replaces the most special characters,
> and fixing the few characters that remain with sed?

exactly!!!

[from pc-bsd//kmail]


Re: any shortcuts to doc to ascii?

2010-05-27 Thread Bob Hall
On Thu, May 27, 2010 at 04:36:08PM -0700, Gary Kline wrote:
>   ps: antiword same as catdoc.  back to my perl substitutions.
>   that works, along with vi's built-in subs.
Have you considered using whatever replaces the most special characters,
and fixing the few characters that remain with sed?


Re: any shortcuts to doc to ascii?

2010-05-27 Thread Gary Kline
On Thu, May 27, 2010 at 05:03:02AM +0200, Polytropon wrote:
> On Wed, 26 May 2010 18:38:47 -0700, Gary Kline  wrote:
> > 
> > 
> > guys,
> > 
> > is there anything that can take these hex triplets such as
> > 
> > We Don\xe2\x80\x99t
> > 
> > and render them back to the ascii or keyboard equivalents?
> > in this case, the \x99 would be an apostrophe.
> > thus:
> > 
> > 
> > We Don't
> > 
> > tia,
> > 
> > gary
> > 
> > ps: even lynx -dump messes up, i believe.  i'm trying to go from
> > DOC  back to typewriter 
> 
> 
> Yes, even a typewriter is better than DOC. :-)
> 



man, you got that right!!


> To process DOC files into ASCII, there are several ways, with
> different complexity:
> 
> Most complex ones: Use OpenOffice or Abiword, open the file and
> save it as ASCII. Included "special characters" should be in
> regular ASCII representation now.
> 
> Better: Use (from ports) catdoc or antiword.
> 


i don't see any ascii suffix [for OOo].  i saved as .txt.
same krap.  the \x94, \x9d, \x9c...  same with catdoc.  i'll
try antiword.  [forgot about that.  ]

> I'm not sure how far conflicting codepages may be involved.
> It is known that "Windows" does have problems supporting standards,
> and this applies to character sets and language variations, too.
> 

your words could be emblazoned in 24k gold on some Monument
of Truth.  i've been fighting going from mac to OOo and back...
(**)

thanks.

gary

ps: antiword same as catdoc.  back to my perl substitutions.
that works, along with vi's built-in subs.

pps::: 

> 
> 
> -- 
> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...

-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
The 7.83a release of Jottings: http://jottings.thought.org/index.php
   http://journey.thought.org  99 44/100% Guaranteed Novel



Re: any shortcuts to doc to ascii?

2010-05-26 Thread Polytropon
On Wed, 26 May 2010 18:38:47 -0700, Gary Kline  wrote:
> 
> 
> guys,
> 
> is there anything that can take these hex triplets such as
> 
> We Don\xe2\x80\x99t
> 
> and render them back to the ascii or keyboard equivalents?
> in this case, the \x99 would be an apostrophe.
> thus:
> 
> 
> We Don't
> 
> tia,
> 
> gary
> 
> ps: even lynx -dump messes up, i believe.  i'm trying to go from
> DOC  back to typewriter 


Yes, even a typewriter is better than DOC. :-)

To process DOC files into ASCII, there are several ways, with
different complexity:

Most complex ones: Use OpenOffice or Abiword, open the file and
save it as ASCII. Included "special characters" should be in
regular ASCII representation now.

Better: Use (from ports) catdoc or antiword.

I'm not sure how far conflicting codepages may be involved.
It is known that "Windows" does have problems supporting standards,
and this applies to character sets and language variations, too.



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


any shortcuts to doc to ascii?

2010-05-26 Thread Gary Kline


guys,

is there anything that can take these hex triplets such as

We Don\xe2\x80\x99t

and render them back to the ascii or keyboard equivalents?
in this case, the \x99 would be an apostrophe.
thus:


We Don't

tia,

gary

ps: even lynx -dump messes up, i believe.  i'm trying to go from
DOC  back to typewriter 
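
If the text really contains the literal characters backslash, x, e, 2 and so on (rather than a viewer displaying raw bytes that way), one possible approach is sketched below in Python. The function name unescape_hex is hypothetical, and the sketch assumes the escaped bytes are UTF-8.

```python
import re

# Matches one or more consecutive literal escapes such as \xe2\x80\x99.
HEX_RUN = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")


def unescape_hex(text):
    """Turn runs of literal \\xNN escapes into the characters they encode."""
    def decode_run(match):
        hex_digits = match.group(0).replace("\\x", "")
        # Assumption: the bytes are UTF-8; undecodable bytes become U+FFFD.
        return bytes.fromhex(hex_digits).decode("utf-8", errors="replace")
    return HEX_RUN.sub(decode_run, text)


line = r"We Don\xe2\x80\x99t"
fixed = unescape_hex(line)
print(fixed)                         # We Don’t (curly apostrophe, U+2019)
print(fixed.replace("\u2019", "'"))  # We Don't
```

The whole triplet \xe2\x80\x99 decodes to one character, so it has to be converted as a unit rather than byte by byte.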



-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
The 7.83a release of Jottings: http://jottings.thought.org/index.php
   http://journey.thought.org  99 44/100% Guaranteed Novel
