Re: any shortcuts to doc to ascii?
On Fri, May 28, 2010 at 10:45:38AM -0400, Bob Hall wrote:
> On Thu, May 27, 2010 at 10:53:39PM -0700, Gary Kline wrote:
> > On Thursday 27 May 2010 05:18:07 pm Bob Hall wrote:
> > > On Thu, May 27, 2010 at 04:36:08PM -0700, Gary Kline wrote:
> > > > ps: antiword same as catdoc. back to my perl substitutions.
> > > > that works, along with vi's Builtin subs.
> > >
> > > Have you considered using whatever replaces the most special
> > > characters, and fixing the few characters that remain with sed?
> >
> > exactly!!!
>
> Another possibility, if you haven't considered it, is using sed to
> convert everything. If you know all the characters that need to be
> swapped out, you can write a sed script that will do it for you in one
> pass. If you don't know sed, creating the script may be a PITA, but
> you'll only have to do it once, and then you can reuse the script
> whenever needed.
>
> As I recall, the hard part is figuring out how to represent the special
> characters in sed. It's been a few years since I used sed on doc files,
> but I recall that the character codes that displayed on my screen were
> not the codes that I needed to use in sed scripts.

the DOC file i was trying to convert is only around 250 lines [ ascii ]
and i finished it, kwik-and-dirty, with perl, sed, and vi's regex. it
prob'ly isn't worth merely complaining about. doing it one time will, as
you point out, let me reuse the script hundreds of times. (i bot a sed
and awk book a few years ago. time to get serious!)

tx much

gary

--
Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
The 7.83a release of Jottings: http://jottings.thought.org/index.php
http://journey.thought.org  99 44/100% Guaranteed Novel
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscr...@freebsd.org"
Re: any shortcuts to doc to ascii?
Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline wrote:
> > i don't see any ascii suffix [for OOo]. i saved as .txt.
>
> This should be right. The .txt extension refers to ASCII text,
> at least in standards-compliant operating systems.
>
> > same krap. the \x94, \x9d, \x9c... same with catdoc. i'll
> > try antiword. [forgot about that. ]
>
> This makes me believe that the original DOC file has been
> created with a wrong character set or language setting.
> "Windows" - as far as I know - does not use standard locales
> as all other systems do, but uses an arbitrary setting.

It is valid UTF-8 encoded text:

[...@moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text

You'll be able to see the character if you fire up a UTF-8 capable
terminal with proper locale settings.

[...@moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8

After that, just print the char:

python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'

and use copy & paste to pass it to tr to translate it to something
else, for example:

tr '’' "'" < $file > $output

> Another idea may be that the character that you think should
> be an apostrophe isn't an apostrophe. I often see this in
> German texts with misplaced apostrophes that are in fact grave
> or acute accents, or a UTF-8 encoded character that just looks
> like an apostrophe. For example, if the original document
> contains
>
>     We don`t
>
> and this ` is not a real ', then conversion tools will of
> course use the "escape notation" for this unknown character.

Indeed, the standard tool for encoding translations, iconv, chokes on
this. Yet, it worked when I tried to convert from UTF-8 to a Greek
encoding ('iconv -f utf-8 -t iso-8859-7').

Some info on the char:
http://www.fileformat.info/info/unicode/char/2019/index.htm

HTH, Nikos
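For anyone who would rather stay in one tool than copy and paste into
tr, here is a minimal sketch of the same idea in Python 3 (the
one-liners above are Python 2). It assumes the text really is UTF-8, as
the file(1) output suggests. The variable names are made up for
illustration.

```python
# The byte triplet from the thread: UTF-8 for U+2019
# (RIGHT SINGLE QUOTATION MARK).
raw = b"We Don\xe2\x80\x99t"

# Decode first, so the three bytes become a single character.
text = raw.decode("utf-8")

# Plain ASCII has no slot for U+2019, which is why a straight
# conversion to ASCII fails instead of producing an apostrophe.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# ISO 8859-7 does contain U+2019, so this conversion goes through.
greek = text.encode("iso-8859-7")

# Map the typographic apostrophe to the ASCII one by hand.
ascii_text = text.replace("\u2019", "'")
print(ascii_text)
```

The decode step is what matters: once the three bytes are one
character, a single replace is enough.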
Re: any shortcuts to doc to ascii?
On Fri, May 28, 2010 at 10:45:38AM -0400, Bob Hall wrote:
> Another possibility, if you haven't considered it, is using sed to
> convert everything. If you know all the characters that need to be

Never mind. I just remembered about the garbage at the beginning of doc
files. I had forgotten that I was using both sed and awk to deal with
that when I was working with doc files.
Re: any shortcuts to doc to ascii?
On Thu, May 27, 2010 at 10:53:39PM -0700, Gary Kline wrote:
> On Thursday 27 May 2010 05:18:07 pm Bob Hall wrote:
> > On Thu, May 27, 2010 at 04:36:08PM -0700, Gary Kline wrote:
> > > ps: antiword same as catdoc. back to my perl substitutions.
> > > that works, along with vi's Builtin subs.
> >
> > Have you considered using whatever replaces the most special
> > characters, and fixing the few characters that remain with sed?
>
> exactly!!!

Another possibility, if you haven't considered it, is using sed to
convert everything. If you know all the characters that need to be
swapped out, you can write a sed script that will do it for you in one
pass. If you don't know sed, creating the script may be a PITA, but
you'll only have to do it once, and then you can reuse the script
whenever needed.

As I recall, the hard part is figuring out how to represent the special
characters in sed. It's been a few years since I used sed on doc files,
but I recall that the character codes that displayed on my screen were
not the codes that I needed to use in sed scripts.
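The same one-pass, write-once-and-reuse idea can be sketched outside
sed. The Python 3 version below is only an illustration: the table
contents and the names REPLACEMENTS, TABLE and to_ascii are assumptions,
not something taken from the thread. Decoding the bytes as UTF-8 first
likely sidesteps the mismatch between the codes shown on screen and the
codes the script needs, because each character is then named by its
Unicode code point.

```python
# Hypothetical replacement table; extend it with whatever the file contains.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / typographic apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# Build the translation table once; it can be reused for every file.
TABLE = str.maketrans(REPLACEMENTS)


def to_ascii(text: str) -> str:
    """Replace every character listed in REPLACEMENTS in a single pass."""
    return text.translate(TABLE)


# Sample bytes in the style of the thread, decoded as UTF-8 before use.
sample = (
    b"\xe2\x80\x9cWe Don\xe2\x80\x99t\xe2\x80\x9d "
    b"\xe2\x80\x94 really\xe2\x80\xa6"
).decode("utf-8")

print(to_ascii(sample))
```

Characters missing from the table pass through unchanged, so any
leftovers can still be fixed by hand afterwards.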
Re: any shortcuts to doc to ascii?
On Thu, 27 May 2010 16:36:08 -0700, Gary Kline wrote:
> i don't see any ascii suffix [for OOo]. i saved as .txt.

This should be right. The .txt extension refers to ASCII text,
at least in standards-compliant operating systems.

> same krap. the \x94, \x9d, \x9c... same with catdoc. i'll
> try antiword. [forgot about that. ]

This makes me believe that the original DOC file has been
created with a wrong character set or language setting.
"Windows" - as far as I know - does not use standard locales
as all other systems do, but uses an arbitrary setting.

Another idea may be that the character that you think should
be an apostrophe isn't an apostrophe. I often see this in
German texts with misplaced apostrophes that are in fact grave
or acute accents, or a UTF-8 encoded character that just looks
like an apostrophe. For example, if the original document
contains

    We don`t

and this ` is not a real ', then conversion tools will of
course use the "escape notation" for this unknown character.

Other characters that may lead to such "escape notation"
replacements are quotation marks (usually typographical
ones), ellipses and hyphens.

I know I'm saying this too often, but you wouldn't have such
problems with LaTeX. :-)

> > I'm not sure how far conflicting codepages may be involved.
> > It is known that "Windows" does have problems supporting standards,
> > and this applies to character sets and language variations, too.
>
> your words could be emblazoned in 24k gold on some Monument
> of Truth.

It's my job - I'm working for the Ministry of Truth. :-)

> i've been fighting going from mac to OOo and back...

Keep on fighting - I've got a new idea. It's much more
complicated than using OpenOffice for conversion - but it
MIGHT work.

1. Open the DOC file in OpenOffice.
2. Mark all content you want to convert, e.g. Ctrl+A.
3. Get it into the edit buffer, Ctrl+C.
4. Open KDE's text editor (or any other text editor you
   have installed) and paste the edit buffer, Ctrl+V.
5. Save the file you now have in the editor.

It should be all in ASCII and with correct interpretation
of "special characters". Because I don't have a test setup
here, I cannot predict that it will compensate for malformed
encodings, but if OpenOffice shows a character as an
apostrophe, it should be transferred exactly as that through
the edit buffer.

> ps: antiword same as catdoc. back to my perl substitutions.
> that works, along with vi's Builtin subs.

The joy of modern programs: You start to do everything
manually again. :-)

--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
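Whichever route produces the text file, it helps to know exactly which
characters are still outside ASCII before deciding what to replace. The
Python 3 sketch below is one way to take that inventory. It assumes the
file is UTF-8, and that the trailing bytes quoted above (\x94, \x9d,
\x9c) each followed an \xe2\x80 prefix, as in the \xe2\x80\x99 example
from the original question. The function name is made up for
illustration.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text: str) -> Counter:
    """Count every character that lies outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


# Sample built from the byte values mentioned in the thread,
# under the prefix assumption stated above.
sample = b"\xe2\x80\x9cquoted\xe2\x80\x9d \xe2\x80\x94 dash".decode("utf-8")

# Print each offending character with its code point and Unicode name.
for ch, count in sorted(non_ascii_inventory(sample).items()):
    name = unicodedata.name(ch, "UNKNOWN")
    print(f"U+{ord(ch):04X} {name}: {count}")
```

Under that assumption the three byte values come out as an em dash and
a pair of typographic double quotes, which fits the list of usual
suspects above.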
Re: any shortcuts to doc to ascii?
On Thursday 27 May 2010 05:18:07 pm Bob Hall wrote:
> On Thu, May 27, 2010 at 04:36:08PM -0700, Gary Kline wrote:
> > ps: antiword same as catdoc. back to my perl substitutions.
> > that works, along with vi's Builtin subs.
>
> Have you considered using whatever replaces the most special
> characters, and fixing the few characters that remain with sed?

exactly!!!

[from pc-bsd//kmail]
Re: any shortcuts to doc to ascii?
On Thu, May 27, 2010 at 04:36:08PM -0700, Gary Kline wrote:
> ps: antiword same as catdoc. back to my perl substitutions.
> that works, along with vi's Builtin subs.

Have you considered using whatever replaces the most special characters,
and fixing the few characters that remain with sed?
Re: any shortcuts to doc to ascii?
On Thu, May 27, 2010 at 05:03:02AM +0200, Polytropon wrote:
> On Wed, 26 May 2010 18:38:47 -0700, Gary Kline wrote:
> >
> > guys,
> >
> > is there anything that can take these hex triplets such as
> >
> > We Don\xe2\x80\x99t
> >
> > and render them back to the ascii or keyboard equivalents?
> > in this case, the \x99 would be an apostrophe.
> > thus:
> >
> > We Don't
> >
> > tia,
> >
> > gary
> >
> > ps: even lynx -dump messes up, i believe. i'm trying to go from
> > DOC back to typewriter
>
> Yes, even a typewriter is better than DOC. :-)

man, you got that right!!

> To process DOC files into ASCII, there are several ways, with
> different complexity:
>
> Most complex ones: Use OpenOffice or Abiword, open the file and
> save it as ASCII. Included "special characters" should be in
> regular ASCII representation now.
>
> Better: Use (from ports) catdoc or antiword.

i don't see any ascii suffix [for OOo]. i saved as .txt.
same krap. the \x94, \x9d, \x9c... same with catdoc. i'll
try antiword. [forgot about that. ]

> I'm not sure how far conflicting codepages may be involved.
> It is known that "Windows" does have problems supporting standards,
> and this applies to character sets and language variations, too.

your words could be emblazoned in 24k gold on some Monument
of Truth. i've been fighting going from mac to OOo and back... (**)

thanks.

gary

ps: antiword same as catdoc. back to my perl substitutions.
that works, along with vi's Builtin subs.

pps:::

--
Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
The 7.83a release of Jottings: http://jottings.thought.org/index.php
http://journey.thought.org  99 44/100% Guaranteed Novel
Re: any shortcuts to doc to ascii?
On Wed, 26 May 2010 18:38:47 -0700, Gary Kline wrote:
>
> guys,
>
> is there anything that can take these hex triplets such as
>
> We Don\xe2\x80\x99t
>
> and render them back to the ascii or keyboard equivalents?
> in this case, the \x99 would be an apostrophe.
> thus:
>
> We Don't
>
> tia,
>
> gary
>
> ps: even lynx -dump messes up, i believe. i'm trying to go from
> DOC back to typewriter

Yes, even a typewriter is better than DOC. :-)

To process DOC files into ASCII, there are several ways, with
different complexity:

Most complex ones: Use OpenOffice or Abiword, open the file and
save it as ASCII. Included "special characters" should be in
regular ASCII representation now.

Better: Use (from ports) catdoc or antiword.

I'm not sure how far conflicting codepages may be involved.
It is known that "Windows" does have problems supporting standards,
and this applies to character sets and language variations, too.

--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
any shortcuts to doc to ascii?
guys,

is there anything that can take these hex triplets such as

We Don\xe2\x80\x99t

and render them back to the ascii or keyboard equivalents?
in this case, the \x99 would be an apostrophe.
thus:

We Don't

tia,

gary

ps: even lynx -dump messes up, i believe. i'm trying to go from
DOC back to typewriter

--
Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
The 7.83a release of Jottings: http://jottings.thought.org/index.php
http://journey.thought.org  99 44/100% Guaranteed Novel
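It is not clear from the question whether the file holds the raw bytes
or the literal text "\xe2\x80\x99" (a backslash, an x and two hex
digits, as some viewers print them). If it is the literal text, a sketch
along the following lines might turn the escapes back into characters.
It is Python 3, it assumes the escapes spell out UTF-8 bytes, and the
name unescape_hex is invented for the example.

```python
import re

# Matches a literal backslash, an "x" and exactly two hex digits.
HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]{2})")


def unescape_hex(literal: str) -> str:
    """Turn literal \\xNN sequences into bytes, then decode them as UTF-8."""
    raw = HEX_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]),
        literal.encode("utf-8"),
    )
    return raw.decode("utf-8")


# A raw string keeps the backslashes as ordinary characters.
escaped = r"We Don\xe2\x80\x99t"
restored = unescape_hex(escaped)

# The result still holds the typographic apostrophe (U+2019);
# map it to the ASCII one as a final step.
print(restored.replace("\u2019", "'"))
```

If the file already contains the raw bytes, this step is unnecessary
and decoding as UTF-8 is enough.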