On 26 Jun 00 at 12:23, owner-arachne-digest@arachne wrote:

> Date: Sat, 24 Jun 2000 17:22:17 +0000
> From: "Bastiaan Edelman" <[EMAIL PROTECTED]>
> Subject: Re: Viewing RTF and other stuff
>
> > So what I have learned to live with is deformatting everything to
> > ASCII text first, then do the codepage conversion and finally
> > reformat the texts manually again. I cannot say that I am happy with
> > this. What about setting up a new World Championship of Converting
> > Formatted Documents while Preserving Cultural Diversity of Eastern
> > and Western Europe?
>
> I made a conversion program (in BASIC) to speed up the conversion.
> First I convert everything to 8-bit ASCII... all the information is
> still in the text file but often not in the wanted characters.
> Then I start the BASIC conversion program that has a list of characters
> to convert into any other wanted character.

My point was not really the conversion problem. As I have never learned
any serious programming language, I am using Basic programs, too. It is a
bit slow, but it works. To achieve a really accurate translation from one
8-bit codepage to another, I use the 16-bit Unicode standard as an
intermediate. Translation tables to and from Unicode can be found at
<ftp://ftp.unicode.org:21/Public/>, and an explanation at
<http://charts.unicode.org/>.

What I still have to do manually, and what I do not like at all, is the
reformatting of documents. Even if I had HTML or RTF with perfect format
(and logical structure!) before, after the conversion procedure I have to
take the ASCII plain text and start looking for chapter headlines from
the very beginning again. This is what I would like to improve.
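
The idea of using Unicode as an intermediate between two 8-bit codepages can be sketched in a few lines. This is only an illustration in Python, not the Basic program mentioned above. The function name convert_codepage, the sample sentence and the choice of code pages 852 (DOS, Central European) and 1250 (Windows, Central European) are assumptions made for the example.

```python
def convert_codepage(data: bytes, source: str, target: str) -> bytes:
    """Translate 8-bit text from one code page to another via Unicode."""
    # Step 1: source code page -> Unicode (a Python str is Unicode text).
    text = data.decode(source)
    # Step 2: Unicode -> target code page. Characters that do not exist in
    # the target are replaced by '?' instead of aborting the conversion.
    return text.encode(target, errors="replace")


# A Czech sample as it would be stored under DOS code page 852.
dos_bytes = "Příliš žluťoučký kůň".encode("cp852")

# The same text re-encoded for Windows code page 1250.
win_bytes = convert_codepage(dos_bytes, "cp852", "cp1250")
print(win_bytes.decode("cp1250"))
```

Note that this only translates character codes. It does nothing to keep formatting, which is exactly the part that still has to be redone by hand.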

Let me try to summarize my experiences:

From          Text           To            Method
------------  -------------  ------------  --------------------------------
Word for DOS  West European  Word for DOS  evident file structure; simple
              (only)                       home-made conversion utilities
Word for WIN  West European  Word for DOS  only through ASCII with VIEW.EXE
                                           or CatDoc (not without problems)
Word for WIN  East European  Word for DOS  only through ASCII with CatDoc
                                           (not without problems)
RTF           West European  Word for DOS  format directly supported
RTF           East European  Word for DOS  ???
HTML          West European  Word for DOS  through RTF by utilities
                                           MARTHA/ISHTAR
HTML          East European  Word for DOS  through RTF ???
                                           (but DOX does not work) (1)
RTF           West European  HTML          MARTHA/ISHTAR, DOX, R2H
RTF           East European  HTML          R2H utility (DOS version),
                                           DOX (not without problems)
HTML          West European  RTF           MARTHA/ISHTAR
HTML          East European  RTF           ??? (DOX does not work) (1)
Word for DOS  West European  RTF           format directly supported
Word for DOS  East European  RTF           ???
Word for DOS  West European  HTML          through RTF by utilities
                                           MARTHA/ISHTAR
Word for DOS  East European  HTML          through RTF by utilities
                                           R2H or DOX

Of course this table is not complete. For the HTML format it matters a
great deal whether diacritical characters are expressed in 8-bit code,
as so-called entities (e.g. &uuml;; BTW: are they already standardized
for Eastern Europe?) or as code numbers (e.g. &#NNN;).

Word for DOS is my preferred editor, because it supports style sheets,
RTF output and logical structuring of documents. But the reformatting
problem is more general, as I wanted to indicate by mentioning HTML.
This is why I am posting it here again.

(1) Annotation: I found a shareware program, DOX
<http://users.hunterlink.net.au/~mabatp/>. Perhaps it can be helpful in
some of the cases I marked with question marks. I cannot get it to run
in the direction from HTML to RTF. Is this a feature limited to the
registered version? I am not expecting to find a ready-made freeware
utility for everything on the web.
But, not being a programmer, I would be grateful for an explanation of
the structure of those East European RTF files.

> This works very nice... except for some characters that are NOT in the
> WORD character set.
> Then WORD gives a large string which means: this character can be found
> in the special WORD codepage #x and you have to get character #y on that
> codepage.
> An example is the 'omega' sign, much used in electronics to denote
> 'resistance' and also in the Greek alphabet.
> In code page 437 it is #234 but Micro$oft gives something like the
> string: "}{\f0\fs22 {\field{\*\fldinst SYMBOL 87 \\f "Symbol" \\s
> 11}{\fldrslt\f3\fs22}}}{\f0\fs22"

I have no idea what this can be. It reminds me of the language of
Word's printer driver editor...

Greetings
Christof Lange - Prague
[EMAIL PROTECTED]
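
For what it is worth, the quoted string looks like a Word SYMBOL field written into RTF: it asks for character number 87 from the font "Symbol" at 11 points, and position 87 (the letter W) in that font holds the capital omega. Accented letters in RTF are normally written as \'hh hex escapes, to be read in the code page the file declares with \ansicpg. The following is a hedged Python sketch of both ideas. The function names, the tiny SYMBOL_FONT table and the sample strings are made up for illustration, and files that declare their character set only per font (via \fcharset) are not handled.

```python
import re
from typing import Optional

# Partial, hand-typed excerpt of the Symbol font layout (character code ->
# Unicode). Illustrative only; the real font defines many more positions.
SYMBOL_FONT = {
    87: "\u03a9",   # position of 'W': capital omega
    97: "\u03b1",   # position of 'a': alpha
    109: "\u03bc",  # position of 'm': mu
}


def decode_symbol_field(rtf: str) -> Optional[str]:
    """Return the character named by a Word SYMBOL field, if known."""
    # In RTF a literal backslash is written doubled, so the field switch
    # for the font name shows up as two backslashes followed by "f".
    match = re.search(r'SYMBOL\s+(\d+)\s+\\\\f\s+"([^"]+)"', rtf)
    if match is None:
        return None
    code, font = int(match.group(1)), match.group(2)
    if font.lower() != "symbol":
        return None  # other fonts are outside this sketch
    return SYMBOL_FONT.get(code)


def decode_hex_escapes(rtf: str, default_codepage: int = 1252) -> str:
    """Replace RTF hex escapes with the characters they denote."""
    # Use the code page declared in the header, e.g. 1250 for Central
    # European text; fall back to the default if none is declared.
    declared = re.search(r"\\ansicpg(\d+)", rtf)
    codepage = declared.group(1) if declared else default_codepage
    codec = f"cp{codepage}"

    def to_char(match):
        # One escape is one byte; decode it in the declared code page.
        return bytes([int(match.group(1), 16)]).decode(codec, errors="replace")

    return re.sub(r"\\'([0-9a-fA-F]{2})", to_char, rtf)


# The field from the quoted message (outer braces trimmed).
field = r'{\field{\*\fldinst SYMBOL 87 \\f "Symbol" \\s 11}{\fldrslt\f3\fs22}}'
print(decode_symbol_field(field))

# A made-up Central European RTF fragment with three escaped letters.
sample = r"{\rtf1\ansi\ansicpg1250 P\'f8\'edli\'9a}"
print(decode_hex_escapes(sample))
```

Because the escapes are resolved in place and the control words are left alone, the formatting information in the RTF survives this kind of conversion, unlike a detour through plain ASCII.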
