Hi Matej,

thanks for your research :-)

> It seems that there is just one special API function for double byte  
> character sets, 6300h:
> 
> http://www.ctyme.com/intr/rb-3142.htm
> http://www.ctyme.com/intr/rb-3143.htm
> 
> It returns a table of ranges of valid DBCS leading bytes. This allows  
> applications to detect that it is reading DBCS characters as opposed to  
> ASCII or JIS X 0201 (an 8-bit encoding with ASCII in the lower half and  
> katakana in the upper half).
> 
> An application then simply uses standard DOS functions for everything, for  
> example INT 21h/AH=1 for input and INT 21h/AH=2 for output. DOS of course  
> does not supply any special string functions, so it is up to the  app...

... thanks for the example code ...

> So, what needs to be done?
> 
> 1. INT 21h/AX=6300h has to be implemented.

If that just returns a static, charset-specific table, maybe
the driver that handles charset rendering and keyboard input
methods should implement it, rather than the kernel?
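
To illustrate how small such a table is: as far as I understand
the interrupt list, the call hands back pairs of (first, last)
lead byte values, terminated by a 0,0 pair. Below is a minimal
sketch in Python, assuming the Shift-JIS lead byte ranges
81h-9Fh and E0h-FCh. The names SJIS_TABLE, parse_lead_table and
is_lead_byte are made up for this example and are not part of
any DOS API.

```python
# Hypothetical raw table as INT 21h/AX=6300h might hand it back for
# Shift-JIS: (first, last) byte pairs, terminated by a 0,0 pair.
SJIS_TABLE = bytes([0x81, 0x9F, 0xE0, 0xFC, 0x00, 0x00])

def parse_lead_table(raw):
    """Turn the raw pair table into a list of (first, last) ranges."""
    ranges = []
    for i in range(0, len(raw) - 1, 2):
        first, last = raw[i], raw[i + 1]
        if first == 0 and last == 0:   # terminator reached
            break
        ranges.append((first, last))
    return ranges

def is_lead_byte(byte, ranges):
    """True if byte starts a double byte character."""
    return any(first <= byte <= last for first, last in ranges)

RANGES = parse_lead_table(SJIS_TABLE)
```

Whoever owns the table (kernel or driver), an application only
needs the is_lead_byte style check to decide whether the next
byte belongs to the same character.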

> 2. INT 21h/AH=1 (and all other input functions) has to be modified so that  
> if a double byte character is entered, it returns the first byte and  
> remembers the second byte to return it in the next call.

I could imagine that this can also be done in the keyboard
driver, similar to what the BIOS does with function keys
which also have no ASCII equivalent and still use the BIOS
keyboard buffer I/O like everything else...
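
A minimal sketch of that idea in Python, assuming a hypothetical
driver that queues every byte of a typed character in a
byte-oriented buffer. KeyBuffer, key_typed and read_byte are
invented names that only stand in for the real keyboard buffer
and input calls.

```python
from collections import deque

class KeyBuffer:
    """Toy stand-in for a keyboard buffer that stores single bytes."""

    def __init__(self):
        self.pending = deque()

    def key_typed(self, char_bytes):
        # The (hypothetical) driver queues every byte of the character:
        # one byte for ASCII, two bytes for a double byte character.
        self.pending.extend(char_bytes)

    def read_byte(self):
        # What a byte-oriented input call would return: one byte per
        # call, or None if nothing is waiting.
        return self.pending.popleft() if self.pending else None

buf = KeyBuffer()
buf.key_typed(b"A")
buf.key_typed(b"\x82\xa0")   # one Shift-JIS character, two bytes
```

Two consecutive read_byte calls then deliver the two halves of
the double byte character, without the reading side having to
remember anything.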

> 3. INT 21h/AH=2 (and all other output functions) has to be modified so  
> that if it detects a leading byte of a double byte character, it has to  
> remember it and wait until the next call, when it gets the second byte, to  
> print the character.

Well, DOS itself cannot print in charsets beyond
1-byte-per-char, because it uses the BIOS functions, which in
turn use VGA text mode, where a font cannot hold more than
2 x 256 glyphs. So this again sounds like a job for a DRIVER,
one which uses graphics mode to render large fonts. We
already have support for Unicode fonts in a few graphical
DOS text editors and similar (thanks :-)), and whether DBCS
or UTF-8 is used, both share the same handling "anomaly":
a character can be one or more bytes long.

I see your point about avoiding printing "half double bytes",
but because the graphical output is done outside the kernel
anyway, the disadvantages of keeping int 21 function 2 and
similar DBCS-agnostic seem limited: The graphical font driver
would itself remember having seen the first half of a DBCS
character and draw the actual character as soon as it
receives the second byte.
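
Such a driver could be sketched like this in Python, assuming
Shift-JIS lead byte ranges and using Python's "shift_jis" codec
as a stand-in for the font lookup. FontDriver, put_byte and the
drawn list are hypothetical names for this example only.

```python
LEAD_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]  # assumed Shift-JIS lead bytes

class FontDriver:
    def __init__(self):
        self.lead = None     # remembered lead byte, if any
        self.drawn = []      # stands in for glyphs drawn on screen

    def put_byte(self, byte):
        if self.lead is not None:
            # Second half arrived: draw the complete character.
            pair = bytes([self.lead, byte])
            self.lead = None
            self.drawn.append(pair.decode("shift_jis", errors="replace"))
        elif any(lo <= byte <= hi for lo, hi in LEAD_RANGES):
            # First half: draw nothing yet, just remember it.
            self.lead = byte
        else:
            # Ordinary single byte character: draw it right away.
            self.drawn.append(bytes([byte]).decode("shift_jis", errors="replace"))

drv = FontDriver()
for b in b"A\x82\xa0B":      # "A", one double byte character, "B"
    drv.put_byte(b)
```

The caller keeps sending one byte per call, exactly as it would
for ASCII, and the only state lives inside the driver.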

> 4. A keyboard layout has to be made. I have no idea how keyboard
> layouts work in DOS, so I can't say much.

Well, layouts are one thing, but for DBCS input beyond
alphabets, you probably need an input method driver, which
is separate from the layout for ASCII. Normally that works
by typing short ASCII sequences, typically in a special shift
state, to select a Chinese / Japanese / Korean character.

I assume that such drivers are separately available, also
in free versions, working with any DBCS-enabled DOS system.
Imagine that for example you type Ctrl-K-A-N-X and when you
release Ctrl again, the input method sends two bytes, in
other words one DBCS character, through the DOS console
driver, saying "somebody has typed the character named
Kanji-Xen" (I invented that character). So there is not one
KEY that lets you type one "Xen" character, but a WAY to
type one.
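
As a rough illustration of such a "way to type", here is a
minimal sketch in Python of a lookup from an ASCII key sequence
to the bytes of one character. The COMPOSE table and the compose
function are hypothetical; a real input method would have a far
larger table plus a candidate selection step, and I assume
Shift-JIS as the target encoding.

```python
# Hypothetical composition table: ASCII key sequence -> character.
# Two entries are enough for a sketch.
COMPOSE = {"a": "\u3042", "ka": "\u304b"}   # hiragana A and KA

def compose(keys, encoding="shift_jis"):
    """Return the bytes to feed into the console for a typed
    sequence, or None if the sequence is not in the table."""
    char = COMPOSE.get(keys)
    return char.encode(encoding) if char is not None else None
```

For example, compose("ka") yields two bytes, which the input
method would then push through the console like any other
keyboard input.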

> 5. A font has to be made. Perhaps the GNU Unifont could be converted?

The above-mentioned editors use TrueType fonts (TTF) as far
as I remember, but at a fixed size, I believe. So that
conversion step is something people have experience with :)

> 6. Probably the hardest part: all FreeDOS packages, or at least basic ones  
> (FreeCOM, FIND, SORT, EDIT), have to be updated to support double byte  
> characters.

See above for editing. And importantly, note how much of
DOS (and its tools) does NOT have to know about the DBCS
nature of text: It makes no difference for FIND if you
search for a four letter word or for four bytes which MEAN
two DBCS characters. Among other things, this is thanks to
having a lead byte / next byte distinction and having no
upper or lower case in CJK languages, if I remember correctly.
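
The FIND case can be sketched in a few lines of Python, assuming
Shift-JIS encoded text. The function find_bytes is a made-up
stand-in for what a DBCS-unaware FIND does internally, not
FreeDOS code.

```python
def find_bytes(line, pattern):
    """Byte-wise search as a DBCS-unaware FIND would do it.
    Returns the byte offset of the first match, or -1."""
    return line.find(pattern)

# "DOS" followed by hiragana A and KA, encoded as Shift-JIS.
line = "DOS\u3042\u304b".encode("shift_jis")
pattern = "\u304b".encode("shift_jis")   # search for KA: two bytes
offset = find_bytes(line, pattern)
```

One caveat: in Shift-JIS the second byte of a character can have
the same value as a lead byte, so a purely byte-wise match can
occasionally start in the middle of a character. A tool that
wants to rule that out would have to walk the line with the lead
byte table.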

For the same reason, FreeCOM does not have to care. File
names cannot be more than 8 + 3 BYTES long, but if those
8 bytes happen to have 4 DBCS chars as content, it is the
same to FreeCOM. Only the display driver has to graphically
draw the 4 DBCS chars for you instead of 8 ASCII ones then.
Again, the lead byte trick allows the driver to recognize
whether an incoming byte is ASCII or the start of a DBCS
char. Note that the 2nd half of a DBCS char cannot be
distinguished from ASCII, if I guess correctly, so the
graphical font display driver has to remember whether the
previous byte was a non-printing DBCS lead byte (and which
one) or not. That is enough.
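
To make the file name case concrete, here is a minimal sketch in
Python that splits a byte string into characters by remembering
the previous lead byte. It assumes the Shift-JIS lead byte
ranges; split_chars is a hypothetical helper name.

```python
LEAD_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]  # assumed Shift-JIS lead bytes

def split_chars(name):
    """Split a byte string into 1-byte and 2-byte characters by
    remembering whether the previous byte was a lead byte."""
    chars, lead = [], None
    for byte in name:
        if lead is not None:
            chars.append(bytes([lead, byte]))   # complete the pair
            lead = None
        elif any(lo <= byte <= hi for lo, hi in LEAD_RANGES):
            lead = byte                         # wait for second half
        else:
            chars.append(bytes([byte]))         # plain single byte
    return chars

# An 8-byte base name holding four double byte characters.
name = b"\x82\xa0\x82\xa2\x82\xa4\x82\xa6"
```

Here split_chars(name) gives four 2-byte entries, so the display
driver draws four glyphs while FreeCOM still just sees 8 bytes.
A sequence like b"\x82\x41" shows why the state is needed: the
second byte has the value of ASCII "A" but belongs to the pair.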

SORT is a different story, but I do not know whether DOS
is supposed to include a SORT that is aware of Japanese
etc., or whether that is normally part of a separately
available package of CJK tools. Also, what license do such
packages have?

> PS: I just wrote all that and found this:
> 
> http://nokonoko365.cocolog-nifty.com/blogfile/freedos/index.html
> 
> Is that third party software for Japanese support or what?

I do not know. Note that most of my mail above is educated
guesswork. I hope it still helps to inspire this discussion.

Regards, Eric



_______________________________________________
Freedos-user mailing list
Freedos-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freedos-user
