Re: [WSG] Other character sets/languages
Gene Falck wrote:

> Do you suppose Microsoft fixed Notepad when they coded Windows XP?

Yes, it's pretty safe to assume that enhancements to Notepad do not get their own press release ...

> AFAIK, **all** my files are missing the http headers

Correct, http headers are only sent by a web server. That said, installing Apache on Windows is quite simple, as long as you have an administrator account. Download it from http://www.apache.org/dyn/closer.cgi/httpd/binaries/win32/ (choose the apache_1.3.33-win32-x86-no_src.msi file), launch the installer, supply a domain name (localhost is a safe choice) and an email address (any will do), and you are ready to go. Start the server, point your browser to http://localhost and a welcome page will appear. If you go to Apache's htdocs subdirectory, throw away its content and put your files there, refreshing your browser will display your very own index.htm. That's more or less all. Keep the installer for when you want to uninstall Apache.

To check the http headers you can download the standalone ViewHead from http://www.pc-tools.net/win32/viewhead, or install a Mozilla extension from http://livehttpheaders.mozdev.org (after installing it and restarting the browser, right-click, select View Page Info and then the Headers tab).

After a while, you'll feel ready to play with the various config options. These are stored in a text file called httpd.conf in Apache's conf subdirectory. Follow the instructions within the file, restart the server to apply the changes and have fun. Almost everything that works on Windows will work the same way on a Linux/Unix web server, so you may safely test at home before applying the changes to a production server. Should you need more instructions, a default install will put a lot of useful content at http://localhost/manual.
djn -- Dejan Kozina Dolina 346 (TS) - I-34018 Italy tel./fax: +39 040 228 436 - cell.: +39 348 7355 225 http://www.kozina.com/ - e-mail: [EMAIL PROTECTED] ** The discussion list for http://webstandardsgroup.org/ See http://webstandardsgroup.org/mail/guidelines.cfm for some hints on posting to the list getting help **
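For anyone who prefers a script to a browser extension, the header check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are made up for this example, and `fetch_content_type` presumes a server is actually running at http://localhost/ as in the Apache setup described.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


def fetch_content_type(url="http://localhost/"):
    """Ask a running web server for its Content-Type header.

    Hypothetical helper: it only works while the server is up.
    """
    with urlopen(url) as response:
        return response.headers.get("Content-Type")
```

Calling `charset_from_content_type(fetch_content_type())` against the local server should show whether a charset is being sent at all; a file opened straight from disk never reaches this code path, which is the point made in the message.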
Re: [WSG] Other character sets/languages
Hi Dejan,

You wrote:

>> I thought nothing of the fact that I have not seen such a result in IE6 and Mozilla 1.7.
>
> Mozilla 1.7.5 still proudly displays an ugly BOM, IE doesn't.

Hmm--very interesting. I have not seen any BOM effects even though I use Mozilla at home (IE6 at work), so I downloaded XVI32 and checked some of my files composed and saved in Notepad, some with ctrl-s and some using Save As, choosing the UTF-8 encoding, and have yet to find one with a BOM at the beginning. Do you suppose Microsoft fixed Notepad when they coded Windows XP?

> As long as you have a web server on your intranet it shouldn't make any difference to the browser, it's just documents coming from the network. It's files from your disk that will miss the http headers.

The setup at work was never intended to serve HTML. We have a program that runs things like payroll, work scheduling, and inventory that runs on the LAN; we also use the F:\ drive bit to share Excel and Word files. So, I can use an HTML file from a floppy disk, the C:\ drive, the F:\ drive, passed to me as an internal email attachment, or even from a flash memory unit on a USB plug-in. AFAIK, **all** my files are missing the http headers.

Regards,
Gene Falck, [EMAIL PROTECTED]
RE: [WSG] Other character sets/languages
Oops. Of course that URI should have read: http://www.w3.org/International/technique-index#language

> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Richard Ishida
> Sent: 25 February 2005 08:30
> To: wsg@webstandardsgroup.org
> Subject: RE: [WSG] Other character sets/languages
>
> John,
>
> You should indeed declare the page to be Vietnamese, and if there are English passages or phrases embedded in the file you should declare those to be English on the elements that surround them.
>
> For an explanation of this, see our new techniques index at http://localhost/International/technique-index#language (note that this allows you to drill down to 2 further levels of detail).
>
> RI

Richard Ishida
W3C contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/
Re: [WSG] Other character sets/languages
Well, http://www.w3.org/International/technique-index#language I guess.

djn

Richard Ishida wrote:

> http://localhost/International/technique-index#language
RE: [WSG] Other character sets/languages
Hello Lea,

I note that you used incorrect syntax for your CSS declarations - ending declarations with ':' rather than ';'. I assume this is just a typo in this message, rather than the potential source of the problems you had, since in a CSS file it would generally cause the declaration to fail.

RI

Richard Ishida
W3C contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lea de Groot
Sent: 21 February 2005 21:05
To: wsg@webstandardsgroup.org
Subject: RE: [WSG] Other character sets/languages

On Mon, 21 Feb 2005 09:43:40 -, Richard Ishida wrote:

> In any case you should always finish a font-family declaration with 'serif' or 'sans-serif' in this situation. Then if none of the fonts you indicated are on the user's system, a font that they do have will be used.

Caveat alert! Errr, sort of an inverse caveat, if you take this too far. I had a site where I thought 'I do not care what font this part appears in, let them choose which serif font it has' and used:

    #block {font-family: serif: }

Bad move :( Some versions of IE (some V6 variant IIRC) showed a lovely set of black square blocks instead of text. :( We checked the browser and it didn't have a bizarre selection as its default font. Changing the declaration to a simple:

    #block {font-family: Times, serif: }

fixed the problem.

FYI
Lea
--
Lea de Groot
Elysian Systems - I Understand the Internet http://elysiansystems.com/
Search Engine Optimisation, Usability, Information Architecture, Web Design
Brisbane, Australia
RE: [WSG] Other character sets/languages
On Tue, 22 Feb 2005 08:31:09 -, Richard Ishida wrote:

> I note that you used incorrect syntax for your CSS declarations - ending declarations with ':' rather than ';'. I assume this is just a typo in this message, rather than the potential source of the problems you had, since in a CSS file it would generally cause the declaration to fail.

Ah, yeah, it's a typo - I didn't cut and paste, but typed it from memory; this was a while ago :) Thanks for the pickup.

(Just between you, me, and the other 1000 members of the list, I make that typo about once per project, mostly in PHP, so I catch it fairly quickly ;))

warmly
Lea
--
Lea de Groot
Elysian Systems - I Understand the Internet http://elysiansystems.com/
Search Engine Optimisation, Usability, Information Architecture, Web Design
Brisbane, Australia
RE: [WSG] Other character sets/languages
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Dejan Kozina
> Sent: 20 February 2005 22:46
> To: wsg@webstandardsgroup.org
> Subject: Re: [WSG] Other character sets/languages
>
> More generally, inputting characters not native to my keyboard/OS is to me the most annoying part of it all (I routinely have to input central-european stuff by switching the keyboard layout, meaning I have to remember which key becomes which). If you have the luck to get your content already typed, copy/paste is much more error-proof than the alternatives.

Then you might like these pickers - designed for non-native user input. (Note that the Latin diacritics picker probably includes most of what's needed for Vietnamese.)

http://people.w3.org/rishida/scripts/pickers/

Richard Ishida
W3C contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/
RE: [WSG] Other character sets/languages
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Dejan Kozina
> Sent: 21 February 2005 04:49
>
> One thing I've just thought of. The final hurdle in letting the world see Vietnamese text is hoping that the visitor's browser has a font capable of displaying the text. There is not much you can do if it doesn't, but if it has one you should allow the browser to choose it by not declaring a font-family for that part of the page.

Most likely, people who want to read (not look at) Vietnamese text will have fonts that support the characters.

Note also that you can specify your preferred font in the CSS, but the font-family property allows you to specify more than one font for fallback support. For example, if you research the user base and discover that there are two or three Unicode fonts in common use, you can include them all.

In any case you should always finish a font-family declaration with 'serif' or 'sans-serif' in this situation. Then if none of the fonts you indicated are on the user's system, a font that they do have will be used.

eg.

    body {
        font-family: "My preferred viet font", "An alternative font", sans-serif;
        ...
    }

hth
RI

Richard Ishida
W3C contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/
RE: [WSG] Other character sets/languages
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Gene Falck
> Sent: 20 February 2005 04:26
>
> OK, I understand about the BOM but this still leaves me wondering how to save properly. I usually code using Notepad which offers, from the Save As... menu choice, the Encoding options:
>
>   ANSI
>   Unicode
>   Unicode big endian
>   UTF-8
>
> but no UTF-8 BOM. How can I be sure I am saving in the right way?

People on the list may also find the following resource useful. It indicates how to save files in UTF-8 from a number of different editing environments.

Setting encoding in web authoring applications
http://www.w3.org/International/questions/qa-setting-encoding-in-applications

Richard Ishida
W3C contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/
Re: [WSG] Other character sets/languages
Richard Ishida wrote:

> In any case you should always finish a font-family declaration with 'serif' or 'sans-serif' in this situation. Then if none of the fonts you indicated are on the user's system, a font that they do have will be used.

Good point. Lesson learned: I really shouldn't write heady stuff before sunrise and a fair serving of coffee.

What I had in mind was rather the case (admittedly rare, but it happened to me) when a non-Unicode font has the same name as a Unicode one. The culprit in my case was Georgia with CE characters, back when W2k was brand new. I made a website assuming every Georgia has the full set of Latin glyphs, while my customer had an Italian Win98 supplied with a Win-1252 Georgia... Still hate those empty squares.

Researching the user base is something I find iffy anyway. Every once in a while there is a thread trying to find a safe sequence of fonts usable both on Windows and MacOS, and it ends up with a boatload of different typefaces, plus assorted arguments about display details. Directly asking a Vietnamese designer might be more straightforward.

Anyway, my suggestion should be more correctly amended to: 'use a generic font-family and let the browser help itself, rather than risk a miss trying to overdesign the appearance'.

djn
RE: [WSG] Other character sets/languages
> Then you might like these pickers - designed for non-native user input. (Note that the Latin diacritics picker probably includes most of what's needed for Vietnamese.)
>
> http://people.w3.org/rishida/scripts/pickers/

Thanks for that, very useful. I was skeptical, Vietnamese having such a wide variety of accents, double-accents, and even accents below as well as above the letter, but I was pleasantly surprised. I think they're all there, and any set that includes the letter O with a little comma sticking out of the side plus a teeny question mark floating over the top (as seen in everyone's favourite Vietnamese word, Phở) seems to be pretty much complete.

Thanks again everyone for your help. I'll let you look at the website when it's done.

Oh and incidentally, the Vietnamese Professionals Society are the body that looks after this kind of thing, fonts, keyboard layouts and so on, and they use and recommend Unicode here: http://www.vps.org/rubrique.php3?id_rubrique=91 so they're solidly on board with standards too.

Have You Validated Your Code?
John Horner (+612 / 02) 9333 3488
Senior Developer, ABC Online http://www.abc.net.au/
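The "O with a comma on the side and a question mark on top" is a single Unicode character, which can be inspected with a short Python sketch. The variable names are illustrative only; the character itself is U+1EDF.

```python
import unicodedata

pho = "ph\u1edf"          # the word written with the precomposed character
last = pho[-1]

# Official Unicode name of the precomposed letter.
name = unicodedata.name(last)

# Canonical decomposition: base letter, then horn, then hook above.
decomposed = unicodedata.normalize("NFD", last)
codepoints = [f"U+{ord(ch):04X}" for ch in decomposed]

# The same letter as it would be stored in a UTF-8 file.
utf8_bytes = last.encode("utf-8")

print(name)
print(codepoints)
print(utf8_bytes)
```

The decomposition shows why a picker covering Latin base letters plus combining marks goes a long way for Vietnamese: the "comma" is the combining horn and the "question mark" is the combining hook above.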
RE: [WSG] Other character sets/languages
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Gene Falck
> Sent: 20 February 2005 04:26
>
> OK, I understand about the BOM but this still leaves me wondering how to save properly. I usually code using Notepad which offers, from the Save As... menu choice, the Encoding options:
>
>   ANSI
>   Unicode
>   Unicode big endian
>   UTF-8
>
> but no UTF-8 BOM. How can I be sure I am saving in the right way?

I think you need to use a different editor, or (as I do) strip the BOM off before publishing.

You may also find the following article useful. It explains the BOM and the effects it can sometimes have on pages when present:

http://www.w3.org/International/questions/qa-utf8-bom
FAQ: Unexpected characters or blank lines

Here is the code of a Perl script I use to strip the BOM. It's just a quick hack, nothing beautiful, but it may help you or others when you cannot avoid saving with a BOM. (I call it by invoking a batch file in my Windows directory: removebom filename.)

===
# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)

if ($#ARGV > 0) {
    print STDERR "Too many arguments!\n";
    exit;
}

my @file;          # file content
my $lineno = 0;

my $filename = $ARGV[0];
if ($filename) {
    # read the whole file, stripping the BOM from the first line only
    open( BOMFILE, $filename ) || die "Could not open source file for reading.";
    while (<BOMFILE>) {
        if ($lineno++ == 0) {
            # the UTF-8 BOM is the byte sequence EF BB BF
            if ( index( $_, "\xEF\xBB\xBF" ) == 0 ) {
                s/^\xEF\xBB\xBF//;
                print "BOM found and removed.\n";
            }
            else { print "No BOM found.\n"; }
        }
        push @file, $_ ;
    }
    close (BOMFILE) || die "Can't close source file after reading.";

    # overwrite the same file with the cleaned content
    open (NOBOMFILE, ">$filename") || die "Could not open source file for writing.";
    foreach $line (@file) {
        print NOBOMFILE $line;
    }
    close (NOBOMFILE) || die "Can't close source file after writing.";
}
else {  # STDIN -> STDOUT
    while (<STDIN>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;
        }
        push @file, $_ ;
    }
    foreach $line (@file) {
        print $line;
    }
}
===

HTH
RI

Richard Ishida
W3C contact info: http://www.w3.org/People/Ishida/
W3C Internationalization: http://www.w3.org/International/
Publication blog: http://people.w3.org/rishida/blog/
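For readers without Perl to hand, the same idea can be sketched in Python. This is not a port of the script above, just a minimal equivalent; the function names are hypothetical, and the file is handled as raw bytes so that line endings are left untouched.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_place(path) -> bool:
    """Rewrite the file at path without its BOM; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Only a BOM at the very start of the data is removed; the same three bytes appearing later in the file are left alone.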
RE: [WSG] Other character sets/languages
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Gene Falck
> Sent: 20 February 2005 04:26
>
> In this matter, I am also wondering where using a meta tag specifying iso-8859-1 fits in terms of following the standards. I notice many people do this and I gather the actual coding of keystrokes (on a standard PC keyboard set up for US English) should be the same. Is saving a file as UTF-8 compatible with the iso-8859-1 meta tag?

Nope. Please save the file in the same encoding as you declare it to be in the meta statement. This seems to be such a common question/mistake that the W3C is beginning to write an article on the subject.

The basic ASCII set of characters (ie. the first 128 characters) use the same bytes in iso-8859-1 and utf-8, but as soon as you include a copyright sign, an accented character, etc, you will have problems. Besides which, it is always better to be consistent anyway, and it doesn't cost much.

hth
RI
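The mismatch described here is easy to reproduce. The following Python sketch (the `misread` helper is invented for this illustration) encodes text one way and decodes it the other, which is roughly what a browser does when the saved encoding and the declared one disagree.

```python
def misread(text, actual="utf-8", declared="iso-8859-1"):
    """Encode text as `actual`, then decode the bytes as `declared`."""
    return text.encode(actual).decode(declared)


# Plain ASCII uses identical bytes in both encodings, so it survives.
plain = misread("Hello")

# A copyright sign is two bytes in UTF-8 and shows up as two characters.
copyright_sign = misread("\u00a9")

# The same happens to an accented letter such as e-acute.
e_acute = misread("\u00e9")

print(plain, copyright_sign, e_acute)
```

A page containing only ASCII hides the problem entirely, which may be why the mistake is so common.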
Re: [WSG] Other character sets/languages
> I usually code using Notepad

Better use something like PSPad, which offers you the choice not to include these ident. bytes.

> file as UTF-8 compatible with the iso-8859-1 meta tag?

Eh, nope. If you start using non-ASCII characters (curly quotes etc.) it would break the page...

--
Jan Brasna aka JohnyB :: alphanumeric.cz | janbrasna.com
Stop IE! - http://www.stopie.com/ | http://browsehappy.com/
Re: [WSG] Other character sets/languages
Gene wrote:

> I usually code using Notepad which offers, from the Save As... menu choice, the Encoding options:

I'm not really sure, as the Notepad I got with Win98 doesn't offer anything but 'text file' and 'all files' (Win98 doesn't do Unicode). What you can try is to save the page as utf-8, open it in Mozilla/Firefox and check the very first characters displayed. If there is no strange character there you know it's OK. I just tried the same trick with good old Wordpad (which has a Unicode option even with W98) and it saved my test file without the BOM.

> Is saving a file as UTF-8 compatible with the iso-8859-1 meta tag?

I'm not sure why you would want to do this, but here goes some reasoning on general principles. As long as the file is saved as utf-8 it contains the correctly encoded content and it's up to the browser to display it accordingly. Now, the primary source of encoding declaration for the browser is the HTTP header sent by the server along with the document (this is the .htaccess stuff I mentioned), which should override every other directive, including the meta declaration. Thus, the browser should choose the correct encoding and display both the English and the Vietnamese text.

I don't recall anybody really testing browsers with that stuff, so you may be in for unexpected results here: if the browser ignores the rule and chooses to believe the meta directive instead of the header, it would display the English part correctly, but the Vietnamese one would be a sequence of empty squares, question marks and/or best-guess ISO-8859-1 characters (two for every Unicode one). As with too many things web-related, 'should' is an iffy thing to rely upon.

Moreover, if somebody saves that page to disk and looks at it later, the only source of encoding information would be the meta stuff, with the same result as above...

More generally, inputting characters not native to my keyboard/OS is to me the most annoying part of it all (I routinely have to input central-european stuff by switching the keyboard layout, meaning I have to remember which key becomes which). If you have the luck to get your content already typed, copy/paste is much more error-proof than the alternatives.

djn
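The precedence rule described in this message (HTTP header first, meta declaration second) can be written down as a tiny Python sketch. It is a simplification under stated assumptions: the function name and the `windows-1252` fallback are choices made for this example, and real browsers apply further heuristics that are not modelled here.

```python
def effective_encoding(http_charset=None, meta_charset=None, fallback="windows-1252"):
    """Pick the encoding a well-behaved browser should use.

    The charset from the HTTP Content-Type header wins over the meta
    declaration; the fallback is an assumed default for illustration.
    """
    for candidate in (http_charset, meta_charset):
        if candidate:
            return candidate.strip().lower()
    return fallback


# Served over HTTP with a header: the conflicting meta tag is overridden.
served = effective_encoding(http_charset="UTF-8", meta_charset="iso-8859-1")

# Opened from disk: no header exists, so only the meta tag is left.
from_disk = effective_encoding(meta_charset="iso-8859-1")

print(served, from_disk)
```

The second case is the saved-to-disk scenario from the message: with the header gone, a wrong meta declaration is the only information the browser has.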
Re: [WSG] Other character sets/languages
Hi Dejan,

You wrote:

> I'm not really sure, as the Notepad I got with Win98 doesn't offer anything but 'text file' and 'all files'

Hmm. I didn't think about different versions of Windows. On my Windows XP, text file and all files are the choices for Save as type: and the chance to select the Encoding: is next below that. (The bottom of the Save As... dialog box is partly off screen at the bottom until I drag it up a bit.)

> ... save the page as utf-8, open it in Mozilla/Firefox and check the very first characters displayed. If there is no strange character there you know it's OK.

I have heard of this but also read (somewhere) that later browsers from IE6 on have been fixed to not display characters from trying to show the BOM; as a result I thought nothing of the fact that I have not seen such a result in IE6 and Mozilla 1.7.

>> Is saving a file as UTF-8 compatible with the iso-8859-1 meta tag?
>
> I'm not sure why you would want to do this,

No reason, except that answers given on [WSG] concerning the meta tag often show iso-8859-1 and this thread on file encoding is aimed at UTF-8. I strongly suspected that the meta declaration and the file encoding should agree.

> ... some reasoning on general principles. As long as the file is saved as utf-8 it contains the correctly encoded content and it's up to the browser to display it accordingly. Now, the primary source of encoding declaration for the browser is the HTTP header sent by the server along with the document (this is the .htaccess stuff I mentioned), which should override every other directive, including the meta declaration.

All of my efforts, so far, are stand-alone and intranet applications, so I don't know what to expect from actually having the file on a true server situation accessed from the Internet. Obviously, the fact that what I have been doing works locally does not mean everything is OK as to standards compliance.

> Thus, the browser should choose the correct encoding and display both the English and the Vietnamese text. ... in for unexpected results here: if the browser ignores the rule and chooses to believe the meta directive instead of the header, it would display the English part correctly, but the Vietnamese one would be a sequence of empty squares, question marks and/or best-guess ISO-8859-1 characters ...

Urk! Fortunately, my files are English-language with a few &#...; codes for proper typographic punctuation and some characters in names coming from foreign languages, all typed on a US English keyboard. Nevertheless I assume my not complying with standards would, sooner or later, lead to some hard-to-untangle problems.

> Moreover, if somebody saves that page to disk and looks at it later, the only source of encoding information would be the meta stuff, ...

Well, provided the browser doesn't cover up the problem as it does part of the time! LOL.

My thanks to all who have contributed to my angle on this thread--the how-to of getting the files right seems to have very little in the line of resources, unless, as I suggested, I just don't search the right terms.

Regards,
Gene Falck [EMAIL PROTECTED]
Re: [WSG] Other character sets/languages
Hi Gene,

You wrote:

> the chance to select the Encoding: is next below that

True. Windows started using Unicode as of Win2K. I was surprised indeed to find the Unicode option in Win98's Wordpad. I was surprised again today when opening in Unired a file saved as 'Unicode text' with Wordpad. Unired said it was not utf-8, it was utf-16 (Little Endian) instead, so sending it as utf-8 would be incorrect, even if Mozilla seemed not to care that much.

> I thought nothing of the fact that I have not seen such a result in IE6 and Mozilla 1.7.

Mozilla 1.7.5 still proudly displays an ugly BOM, IE doesn't.

> All of my efforts, so far, are stand-alone and intranet applications, so I don't know what to expect from actually having the file on a true server situation accessed from the Internet.

As long as you have a web server on your intranet it shouldn't make any difference to the browser, it's just documents coming from the network. It's files from your disk that will miss the http headers.

> Urk! Fortunately, my files are English-language with a few &#...; codes for proper typographic punctuation and some characters in names

This works, but after a few characters it just becomes tiring ...

One thing I've just thought of. The final hurdle in letting the world see Vietnamese text is hoping that the visitor's browser has a font capable of displaying the text. There is not much you can do if it doesn't, but if it has one you should allow the browser to choose it by not declaring a font-family for that part of the page.

djn
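What Unired reported for the Wordpad file can be approximated by looking at the first bytes of the file. The sketch below is a rough Python illustration with a made-up function name; it only recognises a BOM, so a file saved without one returns None and would need a different test.

```python
import codecs

# UTF-32 is listed first because its little-endian BOM begins with the
# same two bytes (FF FE) as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None
```

For a file on disk, something like `sniff_bom(open(path, "rb").read(4))` is enough, since no BOM is longer than four bytes.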
Re: [WSG] Other character sets/languages
woric wrote:

> Choose charset UTF-8 (not UTF-8 BOM) when saving.
> Can you explain the difference?

In other words, the BOM is a funny character Unicode uses as the very first char in some of its encoding forms to declare which byte is which when characters are composed of more than 1 byte. As stated by the Unicode consortium itself, utf-8 does not need this, so the mark can be safely left out when creating a utf-8 document (you can even delete it from an existing document without consequences). Using the BOM in a utf-8 webpage would have two unhappy outcomes: Gecko-based browsers would display the thing (not something you'd usually like), and IE would render the page in Quirks mode (as with every other character coming before the Doctype declaration).

The second point is really related to the document language, not the character encoding. Declaring it properly (with <html lang="en"> and <div lang="vi">) should help screen-readers read each part of the page with the correct pronunciation and search engines recognize the content language (eg. every localized Google has an option to search only documents in its native language).

djn
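To see which language declarations a page actually carries, the markup can be scanned for lang attributes. This Python sketch uses the standard library's HTMLParser; the class and function names are invented for the example, and it reports only explicit declarations, not the language inherited by child elements.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect (tag, lang) pairs for every start tag that declares a language."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang":
                self.found.append((tag, value))


def declared_languages(html: str):
    """Return the explicit lang declarations in document order."""
    parser = LangCollector()
    parser.feed(html)
    parser.close()
    return parser.found
```

Feeding it a page whose root element is English and which wraps the Vietnamese passage in a div should yield one entry for each, in source order.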
Re: [WSG] Other character sets/languages
Hi Dejan,

You wrote:

> woric wrote:
>
>> Choose charset UTF-8 (not UTF-8 BOM) when saving.
>> Can you explain the difference?
>
> In other words, the BOM is a funny character Unicode uses as the very first char in some of its encoding forms to declare which byte is which when characters are composed of more than 1 byte. As stated by the Unicode consortium itself, utf-8 does not need this, so the mark can be safely left out when creating a utf-8 document (you can even delete it from an existing document without consequences). Using the BOM in a utf-8 webpage would have two unhappy outcomes: Gecko-based browsers would display the thing (not something you'd usually like), and IE would render the page in Quirks mode (as with every other character coming before the Doctype declaration).

OK, I understand about the BOM but this still leaves me wondering how to save properly. I usually code using Notepad which offers, from the Save As... menu choice, the Encoding options:

  ANSI
  Unicode
  Unicode big endian
  UTF-8

but no UTF-8 BOM. How can I be sure I am saving in the right way?

In this matter, I am also wondering where using a meta tag specifying iso-8859-1 fits in terms of following the standards. I notice many people do this and I gather the actual coding of keystrokes (on a standard PC keyboard set up for US English) should be the same. Is saving a file as UTF-8 compatible with the iso-8859-1 meta tag?

I have been checking in search engines and looking around in our [WSG] list resources, but I have concluded that I have no idea what to call my questions.

Regards,
Gene Falck [EMAIL PROTECTED]
Re: [WSG] Other character sets/languages
Thanks very much for that, Dejan.

> Choose charset UTF-8 (not UTF-8 BOM) when saving.

Can you explain the difference?

> Don't forget to mark up properly the Vietnamese content with <div lang="vi"> or such...

Now the one easy thing about this project is that Vietnamese already contains all the unaccented roman letters. So I can set the whole page to be Vietnamese, I guess, and it won't stop the English being English... Or would that cause a problem?

Thanks again,

Have You Validated Your Code?
John Horner (+612 / 02) 9333 3488
Senior Developer, ABC Online http://www.abc.net.au/
Re: [WSG] Other character sets/languages
>> Choose charset UTF-8 (not UTF-8 BOM) when saving.
>
> Can you explain the difference?

Hi John, yes I'd be glad to explain the difference. When saving as Unicode, a Byte Order Mark (or BOM) can be added at the start of the file to signal which Unicode encoding form, and which byte order, the file uses. The bad news is that the BOM may make the file unreadable to applications which are not Unicode aware; so when saving UTF-8 you should only add a BOM if you know the application that will open the file can handle it. See http://www.unicode.org/faq/utf_bom.html#BOM for more details.

woric
[WSG] Other character sets/languages
This is kind of embarrassing to admit, but for the very first time, I've undertaken to code a page (partially) in another language, and in another character set too, and I don't really know how to do it properly.

And it's not just a matter of a few accents here and there -- the language is Vietnamese, which has all kinds of interesting double-diacritics and things like a crossed-out letter D (<strike>D</strike> would approximate it).

So, where to start? The standards way to do it these days is with Unicode, right? In the old days we would have used one of the three different Vietnamese encodings -- TCVN, VPS or VISCII are what Firefox offers me -- but now Unicode should have done away with that stuff?

So, do I code the page in UTF-8? I don't use a special Vietnamese encoding?

And, no matter what you guys tell me, as I don't read the language, someone else will supply me with the text, and I can only pray it's from a Unicode-compliant source?

I tried to educate myself about Unicode by reading Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" http://www.joelonsoftware.com/articles/Unicode.html which was very entertaining, but I'm not sure I got it or I wouldn't be asking...

Have You Validated Your Code?
John Horner (+612 / 02) 9333 3488
Senior Developer, ABC Online http://www.abc.net.au/
Re: [WSG] Other character sets/languages
Hi John,

Unicode is today the most foolproof way of sending internationalized characters to modern browsers. I use Unired for the purpose: http://www.esperanto.mv.ru/UniRed/ENG/ It's free and it works fine to boot. You should be able to copy/paste into your HTML from Word, PDF and anything that can display Vietnamese characters. Choose charset UTF-8 (not UTF-8 BOM) when saving.

Next you need to tell the browser about the encoding. The standards-compliant way is to use http headers. On Apache just add a line with 'AddDefaultCharset utf-8' to your .htaccess. Not sure about other kinds of server. Just to be safe, put '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' into the head of the document (as early in the source as possible).

Don't forget to mark up properly the Vietnamese content with <div lang="vi"> or such...

Well, that's more or less all.

djn

John Horner wrote:

> So, do I code the page in UTF-8? I don't use a special Vietnamese encoding?
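A quick sanity check that the saved bytes agree with the meta declaration could look like the Python sketch below. Both function names are hypothetical, and the regular expression is a deliberately simple approximation that looks only at the first kilobyte. The check works well for a utf-8 declaration; it cannot catch a UTF-8 file mislabelled as iso-8859-1, because every byte sequence is decodable as iso-8859-1.

```python
import re

# Simplified pattern: finds charset=... inside a meta tag.
_META = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_charset(html_bytes: bytes):
    """Return the charset named in a meta tag near the top of the page, or None."""
    match = _META.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declaration_matches_bytes(html_bytes: bytes) -> bool:
    """True if the page declares a charset and its bytes decode cleanly in it."""
    charset = declared_charset(html_bytes)
    if charset is None:
        return False
    try:
        html_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return False
    return True
```

Running `declaration_matches_bytes(open("index.htm", "rb").read())` before uploading would flag a page declared as utf-8 but saved from an editor in a legacy encoding.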