Re: [WSG] Other character sets/languages

2005-02-28 Thread Dejan Kozina
Gene Falck wrote:
> Do you suppose Microsoft fixed Notepad when they
> coded Windows XP?

Yes, it's pretty safe to assume that enhancements to Notepad do not get 
their own press release ...

> AFAIK, **all** my files
> are missing the http headers

Correct, HTTP headers are only sent by a web server. That said, 
installing Apache on Windows is quite simple, as long as you have an 
administrator account. Download it from 
http://www.apache.org/dyn/closer.cgi/httpd/binaries/win32/ (choose the 
apache_1.3.33-win32-x86-no_src.msi file), launch the installer, supply a 
domain name (localhost is a safe choice) and any email address, and you 
are ready to go. Start the server, point your browser to 
http://localhost and a welcome page will appear. Go to Apache's htdocs 
subdirectory, throw away its contents and put your own files there; 
refreshing the browser will then display your very own index.htm. That's 
more or less all. Keep the installer for when you want to uninstall 
Apache.

To check the HTTP headers you can download the standalone ViewHead from 
http://www.pc-tools.net/win32/viewhead, or install a Mozilla extension 
from http://livehttpheaders.mozdev.org (after installing and restarting 
the browser, right-click, select View Page Info and then the Headers tab).
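
If you would rather script the check than install a viewer, something along these lines might do. It is only a sketch in Python, not one of the tools named above; the function names are made up for illustration, and it assumes a server is answering at whatever URL you pass in (http://localhost/ in the Apache setup described here).

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def show_content_type(url):
    """Fetch url and print the Content-Type header the server sent."""
    with urlopen(url) as response:
        value = response.headers.get("Content-Type", "")
    print(url, "->", value, "| charset:", charset_from_content_type(value))


# Usage, assuming a local server is running:
#   show_content_type("http://localhost/")
```

A file opened straight from disk never goes through this path, which is why it has no headers to show.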

After a while, you'll feel ready to play with the various config 
options. These are stored in a text file called httpd.conf in Apache's 
conf subdirectory. Follow the instructions within the file, restart the 
server to apply the changes and have fun. Almost everything that works 
on Windows will work the same way on a Linux/Unix web server, so you can 
safely test at home before applying changes to a production server.

Should you need more instructions, a default install will put a lot of 
useful content at http://localhost/manual.

djn
--
Dejan Kozina
Dolina 346 (TS) - I-34018 Italy
tel./fax: +39 040 228 436 - cell.: +39 348 7355 225
http://www.kozina.com/  - e-mail: [EMAIL PROTECTED]
**
The discussion list for  http://webstandardsgroup.org/
See http://webstandardsgroup.org/mail/guidelines.cfm
for some hints on posting to the list & getting help
**


Re: [WSG] Other character sets/languages

2005-02-27 Thread Gene Falck
Hi Dejan,

You wrote:
>> I thought nothing of the fact that I have
>> not seen such a result in IE6 and Mozilla 1.7.
>
> Mozilla 1.7.5 still proudly displays an ugly BOM,
> IE doesn't.

Hmm--very interesting. I have not seen any BOM
effects even though I use Mozilla at home (IE6
at work) so I downloaded XVI32 and checked some
of my files composed and saved in Notepad, some
with ctrl-s and some using Save as, choosing the
UTF-8 encoding, and have yet to find one with a
BOM at the beginning.

Do you suppose Microsoft fixed Notepad when they
coded Windows XP?

> As long as you have a web server on your intranet
> it shouldn't make any difference to the browser,
> it's just documents coming from the network. It's
> files from your disk that will miss the http headers.

The setup at work was never intended to serve
HTML. We have a program that runs things like
payroll, work scheduling, and inventory that
runs on the LAN; we also use the F:\ drive bit
to share Excel and Word files. So, I can use
an HTML file from a floppy disk, the C:\ drive,
the F:\ drive, passed to me as an internal
email attachment, or even from a flash memory
unit on a USB plug in. AFAIK, **all** my files
are missing the http headers.

Regards,
Gene Falck,
[EMAIL PROTECTED]


RE: [WSG] Other character sets/languages

2005-02-25 Thread Richard Ishida
Oops. Of course that URI should have read:

http://www.w3.org/International/technique-index#language


 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Richard Ishida
 Sent: 25 February 2005 08:30
 To: wsg@webstandardsgroup.org
 Subject: RE: [WSG] Other character sets/languages
 
 John,
 
 You should indeed declare the page to be Vietnamese, and if 
 there are English passages or phrases embedded in the file 
 you should declare those to be English on the elements that 
 surround them.
 
 For an explanation of this, see our new techniques index at 
 http://localhost/International/technique-index#language (note 
 that this allows you to drill down to 2 further levels of detail).
 
 RI
 
 
 
 Richard Ishida
 W3C
 
 contact info:
 http://www.w3.org/People/Ishida/ 
 
 W3C Internationalization:
 http://www.w3.org/International/ 
 
 Publication blog:
 http://people.w3.org/rishida/blog/




Re: [WSG] Other character sets/languages

2005-02-25 Thread Dejan Kozina
Well, http://www.w3.org/International/technique-index#language I guess.
djn
Richard Ishida wrote:
http://localhost/International/technique-index#language 
begin:vcard
fn:Dejan Kozina
n:Kozina;Dejan
org:Dejan Kozina Web Design Studio
adr:;;Dolina 346;Dolina;TS;I-34018;Italy
email;internet:[EMAIL PROTECTED]
tel;work:+39 348 7355 225
tel;fax:+39 040 228 436
tel;cell:+39 348 7355 225
x-mozilla-html:TRUE
url:http://www.kozina.com/
version:2.1
end:vcard



RE: [WSG] Other character sets/languages

2005-02-22 Thread Richard Ishida
Hello Lea,

I note that you used incorrect syntax for your CSS declarations - ending
declarations with ':' rather than ';'.  I assume this is just a typo in this
message, rather than the potential source of the problems you had, since in
a CSS file it would generally cause the declaration to fail.

RI



Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 
 

 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Lea de Groot
 Sent: 21 February 2005 21:05
 To: wsg@webstandardsgroup.org
 Subject: RE: [WSG] Other character sets/languages
 
 On Mon, 21 Feb 2005 09:43:40 -, Richard Ishida wrote:
  In any case you should always finish a font-family declaration with
  'serif' or 'sans-serif' in this situation.  Then if none of the fonts
  you indicated are on the user's system, a font that they do have will
  be used.
 
 Caveat alert!
 Errr, sort of an inverse caveat, if you take this too far.
 I had a site where I thought 'I do not care what font this 
 part appears in, let them choose which serif font it has' and used:
 #block {font-family: serif: }
 Bad move :(
 Some versions of IE (some V6 variant IIRC) showed a lovely 
 set of black square blocks instead of text. :( We checked the 
 browser and it didn't have a bizarre selection as its default font.
 Changing the declaration to a simple:
 #block {font-family: Times, serif: }
 fixed the problem.
 
 FYI
 Lea
 --
 Lea de Groot
 Elysian Systems - I Understand the Internet 
 http://elysiansystems.com/ Search Engine Optimisation, 
 Usability, Information Architecture, Web Design Brisbane, Australia
 




RE: [WSG] Other character sets/languages

2005-02-22 Thread Lea de Groot
On Tue, 22 Feb 2005 08:31:09 -, Richard Ishida wrote:
 I note that you used incorrect syntax for your CSS declarations - ending
 declarations with ':' rather than ';'.  I assume this is just a typo in this
 message, rather than the potential source of the problems you had, since in
 a CSS file it would generally cause the declaration to fail.

Ah, yeah, it's a typo - I didn't cut and paste, but typed it from 
memory; this was a while ago :)
Thanks for the pickup.
(Just between you, me, and the other 1000 members of the list, I make 
that typo about once per project, mostly in PHP, so I catch it fairly 
quickly ;))

warmly
Lea
-- 
Lea de Groot
Elysian Systems - I Understand the Internet http://elysiansystems.com/
Search Engine Optimisation, Usability, Information Architecture, Web 
Design
Brisbane, Australia



RE: [WSG] Other character sets/languages

2005-02-21 Thread Richard Ishida
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Dejan Kozina
 Sent: 20 February 2005 22:46
 To: wsg@webstandardsgroup.org
 Subject: Re: [WSG] Other character sets/languages
 

 More generally, inputting characters not native to my 
 keyboard/OS is to me the most annoying part of it all (I 
 routinely have to input Central European stuff by switching 
 the keyboard layout, meaning I have to remember which key 
 becomes which). If you have the luck to get your content 
 already typed, copy/paste is much more error-proof than the 
 alternatives.

Then you might like these pickers - designed for non-native user input. (Note 
that the Latin  diacritics picker probably includes most of what's needed for 
Vietnamese.)

http://people.w3.org/rishida/scripts/pickers/



Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 




RE: [WSG] Other character sets/languages

2005-02-21 Thread Richard Ishida



 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Dejan Kozina
 Sent: 21 February 2005 04:49

 One thing I've just thought of. The final hurdle in letting the world 
 see Vietnamese text is hoping that the visitor's browser has a font 
 capable of displaying the text. There is not much you can do if it 
 doesn't, but if it has one you should allow the browser to choose it 
 by not declaring a font-family for that part of the page.

Most likely, people who want to read (not look at) Vietnamese text will have 
fonts that support the characters.  

Note also that you can specify your preferred font in the CSS, and the 
font-family property allows you to specify more than one font for fallback 
support. For example, if you research the user base and discover that there are 
two or three Unicode fonts in common use, you can include them all.  In any 
case you should always finish a font-family declaration with 'serif' or 
'sans-serif' in this situation.  Then if none of the fonts you indicated are on 
the user's system, a font that they do have will be used.

e.g. body { font-family: "My preferred viet font", "An alternative font", 
sans-serif; ... }

hth
RI



Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 




RE: [WSG] Other character sets/languages

2005-02-21 Thread Richard Ishida

 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Gene Falck
 Sent: 20 February 2005 04:26

 OK, I understand about the BOM but this still leaves me 
 wondering how to save properly. I usually code using Notepad 
 which offers, from the Save As... menu choice, the Encoding options:
 
 ANSI
 Unicode
 Unicode big endian
 UTF-8
 
 but no UTF-8 BOM. How can I be sure I am saving in the right way?
 


People on the list may also find the following resource useful. It indicates
how to save files in UTF-8 from a number of different editing environments.

Setting encoding in web authoring applications
http://www.w3.org/International/questions/qa-setting-encoding-in-applications



Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 




Re: [WSG] Other character sets/languages

2005-02-21 Thread Dejan Kozina
Richard Ishida wrote:
> In any case you should always finish a
> font-family declaration with 'serif' or 'sans-serif' in this
> situation.  Then if none of the fonts you indicated are on the user's
> system, a font that they do have will be used.

Good point.
Lesson learned: I really shouldn't write heady stuff before sunrise and 
a fair serving of coffee.

What I had in mind was rather the case (admittedly rare, but it happened 
to me) when a non-Unicode font has the same name as a Unicode one. The 
culprit in my case was Georgia with CE characters, back when W2k was 
brand new. I made a website assuming every Georgia has the full set of 
Latin glyphs, while my customer had an Italian Win98 supplied with a 
Win-1252 Georgia... Still hate those empty squares.

Researching the user base is something I find iffy anyway. Every once in 
a while there is a thread trying to find a safe sequence of fonts usable 
both on Windows and MacOS, and it ends up with a boatload of different 
typefaces, plus assorted arguments about display details. Directly 
asking a Vietnamese designer might be more straightforward.

Anyway, my suggestion should be more correctly amended to: 'use a 
generic font-family and let the browser help itself, rather than risk a 
miss trying to overdesign the appearance'.

djn



RE: [WSG] Other character sets/languages

2005-02-21 Thread John Horner
> Then you might like these pickers - designed for non-native user 
> input. (Note that the Latin diacritics picker probably includes 
> most of what's needed for Vietnamese.)
>
> http://people.w3.org/rishida/scripts/pickers/

Thanks for that, very useful. I was skeptical, Vietnamese having such 
a wide variety of accents, double-accents, and even accents below as 
well as above the letter, but I was pleasantly surprised. I think 
they're all there, and any set that includes the letter O with a 
little comma sticking out of the side plus a teeny question mark 
floating over the top (as seen in everyone's favourite Vietnamese 
word, Phở) seems to be pretty much complete.

Thanks again everyone for your help. I'll let you look at the website 
when it's done.

Oh and incidentally, the Vietnamese Professionals Society are the 
body that looks after this kind of thing, fonts, keyboard layouts and 
so on, and they use and recommend Unicode here: 
http://www.vps.org/rubrique.php3?id_rubrique=91 so they're solidly on 
board with standards too.



   Have You Validated Your Code?
John Horner(+612 / 02) 9333 3488
Senior Developer, ABC Online  http://www.abc.net.au/



RE: [WSG] Other character sets/languages

2005-02-20 Thread Richard Ishida
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Gene Falck
 Sent: 20 February 2005 04:26

 OK, I understand about the BOM but this still leaves me 
 wondering how to save properly. I usually code using Notepad 
 which offers, from the Save As... menu choice, the Encoding options:
 
 ANSI
 Unicode
 Unicode big endian
 UTF-8
 
 but no UTF-8 BOM. How can I be sure I am saving in the right way?

I think you need to use a different editor, or (as I do) strip the BOM off
before publishing.

You may also find the following article useful. It explains the BOM and the
effects it can sometimes have on pages when present:
http://www.w3.org/International/questions/qa-utf8-bom
FAQ: Unexpected characters or blank lines


Here is the code of a Perl script I use to strip the BOM.  It's just a quick
hack, nothing beautiful, but it may help you or others when you cannot avoid
saving with a BOM.  (I call it by invoking a batch file in my Windows
directory: removebom filename.)

===
# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)

use strict;
use warnings;

if ($#ARGV > 0) {
    print STDERR "Too many arguments!\n";
    exit;
}

my @file;   # file content
my $lineno = 0;

my $filename = $ARGV[0];
if ($filename) {
    # read the whole file, dropping the BOM from the first line if present
    open( BOMFILE, '<', $filename ) || die "Could not open source file for reading.";
    while (<BOMFILE>) {
        if ($lineno++ == 0) {
            if ( index( $_, "\xEF\xBB\xBF" ) == 0 ) {
                s/^\xEF\xBB\xBF//;
                print "BOM found and removed.\n";
            }
            else { print "No BOM found.\n"; }
        }
        push @file, $_ ;
    }
    close (BOMFILE)  || die "Can't close source file after reading.";

    # write the content back over the same file
    open( NOBOMFILE, '>', $filename ) || die "Could not open source file for writing.";
    foreach my $line (@file) {
        print NOBOMFILE $line;
    }
    close (NOBOMFILE)  || die "Can't close source file after writing.";
}
else {  # STDIN -> STDOUT
    while (<>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;
        }
        push @file, $_ ;
    }
    foreach my $line (@file) {
        print $line;
    }
}
===
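
For anyone without Perl to hand, the same idea can be sketched in Python. This is an illustrative equivalent rather than the script above; the function names are invented here, and it works on raw bytes so that nothing else in the file is touched.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_place(path: str) -> bool:
    """Rewrite the file at path without its BOM; return True if one was removed."""
    with open(path, "rb") as handle:
        original = handle.read()
    cleaned = strip_utf8_bom(original)
    if cleaned == original:
        return False
    with open(path, "wb") as handle:
        handle.write(cleaned)
    return True
```

Calling strip_bom_in_place("index.htm") on a file saved with a BOM should rewrite it three bytes shorter; a file without one is left alone.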

HTH
RI


Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 

Publication blog:
http://people.w3.org/rishida/blog/
 








RE: [WSG] Other character sets/languages

2005-02-20 Thread Richard Ishida

 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of Gene Falck
 Sent: 20 February 2005 04:26

 In this matter, I am also wondering where using a meta tag 
 specifying iso-8859-1 fits in terms of following the 
 standards. I notice many people do this and I gather the 
 actual coding of keystrokes (on a standard PC keyboard set up 
 for US English) should be the same. Is saving a file as UTF-8 
 compatible with the iso-8859-1 meta tag?


Nope.  Please save the file in the same encoding as you declare it to be in
the meta statement.

This seems to be such a common question/mistake that the W3C is beginning to
write an article on the subject. 

The basic ASCII set of characters (ie. the first 128 characters) use the
same bytes in iso-8859-1 and utf-8, but as soon as you include a copyright
sign, an accented character, etc, you will have problems.  Besides which, it
is always better to be consistent anyway, and doesn't cost much.
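
To make the byte-level difference concrete, here is a small Python sketch (the helper name is invented for illustration). It encodes the same text both ways and compares the results: plain ASCII comes out identical, while a copyright sign is one byte in iso-8859-1 and two in utf-8.

```python
def encoded_forms(text):
    """Return the ISO-8859-1 and UTF-8 byte sequences for text."""
    return text.encode("iso-8859-1"), text.encode("utf-8")


for sample in ("Copyright", "\u00a9", "caf\u00e9"):
    latin1, utf8 = encoded_forms(sample)
    verdict = "same bytes" if latin1 == utf8 else "different bytes"
    print(repr(sample), latin1, utf8, verdict)
```

So a utf-8 file read as iso-8859-1 shows two characters where the copyright sign should be, which is the usual symptom of a mismatched declaration.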

hth
RI




Re: [WSG] Other character sets/languages

2005-02-20 Thread Jan Brasna
> I usually code using Notepad

Better use something like PSPad, which offers you the choice not to 
include these identification bytes (the BOM).

> file as UTF-8 compatible with the iso-8859-1 meta tag?

Eh, nope. If you start using non-ASCII characters (curly quotes etc.) it 
would break the page...

--
Jan Brasna aka JohnyB :: alphanumeric.cz | janbrasna.com
Stop IE! - http://www.stopie.com/ | http://browsehappy.com/


Re: [WSG] Other character sets/languages

2005-02-20 Thread Dejan Kozina
Gene wrote:
> I usually code using
> Notepad which offers, from the Save As... menu choice,
> the Encoding options:

I'm not really sure, as the Notepad I got with Win98 doesn't offer 
anything but 'text file' and 'all files' (Win98 doesn't do Unicode). What 
you can try is to save the page as utf-8, open it in Mozilla/Firefox and 
check the very first characters displayed. If there is no strange 
character there you know it's OK. I just tried the same trick with good 
old Wordpad (which has a Unicode option even with W98) and it saved my 
test file without the BOM.

> Is saving a
> file as UTF-8 compatible with the iso-8859-1 meta tag?

I'm not sure why you would want to do this, but here goes some reasoning 
on general principles. As long as the file is saved as utf-8 it contains 
the correctly encoded content and it's up to the browser to display it 
accordingly. Now, the primary source of encoding declaration for the 
browser is the HTTP header sent by the server along with the document 
(this is the .htaccess stuff I mentioned), which should override every 
other directive, including the meta declaration. Thus, the browser 
should choose the correct encoding and display both the English and the 
Vietnamese text. I don't recall anybody really testing browsers with 
that stuff, so you may be in for unexpected results here: if the browser 
ignores the rule and chooses to believe the meta directive instead of 
the header, it would display the English part correctly, but the 
Vietnamese one would be a sequence of empty squares, question marks 
and/or best-guess ISO-8859-1 characters (two or three for every Unicode 
one). As with too many things web-related, 'should' is an iffy thing to 
rely upon.

More, if somebody saves that page to the disk and looks at it later, the 
only source of encoding information would be the meta stuff, with the 
same result as above...
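
The precedence described above (header first, meta declaration as the fallback) can be put into a few lines of Python. Treat it as a rough sketch of the rule, not of what any particular browser does: real browsers also sniff BOMs and apply user overrides, and the function names and the simple regular expression are assumptions made for the example.

```python
import re
from email.message import Message

# Deliberately simple pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def header_charset(content_type):
    """Charset named in an HTTP Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html_bytes):
    """Charset named in a meta declaration near the top of the page, or None."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, html_bytes):
    """HTTP header wins; the meta declaration is only the fallback."""
    return header_charset(content_type) or meta_charset(html_bytes)
```

With a header of "text/html; charset=utf-8" and a meta tag claiming iso-8859-1, this returns "utf-8"; pass None for the header (the page opened from disk) and the meta value is all that is left.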

More generally, inputting characters not native to my keyboard/OS is to 
me the most annoying part of it all (I routinely have to input 
Central European stuff by switching the keyboard layout, meaning I have 
to remember which key becomes which). If you have the luck to get your 
content already typed, copy/paste is much more error-proof than the 
alternatives.

djn



Re: [WSG] Other character sets/languages

2005-02-20 Thread Gene Falck
Hi Dejan,

You wrote:
> I'm not really sure, as the Notepad I got with Win98
> doesn't offer anything but 'text file' and 'all files'

Hmm. I didn't think about different versions of
Windows. On my Windows XP, text file and all files
are the choices for Save as type: and the chance
to select the Encoding: is next below that. (The
bottom of the Save As... dialog box is partly off
screen at the bottom until I drag it up a bit.)

> ... save the page as utf-8, open it in Mozilla/Firefox
> and check the very first characters displayed. If there
> is no strange character there you know it's OK.

I have heard of this but also read (somewhere) that
later browsers from IE6 on have been fixed to not
display characters from trying to show the BOM; as
a result I thought nothing of the fact that I have
not seen such a result in IE6 and Mozilla 1.7.

>> Is saving a file as UTF-8 compatible with the
>> iso-8859-1 meta tag?
>
> I'm not sure why you would want to do this,

No reason, except that answers given on [WSG]
concerning the meta tag often show iso-8859-1
and this thread on file encoding is aimed at
UTF-8. I strongly suspected that both the meta
declaration and the file encoding should agree.

> ... some reasoning on general principles. As long
> as the file is saved as utf-8 it contains the
> correctly encoded content and it's up to the browser
> to display it accordingly. Now, the primary source
> of encoding declaration for the browser is the HTTP
> header sent by the server along the document (this
> is the .htaccess stuff I mentioned), which should
> override every other directive, including the meta
> declaration.

All of my efforts, so far, are stand-alone and
intranet applications, so I don't know what to
expect from actually having the file on a true
server situation accessed from the Internet.
Obviously, the fact that what I have been doing
works locally does not mean everything is OK as
to standards compliance.

> Thus, the browser should choose the correct encoding
> and display both the English and the Vietnamese text.
> ... in for unexpected results here: if the browser
> ignores the rule and chooses to believe the meta
> directive instead of the header, it would display
> correctly the English part, but the Vietnamese one
> would be a sequence of empty squares, question marks
> and/or best-guess ISO-8859-1 characters ...

Urk! Fortunately, my files are English-language
with a few &#...; codes for proper typographic
punctuation and some characters in names coming
from foreign languages, all typed on a US English
keyboard. Nevertheless I assume my not complying
with standards would, sooner or later, lead to
some hard-to-untangle problems.

> More, if somebody saves that page to the disk and
> looks at it later, the only source of encoding
> information would be the meta stuff, ...

Well, provided the browser doesn't cover up the
problem as it does part of the time! LOL.

My thanks to all who have contributed to my angle on
this thread--the how-to of getting the files right
seems to have very little in the line of resources,
unless, as I suggested, I just don't search the right
terms.

Regards,
Gene Falck
[EMAIL PROTECTED]


Re: [WSG] Other character sets/languages

2005-02-20 Thread Dejan Kozina
Hi Gene,
You wrote:
> the chance to select the Encoding: is next below that

True. Windows started using Unicode as of Win2K. I was surprised indeed 
to find the Unicode option in Win98's Wordpad. I was surprised again 
today when opening in Unired a file saved as 'Unicode text' with 
Wordpad. Unired said it was not utf-8 but utf-16 (Little Endian), so 
sending it as utf-8 would be incorrect, even if Mozilla seemed not to 
care that much.
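
A quick way to tell what an editor really wrote is to look at the first bytes of the file. The Python sketch below is one hedged way to do that; the function name is made up, it only recognises the UTF-8 and UTF-16 marks, and a file with no BOM at all simply reports None (which tells you nothing about its encoding).

```python
import codecs

# Byte order marks checked here; UTF-32 is left out of this sketch.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read the first few bytes of the file at path and sniff them."""
    with open(path, "rb") as handle:
        return sniff_bom(handle.read(4))
```

A file that reports "utf-16-le" here should not be served with a utf-8 declaration, whatever the Save dialog called it.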

> I thought nothing of the fact that I have
> not seen such a result in IE6 and Mozilla 1.7.

Mozilla 1.7.5 still proudly displays an ugly BOM, IE doesn't.

> All of my efforts, so far, are stand-alone and
> intranet applications, so I don't know what to
> expect from actually having the file on a true
> server situation accessed from the Internet.

As long as you have a web server on your intranet it shouldn't make any 
difference to the browser, it's just documents coming from the network. 
It's files from your disk that will miss the http headers.

> Urk! Fortunately, my files are English-language
> with a few &#...; codes for proper typographic
> punctuation and some characters in names

This works, but after a few characters it just becomes tiring ...

One thing I've just thought of. The final hurdle in letting the world 
see Vietnamese text is hoping that the visitor's browser has a font 
capable of displaying the text. There is not much you can do if it 
doesn't, but if it has one you should allow the browser to choose it 
by not declaring a font-family for that part of the page.

djn



Re: [WSG] Other character sets/languages

2005-02-19 Thread Dejan Kozina
woric wrote:
> Choose charset UTF-8 (not UTF-8 BOM) when saving.
> Can you explain the difference?

In other words, the BOM is a funny character Unicode uses as the very 
first char in some of its encoding forms to declare which byte is which 
when characters are composed of more than 1 byte. As stated by the 
Unicode consortium itself, utf-8 does not need this, so the mark can be 
safely ignored when creating a utf-8 document (you can even delete it 
from an existing document without consequences). Using the BOM in a 
utf-8 webpage would have two unhappy outcomes: Gecko-based browsers 
would display the thing (not something you'd usually like), and IE would 
render the page in Quirks mode (as with every other character coming 
before the Doctype declaration).

The second point is really related to the document language, not the 
character encoding. Declaring it properly (with <html lang="en"> and 
<div lang="vi">) should help screen-readers read each part of the page 
with the correct pronunciation and search engines recognize the content 
language (e.g. every localized Google has an option to search only 
documents in its native language).

djn




Re: [WSG] Other character sets/languages

2005-02-19 Thread Gene Falck
Hi Dejan,
You wrote:
> woric wrote:
>> Choose charset UTF-8 (not UTF-8 BOM) when saving.
>> Can you explain the difference?
>
> In other words, the BOM is a funny character Unicode uses as the very 
> first char in some of its encoding forms to declare which byte is which 
> when characters are composed of more than 1 byte. As stated by the Unicode 
> consortium itself, utf-8 does not need this, so the mark can be safely 
> ignored when creating a utf-8 document (you can even delete it from an 
> existing document without consequences). Using the BOM in a utf-8 webpage 
> would have two unhappy outcomes: Gecko-based browsers would display the 
> thing (not something you'd usually like), and IE would render the page in 
> Quirks mode (as with every other character coming before the Doctype 
> declaration).

OK, I understand about the BOM but this still leaves me
wondering how to save properly. I usually code using
Notepad which offers, from the Save As... menu choice,
the Encoding options:

ANSI
Unicode
Unicode big endian
UTF-8

but no UTF-8 BOM. How can I be sure I am saving in the
right way?

In this matter, I am also wondering where using a meta
tag specifying iso-8859-1 fits in terms of following the
standards. I notice many people do this and I gather the
actual coding of keystrokes (on a standard PC keyboard
set up for US English) should be the same. Is saving a
file as UTF-8 compatible with the iso-8859-1 meta tag?

I have been checking in search engines and looking around
in our [WSG] list resources, but I have concluded that I
have no idea what to call my questions.
Regards,
Gene Falck
[EMAIL PROTECTED]


Re: [WSG] Other character sets/languages

2005-02-17 Thread John Horner
Thanks very much for that, Dejan.

> Choose charset UTF-8 (not UTF-8 BOM) when saving.

Can you explain the difference?

> Don't forget to mark up properly the Vietnamese content with <div 
> lang="vi"> or such...

Now the one easy thing about this project is that Vietnamese already 
contains all the unaccented roman letters. So I can set the whole 
page to be Vietnamese I guess and it won't stop the English being 
English... Or would that cause a problem?

Thanks again,

   Have You Validated Your Code?
John Horner(+612 / 02) 9333 3488
Senior Developer, ABC Online  http://www.abc.net.au/



Re: [WSG] Other character sets/languages

2005-02-17 Thread woric
 Choose charset UTF-8 (not UTF-8 BOM) when saving.

 Can you explain the difference?

Hi John,

yes I'd be glad to explain the difference.

When saving in a Unicode encoding, a Byte Order Mark (or BOM) can be added
to signify which encoding form and byte order the file uses.

The bad news is that the BOM may make the file unreadable to applications
which are not Unicode aware; so when saving UTF8 you should only add a BOM
if you know the application that will open the file can handle it.

See http://www.unicode.org/faq/utf_bom.html#BOM for more details.

woric






[WSG] Other character sets/languages

2005-02-16 Thread John Horner
This is kind of embarrassing to admit, but for the very first time, 
I've undertaken to code a page (partially) in another language, and 
in another character set too, and I don't really know how to do it 
properly.

And it's not just a matter of a few accents here and there -- the 
language is Vietnamese, which has all kinds of interesting 
double-diacritics and things like a crossed-out letter D 
(<strike>D</strike> would approximate it).

So, where to start?

The standards way to do it these days is with Unicode, right? In the 
old days we would have used one of the three different Vietnamese 
encodings -- TCVN, VPS or VISCII are what Firefox offers me -- but 
now Unicode should have done away with that stuff?

So, do I code the page in UTF-8? I don't use a special Vietnamese encoding?

And, no matter what you guys tell me, as I don't read the language, 
someone else will supply me with the text, and I can only pray it's 
from a Unicode-compliant source?

I tried to educate myself about Unicode by reading Joel Spolsky's 
"The Absolute Minimum Every Software Developer Absolutely, Positively 
Must Know About Unicode and Character Sets (No Excuses!)" 
http://www.joelonsoftware.com/articles/Unicode.html which was very 
entertaining, but I'm not sure I got it or I wouldn't be asking...

   Have You Validated Your Code?
John Horner(+612 / 02) 9333 3488
Senior Developer, ABC Online  http://www.abc.net.au/



Re: [WSG] Other character sets/languages

2005-02-16 Thread Dejan Kozina
Hi John,
Unicode is today the most foolproof way of sending internationalized 
characters to modern browsers. I use Unired for the purpose: 
http://www.esperanto.mv.ru/UniRed/ENG/
It's free and it works fine to boot. You should be able to copy/paste 
into your HTML from Word, PDF and anything that can display Vietnamese 
characters. Choose charset UTF-8 (not UTF-8 BOM) when saving.

Next you need to tell the browser about the encoding. The 
standards-compliant way is to use HTTP headers. On Apache just add a 
line with 'AddDefaultCharset utf-8' to your .htaccess. Not sure about 
other kinds of server. Just to be safe put '<meta 
http-equiv="Content-Type" content="text/html; charset=utf-8">' into the 
head of the document (as early in the source as possible).

Don't forget to mark up properly the Vietnamese content with <div 
lang="vi"> or such...
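
Putting the pieces together, here is a rough Python sketch of what the finished file might look like and how to write it out without a BOM. The page content, the file name and the function names are all invented for the example; the point is only that the bytes on disk are utf-8, start directly with the doctype, and carry the meta declaration and lang attributes mentioned above.

```python
PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Sample</title>
</head>
<body>
<p>English text.</p>
<div lang="vi"><p>Ti\u1ebfng Vi\u1ec7t</p></div>
</body>
</html>
"""


def utf8_bytes_without_bom(text):
    """Encode text as UTF-8; Python's "utf-8" codec never prepends a BOM."""
    return text.encode("utf-8")


def save_page(path, text):
    """Write text to path as BOM-less UTF-8."""
    with open(path, "wb") as handle:
        handle.write(utf8_bytes_without_bom(text))


# Usage (hypothetical file name):
#   save_page("index.htm", PAGE)
```

The server header still has to agree with this, so the AddDefaultCharset line above remains the other half of the job.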

Well, that's more or less all.
djn
John Horner wrote:
So, do I code the page in UTF-8? I don't use a special Vietnamese encoding?