[abcusers] Nastiness. [Was Unicode]

2004-04-29 Thread Christian M. Cepel
Steven Bennett wrote:
Christian M. Cepel wrote:
 

It was my understanding that all unicode character sets contain English
characters mapped to the same values they're mapped to in other sets.
   

Close -- Unicode is a *single* character set.  For convenience, you'll
frequently run into references to Unicode code pages, but all they are is a
range within the overall character set.  All characters from every encoding
that Unicode supports exist somewhere in that character set.
So with a Unicode (UTF-8 or UTF-16) encoded text file you could easily have
English, Chinese, Korean, Russian, and Symbol characters all in the same
sentence.
 

You know, about 3 years ago while in a SoftEng class, I started 
Thistledowne, and voiced my intentions to make it unicode16 native.  I 
was SUPER rudely kicked in the nuts by some on this list and then thrown 
in the doghouse while a major flamewar resulted.. well not a flamewar 
exactly.  I was the target, and was bombed without mercy.   I was told 
Hey stupid, ABC is strictly 7bit ascii, and there's damn good reasons 
why it's that way, so wanting to use Unicode is stupid and you should 
kill yourself for even thinking of it.

God people were mean and rude and nasty, along with the typical 
Oh...Yet another abc project...  And you're excited... Tell me what's 
gonna make your project shine over the hundreds of projects done by 
people who are probably better than you.  Go jump in a lake.  response.

Oh I continued my project, and got an A, and scrapped it and will use 
what I learned there for my new project.   Boy, I learned never to tell 
people on the list I had a project going.   Sure a few were encouraging, 
but who could hear their voice over the nastiness.

//Christian
Another convenient item is that the first Unicode code page 0x0000 - 0x007f
is the ASCII code.  So if you're using wchar_t instead of char as your string
element type, then comparisons like:
   if (str[0] == 'K')
...will work the same when using Unicode or ASCII.  The only difference is
now str points to an array of wide values (16 bit on Windows, usually 32 bit
elsewhere) instead of 8 bit ones.
--Steve Bennett
To subscribe/unsubscribe, point your browser to: http://www.tullochgorm.com/lists.html
 


--
 //Christian
Christian Marcus Cepel| And the wrens have returned 
[EMAIL PROTECTED] icq:12384980  | are nesting; In the hollow of
371 Crown Point, Columbia, MO | that oak where his heart once
65203-2202 573.999.2370   | had been; And he lifts up his
Computer Support Specialist, Sr.  | arms in a blessing; For being
University of Missouri - Columbia | born again.--Rich Mullins


Re: [abcusers] File names

2004-04-29 Thread John Chambers
Jack Campin writes:
|
| One problem: what if you want to mix character sets in a tune? -
| e.g. to have a Chinese song documented in English? (T: and w:
| fields in Chinese, N: and D: fields in English).

What  I'd  more  likely  want  to  do  is:  three  T: fields (Chinese
characters,  pinyin,  and  English),  one  C:  line  (characters  and
pinyin), and two or three w: lines.

Even more fun is the trad Yiddish/Hebrew/Arabic music, where you want
the  original (right-left), a transliteration (left-right), and maybe
an English set of words at times.  This is easy  inside  a  computer,
where  all alphabets have the same order (byte 0, 1, 2, ...) but it's
not always easy to find a really good layout on screen or paper.

There's a semi-standard way to do left-right music  with  the  lyrics
underneath  with  each  syllable in right-left form.  It's painful to
read, but you get used to it.  Of course, all of these languages have
been  printed  with the music in mirror-image form for centuries, but
that doesn't help when you want the lyrics in two different alphabets
that go in different directions.

It's probably good that the  Greeks  dropped  their  zig-zag  writing
scheme before modern music notation was developed ...



Re: [abcusers] File names

2004-04-29 Thread Stephen Kellett
In message [EMAIL PROTECTED], Phil 
Taylor [EMAIL PROTECTED] writes
On 29 Apr 2004, at 00:32, Steven Bennett wrote:
According to Apple docs (I'll take their word for it... ;):
0x2028 -- Unicode line separator
0x2029 -- Unicode paragraph separator
Thank you Steve,
Pardon my ignorance, but how do you know that you're dealing with 
Unicode here, rather than the ascii " (" and " )"?

I guess it's a problem for some charsets, but for Western ones, the high 
byte of the two will be NULL. Thus you can scan the text and if you find 
NULL bytes before the end of the string (I assume you know your string 
length) followed by a non-NULL byte you can assume it's Unicode.

ABC is the characters 65, 66, 67
ABC in ASCII is 0x41, 0x42, 0x43
ABC in Unicode (UTF-16) is 0x00,0x41, 0x00,0x42, 0x00,0x43 but in 16 bit 
lumps rather than 8 bit.

Not an ideal solution, but for western charsets this test has not failed 
me yet. Note, I don't do much internationalised code, but there are 
places where I need to make a reasonable guess (I write debugging tools 
and don't know for sure what data will be presented to me ahead of 
time), it works.

For people working with international character sets this trivial test 
may well fail in some cases.
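
A minimal C sketch of the scan described above, for illustration only. The
function name looks_like_utf16 is made up for this example, and the rule is
exactly the rough one stated: a zero byte followed by a non-zero byte before
the end of the buffer. It is a guess, not a reliable detector.

```c
#include <stddef.h>

/* Heuristic from the message above: Western text stored as 16 bit
 * values has a zero byte in every pair, so a zero byte followed by a
 * non-zero byte suggests "Unicode" (UTF-16) rather than plain ASCII.
 * Returns 1 for "probably UTF-16", 0 otherwise. */
int looks_like_utf16(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] != 0x00)
            return 1;
    }
    return 0;
}
```

Called on the bytes 0x41, 0x42, 0x43 this returns 0; called on
0x00,0x41, 0x00,0x42, 0x00,0x43 it returns 1. Text made entirely of
characters whose high byte is non-zero (most Chinese text, say) would be
misjudged, which is the failure case mentioned above.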

Stephen
--
Stephen Kellett
Object Media Limited    http://www.objmedia.demon.co.uk
RSI Information:        http://www.objmedia.demon.co.uk/rsi.html


Re: [abcusers] File names

2004-04-29 Thread Christian M. Cepel
Phil Taylor wrote:
On 29 Apr 2004, at 08:34, Stephen Kellett wrote:
In message [EMAIL PROTECTED], 
Phil Taylor [EMAIL PROTECTED] writes

On 29 Apr 2004, at 00:32, Steven Bennett wrote:
According to Apple docs (I'll take their word for it... ;):
0x2028 -- Unicode line separator
0x2029 -- Unicode paragraph separator

Thank you Steve,
Pardon my ignorance, but how do you know that you're dealing with 
Unicode here, rather than the ascii " (" and " )"?

I guess it's a problem for some charsets, but for Western ones, the 
high byte of the two will be NULL. Thus you can scan the text and if 
you find NULL bytes before the end of the string (I assume you know 
your string length) followed by a non-NULL byte you can assume it's 
Unicode.

ABC is the characters 65, 66, 67
ABC in ASCII is 0x41, 0x42, 0x43
ABC in Unicode (UTF-16) is 0x00,0x41, 0x00,0x42, 0x00,0x43 but in 16 bit 
lumps rather than 8 bit.

OK, I understand that.  What was bothering me though, is how Steven 
B's parser is going to deal with regular ascii strings which include a 
space followed by a bracket.  It's no problem when everything is 
unicode, or everything is ascii, but if we are to have ascii abc which 
may include unicode strings, we will need a way of indicating this to 
the parser, will we not?

Phil Taylor

Would not a charset specifier be a good addition?  (if there is already 
such, I shall be most embarrassed... as I am pretty much every day).  A 
rule such as, if you use something specific to a charset, you must 
specify it otherwise expect it to be 7bit ascii and display wrongly.
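
A rough C sketch of how a parser might apply such a rule. Everything named
here is hypothetical: the "%%charset" directive is invented for the example
and is not taken from any abc standard, and charset_from_line is just an
illustrative helper. The only point is the fallback: no declaration means
7bit ascii.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical directive, invented for this sketch. */
#define CHARSET_DIRECTIVE "%%charset "

/* If line starts with the directive, copy the charset name that follows
 * into out and return 1. Otherwise (or if the name is empty or does not
 * fit) copy the default "us-ascii" into out and return 0. */
int charset_from_line(const char *line, char *out, size_t outsz)
{
    size_t n = strlen(CHARSET_DIRECTIVE);

    if (strncmp(line, CHARSET_DIRECTIVE, n) == 0) {
        /* The name runs up to the first blank or line ending. */
        size_t len = strcspn(line + n, " \t\r\n");
        if (len > 0 && len < outsz) {
            memcpy(out, line + n, len);
            out[len] = '\0';
            return 1;
        }
    }
    snprintf(out, outsz, "%s", "us-ascii");
    return 0;
}
```

A reader would call this on each header line and keep the last declared
value, converting the rest of the tune from that charset.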

--
 //Christian


Re: [abcusers] reusable parser

2004-04-29 Thread Jack Campin
 Perhaps this is a good time to bring up the idea of a central set of 
 parser test cases and test case fragments.  In the past a number of 
 list members have mentioned the desire to have a corporate body of test 
 cases that could be used during testing and development of abc parsers. 
 Perhaps this open source project could be a forum to officially start
 collecting those test cases.

I've given several developers here copies of my CD-ROMs, as they're
large, carefully transcribed collections, intended to be used far
into the future, and where any idiosyncrasies are deliberate.  The
dialect of ABC I use is basically BarFly-with-the-bugs-fixed and
without using some of the more arcane stuff.  The idea is that if any
incompatibilities should arise in future, it'll be clear enough what
I meant that any musically knowledgeable user can fix the problem.
The one important place where I've gone beyond anything BarFly has
at present is in using part playing order for multivoice pieces with
the same semantic model as in the 1.6 standard for monophonic ones.
It's obvious what I mean but no software can interpret it yet; I have
no intention of altering it until someone comes up with a syntax that
expresses what I want, and meanwhile there are hundreds of CD-ROMs
floating about with this construct done my way.

There's a sample on my Music of Dalkeith website which has a bunch
of ABC tunes with MIDI, QuickTime and GIF tadpole equivalents that I've
generated and proofed myself, so the intended semantics is publicly
available.  I would suggest that test cases be documented the same way.


 This brings up another design/requirements issue when constructing this 
 parser:  to what degree should the parser be lenient with non-standard 
 abc usage?

For handling a large corpus (and preferably the entire corpus) you don't
just want to be lenient, you want to provide error handling that will
help a client disambiguate almost anything the most misinformed newbie
might have tried with the aid of the least standard software or none at
all.  The musical content might still be valuable and the originator
might be dead.


-
Jack Campin: 11 Third Street, Newtongrange, Midlothian EH22 4PU; 0131 6604760
http://www.purr.demon.co.uk/jack * food intolerance data & recipes,
Mac logic fonts, Scots traditional music files, and my CD-ROM Embro, Embro.
-- off-list mail to j-c rather than abc at this site, please --




Re: [abcusers] File names

2004-04-29 Thread Steven Bennett
Phil Taylor wrote:
 OK, I understand that.  What was bothering me though, is how Steven B's
 parser is going to deal with regular ascii strings which include a
 space followed by a bracket.  It's no problem when everything is
 unicode, or everything is ascii, but if we are to have ascii abc which
 may include unicode strings, we will need a way of indicating this to
 the parser, will we not?

Actually, you really can't mix encodings in the same file -- it will become
nearly impossible to parse, since you have no way of telling where the
switch occurs, and can't identify most encodings from their data anyway.

The closest thing to a mixed encoding would be a UTF-8 file, which uses
multi-byte sequences to reproduce Unicode characters which cannot be
expressed in ASCII.  I'm not sure how the Unicode line endings I listed
get translated into UTF-8 offhand.
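
For reference, the translation is mechanical: UTF-8 stores any code point
from 0x0800 to 0xffff as three bytes. A small C sketch follows; the function
name utf8_encode_bmp is made up for this example, it only covers 16 bit
code points, and it does not reject surrogate values.

```c
#include <stddef.h>

/* Encode one code point in the range 0x0000..0xFFFF as UTF-8.
 * Writes 1 to 3 bytes into out and returns how many were written.
 * Sketch only: surrogates (0xD800..0xDFFF) are not rejected. */
size_t utf8_encode_bmp(unsigned int cp, unsigned char out[3])
{
    if (cp < 0x80) {                    /* plain ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | ((cp >> 12) & 0x0F));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

With this, 0x2028 comes out as the bytes 0xe2 0x80 0xa8 and 0x2029 as
0xe2 0x80 0xa9, so neither can be confused with an ASCII space followed by
a bracket.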

A Unicode text file (UTF-16 - each character is a 16 bit word, although
Unicode has multi-word sequences as well...) is usually identified by a
byte-order mark in the first two bytes of the file: 0xfe 0xff (big-endian)
or 0xff 0xfe (little-endian).  If you're going to support Unicode, you
should probably keep all your strings in unicode format internally, and
convert everything else you read to that.

UTF-8 is tougher.  It has a similar identifying sequence (0xefbbbf) but I've
found not all UTF-8 files include that sequence.  (And even some Mac tools
forget to put it there.  IMHO, they should fix that -- I may even report it
as a bug...)  If you're running on a system where UTF-8 is common, then you
might try it with all files, since ASCII is a subset of UTF-8.
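
A minimal C sketch of checking for those identifying sequences at the start
of a file. The names bom_kind and detect_bom are invented for this example.
The byte values are the standard ones: FE FF or FF FE for UTF-16, and
EF BB BF for UTF-8. A file with no mark at all simply reports BOM_NONE, and
the caller has to fall back on something else.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF8 };

/* Look at the first bytes of a buffer and report which byte-order
 * mark, if any, it starts with. Sketch only: UTF-32 marks are not
 * handled (FF FE 00 00 would be reported as UTF-16 little-endian). */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

The UTF-8 test comes first only for tidiness; its first byte cannot collide
with either UTF-16 mark.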

Files in any other encoding (Old Mac OS multi-byte, Windows multi-byte,
Shift-JIS, or any of the various 1 byte encoding pages, like Windows Latin-1
or Mac OS Western) are pretty much impossible to determine just from the
file contents.  If you really want to support them properly, you may need to
have a means for the user to specify the encoding.  The Mac OS X text editor
lists some 84 different encodings that the user can tell it a file may be,
and only 3 of them can be determined programmatically.

I could add that kind of user setting to my parser, but I'm more or less
satisfied to support the ones which are detected automatically - UTF-16,
UTF-8 (if it has the proper identifier), and the user's local default
encoding.

--Steve Bennett



Re: [abcusers] Unicode [was: file names]

2004-04-29 Thread Steven Bennett
Christian M. Cepel wrote:

 It was my understanding that all unicode character sets contain English
 characters mapped to the same values they're mapped to in other sets.

Close -- Unicode is a *single* character set.  For convenience, you'll
frequently run into references to Unicode code pages, but all they are is a
range within the overall character set.  All characters from every encoding
that Unicode supports exist somewhere in that character set.

So with a Unicode (UTF-8 or UTF-16) encoded text file you could easily have
English, Chinese, Korean, Russian, and Symbol characters all in the same
sentence.

Another convenient item is that the first Unicode code page 0x0000 - 0x007f
is the ASCII code.  So if you're using wchar_t instead of char as your string
element type, then comparisons like:

if (str[0] == 'K')

...will work the same when using Unicode or ASCII.  The only difference is
now str points to an array of wide values (16 bit on Windows, usually 32 bit
elsewhere) instead of 8 bit ones.
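
A small C sketch of that comparison against both string types. The helper
names are invented for this example. In standard C the wide type is spelled
wchar_t, and its width depends on the platform (16 bit on Windows, usually
32 bit on Unix-like systems), so the code should not assume a size.

```c
#include <wchar.h>

/* The same first-character test on a narrow (char) string... */
int starts_with_K_narrow(const char *str)
{
    return str[0] == 'K';
}

/* ...and on a wide (wchar_t) string. L'K' is the wide literal; on
 * ASCII-based systems it has the same value (0x4B) as plain 'K'. */
int starts_with_K_wide(const wchar_t *str)
{
    return str[0] == L'K';
}
```

Both return non-zero for a key field such as "K:G" (or L"K:G") and zero for
a title field such as "T:...".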

--Steve Bennett
