On 1/28/26 23:56, Peter P. wrote:
> Thanks Ben, your hint made the whole patch work with umlauts and
> accents.
>
> Thanks for the explanation IOhannes. I think I am having a hard time
> understanding the following:
>
> - Bytes, which have a range between 0 and 255.
> - "Unicode points" which are numbers representing characters including
>    umlauts?
> - "ASCII characters", which are somehow bytes but also use unicode points 0-127.

yes, this is a good recap. not sure what you missed :-)

"ASCII characters" is a list of characters, numbered from 0 to 127.
these characters include the latin alphabet, the digits, a few special characters (like "~" or "#") and a handful of control characters.
notably, it lacks umlauts and the like.
<https://en.wikipedia.org/wiki/ASCII>

"Unicode" is another list of characters, numbered from 0 to a bit above 1000000 (0x10FFFF, to be exact; these numbers are the so-called "Unicode points"). Apart from the characters already in ASCII, it also lists characters from different writing systems, including German (with umlauts!), Tamil and Chinese. It also lists a lot of emojis, but weirdly enough lacks characters for Tengwar.
<https://en.wikipedia.org/wiki/Unicode>

"bytes" are data chunks of 8bit (and are the fundamental unit of data when computers are involved, e.g. when reading files, transmitting things over the network,...)
since bytes are 8bit, they can represent numbers from 0 to 255.

if you have a text-file with plain ASCII text (only) in it, the bytes in the file correspond to the ASCII values. e.g. if the first byte in the text file is 0x48, this corresponds to the letter "H" (ASCII code 72). this is cool, but it really only works because a byte can hold all the possible values of ASCII characters.
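
a minimal sketch of this byte/ASCII correspondence, written in Python rather than Pd (the sample text "Hello" and the variable names are just assumptions for illustration):

```python
# the bytes of a plain ASCII text, as they would be stored in a file
data = "Hello".encode("ascii")

first = data[0]          # indexing a bytes object gives a number 0..255
letter = chr(first)      # look the number up in the character table
values = list(data)      # all bytes as plain numbers

print(hex(first), first, letter)
print(values)
```

each number in `values` is at the same time a byte in the file and the ASCII code of one letter.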

it doesn't work with Unicode. e.g. the letter "π" (Greek small letter pi) is assigned the code point 0x03C0 (number 960) - and there's no way to stuff this number into a single byte (8 bits).

a couple of schemes have been invented for representing these Unicode numbers as bytes - but all of them need more than a single byte to represent 0x03C0.
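
as a rough illustration in Python (not Pd): the code point of "π" is too large for one byte, and three common encoding schemes each spend at least two bytes on it. the choice of UTF-16 and UTF-32 as comparison schemes is mine, not something the discussion above depends on.

```python
pi = "π"
code_point = ord(pi)               # the Unicode number of the character
fits_in_byte = code_point <= 255   # a byte can only hold 0..255

# the same code point under three different encoding schemes
utf8 = pi.encode("utf-8")
utf16 = pi.encode("utf-16-be")     # big-endian, without byte order mark
utf32 = pi.encode("utf-32-be")     # big-endian, without byte order mark

print(code_point, hex(code_point), fits_in_byte)
print(list(utf8), list(utf16), list(utf32))
```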

the most common scheme in use today is UTF-8, which has a "variable length encoding" - meaning that a single unicode point can be represented as 1, 2, 3 or 4 bytes: basically, the smaller the value, the fewer bytes you need. a nice property of UTF-8 is that it is a strict superset of ASCII, meaning that it is fully backward compatible with ASCII. that is: all characters from the ASCII table are assigned the same Unicode point *and* these characters are all represented as single bytes.
so your ASCII file (or any other ASCII file) is a fully valid UTF-8 file.
all characters beyond the 128 ASCII chars (upwards from unicode point 128) need more than 1 byte!
<https://en.wikipedia.org/wiki/UTF-8>

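the variable length and the ASCII compatibility of UTF-8 can be checked with a few lines of Python (again just a sketch outside of Pd; the four sample characters are arbitrary picks, one for each possible length):

```python
# one sample character for each possible UTF-8 length
samples = ["H", "ä", "€", "😀"]
lengths = {}
for char in samples:
    encoded = char.encode("utf-8")
    lengths[char] = len(encoded)
    print(char, ord(char), list(encoded))

# plain ASCII text yields identical bytes in both encodings...
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
# ...so bytes written as ASCII can be read back as UTF-8
roundtrip = "Hello".encode("ascii").decode("utf-8")
```

the higher the code point, the more bytes are printed for it; only "H" (below 128) gets away with a single byte.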


now, Pd's internal representation of characters uses UTF-8.
[fudiformat] takes a message, converts it into a UTF-8 string and then outputs the bytes of this string. e.g. a message [π( is converted into a string "π;\n" (the trailing semicolon and linefeed are FUDI specifics). these three characters have the Unicode points '960 59 10', which in UTF-8 are encoded as the bytes '207 128 59 10'. (in theory; it seems there's a minor bug so that the values are output as *signed* 8bit numbers, so the values '207 128' (0xCF and 0x80) are output as '-49 -128'; luckily this bug usually doesn't matter, as the usual objects you use its output with, e.g. [fudiparse] or [netsend], will happily handle this for you).
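
the numbers in this example can be reproduced in Python. this is only a sketch of the arithmetic, not of how Pd does it internally; in particular the signed/unsigned conversion is my assumption about what "signed 8bit" means here (two's complement).

```python
message = "π;\n"   # the FUDI string from the example above

code_points = [ord(c) for c in message]
utf8_bytes = list(message.encode("utf-8"))

# reinterpret each byte as a signed 8bit number (two's complement)
signed = [b - 256 if b > 127 else b for b in utf8_bytes]

# going back: mask to 8 bits to recover the unsigned byte values
unsigned = [s & 0xFF for s in signed]
decoded = bytes(unsigned).decode("utf-8")

print(code_points, utf8_bytes, signed)
```

since the signed and unsigned values carry the same 8 bits, nothing is lost: masking with 0xFF restores the original bytes and thus the original string.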

> - The "string" message and its difference to the "text" message in
>    [text2d].

the "text" message just takes Pd atoms and converts them into a string (very much like e.g. [print]) and renders it.

the "string" message takes a list of numbers (e.g. ASCII values, but really Unicode points!), converts them into a string and renders it.
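
what "numbers are Unicode points" means in practice, sketched in Python. this only mirrors the description above and is not taken from the [text2d] sources; the sample numbers are made up for illustration.

```python
# a list of Unicode points, as one might send them with a "string" message
numbers = [71, 114, 252, 223, 101, 32, 960]
text = "".join(chr(n) for n in numbers)

# feeding UTF-8 *bytes* instead of code points gives a different string,
# because each byte is then looked up as a code point of its own
utf8_bytes = list("π".encode("utf-8"))
not_pi = "".join(chr(b) for b in utf8_bytes)

print(text)
print(utf8_bytes, [hex(ord(c)) for c in not_pi])
```

so a list of code points for "π" is the single number 960, while the two numbers 207 128 would be read as two separate characters.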


if you have any more questions, do not hesitate to ask.

gasdm
IOhannes



--
please do not CC me for list-emails
