Re: Text encoding Babel. Was Re: George Keremedjiev
On Tue, 4 Dec 2018, Liam Proven wrote:

> > I don't know if the unreal mode has been retained in the x86 architecture
> > to this day; as I noted above it was not officially supported. But then
> > some originally undocumented x86 features, such as the second byte of the
> > AAD and AAM instructions actually being an immediate argument that could
> > have a value different from 10, became standardised at one point.
>
> I know, and was surprised, that v86 mode isn't supported in x86-64.

In the native long mode, that is. If you run the CPU in 32-bit mode, then VM86 works. I guess AMD didn't want to burden the architecture in case pure 64-bit parts were made in the future.

> This caused major problems for the developers of DOSEMU.

And also for expansion-BIOS emulation, especially with graphics adapters (which, together with scarce-to-nonexistent hardware documentation, made mode switching even trickier in Linux than it already was). It looks like fully-software machine-code interpretation, as with QEMU, is the only way remaining for x86-64.

 Maciej
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sat, 1 Dec 2018 at 02:00, Maciej W. Rozycki wrote:

> Be assured there were enough IBM PC clones running DOS around from 1989
> onwards for this stuff to matter,

OK, fair enough. Thanks for the info!

> and hardly anyone switched to MS Windows before version 95 (running
> Windows 3.0 with the ubiquitous HGC-compatible graphics adapters was sort
> of fun anyway, and I am not sure if Windows 3.1 even supported it; maybe
> with extra drivers).

It did.

Demo: https://www.youtube.com/watch?v=0lOGPQQlxT8
Screenshot: http://nerdlypleasures.blogspot.com/2016/12/windows-30-multimedia-edition-early.html

The difficult bit was Windows 3.0 on an 8088/8086 with VGA, I believe. The VGA driver contained 80286 instructions because MS didn't imagine anyone would want Win3 on such old PCs. (This again shows that MS didn't believe Win3 would be such a big hit, giving the lie to all the pro-OS/2 anti-MS conspiracy theories... https://virtuallyfun.com/wordpress/2011/06/01/windows-3-0/ )

As I heard it and faintly recall, to run Win3 on an 8086 in VGA mode you had to replace the CPU with an NEC V20 or V30... The driver did later get patched to work: http://www.vcfed.org/forum/showthread.php?35593-Windows-3-0-VGA-color-driver-for-8088-XT

-- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
Re: Text encoding Babel. Was Re: George Keremedjiev
On Tue, 4 Dec 2018 at 15:02, Maciej W. Rozycki via cctalk wrote:

> I don't know if the unreal mode has been retained in the x86 architecture
> to this day; as I noted above it was not officially supported. But then
> some originally undocumented x86 features, such as the second byte of the
> AAD and AAM instructions actually being an immediate argument that could
> have a value different from 10, became standardised at one point.

I know, and was surprised, that v86 mode isn't supported in x86-64. This caused major problems for the developers of DOSEMU.

-- Liam Proven - Profile: https://about.me/liamproven
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, 30 Nov 2018, Fred Cisin via cctalk wrote:

> > Well, ATA drives at that time should have already had the capability to
> > remap bad blocks or whole tracks transparently in the firmware, although
>
> Not even IDE.
> Seagate ST4096 (ST506/412 MFM) 80MB formatted, which was still considered
> a good size by those of us who weren't wealthy.

Sure! You did need a bad block list for such a drive though.

> > Of course the ability to remap bad storage areas transparently is not an
> > excuse for the OS not to handle them gracefully; back then it was not
> > yet the era when a hard drive with a bad block or a dozen was considered
> > broken, as it usually is nowadays.
>
> Yes, they still came with a list of known bad blocks. Usually taped to the
> drive. THIS one wasn't on the manufacturer's list, and neither SpeedStor
> nor SpinRite could find it!
> There were other ways to lock out a block besides filling it with a
> garbage file, but that was easiest.

IIRC for MS-DOS the canonical way was to mark the containing cluster as bad using a special code in the FAT. Both `format' and `chkdsk' were able to do that, as were some third-party tools. That ensured that disk maintenance tools, such as `defrag', didn't reuse the cluster for something else, as could happen if the cluster were simply assigned to a real file.

> And, I did try to tell the Microsoft people that the OS "should recover
> gracefully from hardware errors". In those words.

I found switching to Linux a reasonable solution to this kind of customer service attitude. There you can fix an issue yourself or, if you don't feel like it, you can hire someone to do it for you (or often just ask kindly, as engineers usually feel responsible for code they have committed, including any bugs). :)

> > Did 3.1 support running in the real mode though (as opposed to switching
> > to the real mode for DOS tasks only)? I honestly do not remember anymore,
> > and ISTR it was removed at one point. I am sure 3.0 did.
> I believe that it did. I don't remember WHAT the program didn't like
> about 3.1, or if there was a real reason, not just an arbitrary limit.
> I don't think that the Cordata's refusal to run on 286 was based on a
> real reason.
>
> But, the Win 3.1 installation program(s) balked at anything without A20
> and a tiny bit of RAM above 10h. I didn't have a problem with having a
> few dedicated machines (an XT with Cordata interface, an AT with
> Eiconscript card for PostScript and HP PCL, an AT with Win 3.0 for the
> font editor, a machine for disk duplication (no-notch disks), order
> entry, accounting, and lots of machines with lots of different floppy
> drive types.) I also tested every release of my programs on many
> variants of the platform (after I discovered the hard way that the 286
> had a longer pre-fetch buffer than the 8088!)

Hmm, interesting. I never tried any version of MS Windows on a PC/XT class machine, and the least equipped 80286-based system I've used had at least 1MiB of RAM and a chipset clever enough to remap a part of it above 1MiB. That memory was then made available via HIMEM.SYS.

What might be unknown to some is that, apart from toggling the A20 mask gate, HIMEM.SYS also switched on the so-called "unreal mode" on processors that supported it. These were at least the 80486 and possibly the 80386 as well (my memory has faded on this point), and certainly not the 80286, as it didn't support segment sizes beyond 64kiB. This mode gave real mode programs access to the whole 4GiB 32-bit address space, by setting data segment limits (sizes) to 4GiB. This was possible by programming segment descriptors in the protected mode and then switching back to the real mode without first resetting the limits to the usual 64kiB value. It worked because, unlike in the protected mode, segment register writes made in the real mode only updated the segment base and not the limit stored in the corresponding descriptor.
IIRC it was not possible for the code segment to use a 4GiB limit in the real mode, as it would malfunction (i.e. it would not work as per real mode expectations), so it was left at 64kiB. According to Intel documentation software was required to reset segment sizes to 64kiB before switching back to the real mode, so this was not an officially supported mode of operation. MS Windows may or may not have made use of this feature in its real mode of operation; I am not sure, although I do believe HIMEM.SYS itself did use it (or otherwise why would it set it up in the first place?). I discovered it by accident in the early 1990s while experimenting with some assembly programming (possibly by trying to read from beyond the end of a segment by using an address size override prefix, a word or a doubleword data quantity and an offset of 0x and not seeing a trap or suchlike) and could not explain where this phenomenon came from, as it contradicted the x86 processor manual I
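The descriptor-cache asymmetry described above is the whole trick behind "unreal mode", and it can be modelled in a few lines. The following is a toy Python sketch, not real hardware semantics; the class and method names are invented for illustration:

```python
# Toy model of the hidden x86 segment-descriptor cache behaviour behind
# "unreal mode", as described in the post above.  The one rule that
# matters: a real-mode segment-register load recomputes the cached base
# (selector * 16) but leaves the cached limit untouched, so a 4 GiB
# limit set up in protected mode survives the switch back to real mode.

class SegReg:
    def __init__(self):
        # Real-mode reset defaults: base 0, 64 KiB limit.
        self.base, self.limit = 0, 0xFFFF

    def load_protected(self, base, limit):
        # Protected-mode load: the descriptor supplies base AND limit.
        self.base, self.limit = base, limit

    def load_real(self, selector):
        # Real-mode load: only the base is recomputed; limit is kept.
        self.base = selector << 4

ds = SegReg()
ds.load_protected(0, 0xFFFFFFFF)  # set a 4 GiB limit in protected mode
ds.load_real(0x1234)              # back in real mode, reload DS
assert ds.base == 0x12340
assert ds.limit == 0xFFFFFFFF     # the big limit persists: "unreal mode"
```

The point is only the asymmetry: the protected-mode load writes both cached fields, while the real-mode load rewrites just the base, which is exactly why the 4GiB data-segment limit survives.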
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, Nov 30, 2018 at 3:28 PM Grant Taylor via cctalk < cctalk@classiccmp.org> wrote: > On 11/30/2018 02:33 PM, Jim Manley via cctalk wrote: > > There's enough slack in the approved offerings that electives can be > > weighted more toward the technical direction (e.g., user interface and > > experience) or the arts direction (e.g., psychology and history). The > idea > > was to close the severely-growing gap between those who know everything > > about computing and those who need to know enough, but not everything, to > > be truly effective in the information-dominant world we've been careening > > toward without nearly enough preparation of future generations. > > I kept thinking to myself that many of the people that are considered > pioneers in computers were actually something else by trade and learned > how to use computers and / or created what they needed for the computer > to be able to do their primary job. > -- > Grant. . . . > unix || die > Most people know that Newton's motivation for developing calculus was explaining the motions of the planets, but not many know that he served as the Warden, and then Master, of the Royal Mint, as well as being fascinated with optics and vision (to the point where he inserted a needle into one of his eyes!) and a closet alchemist. His competitor, Leibniz, was motivated to develop calculus by a strong desire to win more billiards bets from his fellow wealthy buddies in Hanover, the financial capital of Germany at the time, while developing the mathematics of the physics governing the collisions of billiard balls. Babbage was motivated to develop calculating and computing machines to eliminate the worldwide average of seven errors per page in astronomical, navigational, and mathematical tables of the 1820s. Shannon and Hamming (with whom I worked - the latter, not the former!) 
were motivated to represent Boolean logic in digital circuits and improve long-distance communications by formalizing how to predictably ferret more signal out of noise. Turing was motivated to test his computing theories to break the Nazi Enigma ciphers (character-oriented, vs. word-oriented codes) and moved far beyond the mathematical underpinnings of his theories into the engineering of the bombes (Colossus, which attacked the Lorenz teleprinter cipher rather than Enigma, was chiefly Tommy Flowers's engineering). Hollerith was motivated by the requirement to complete the decennial census tabulations within 10 years (the 1890 census was going to take 13 years to tabulate using traditional manual methods within the available budget). Mauchly and Eckert were motivated to automate calculations for ballistics tables for WW-II weapons systems that were being fielded faster than tables could be produced manually. Hopper developed the first compiler and the first programming language to use English words, Flow-Matic, which led, in turn, to COBOL being created to meet financial software needs. John Backus and the other developers of FORTRAN were likewise motivated by scientific and engineering calculation requirements. Kernighan, Ritchie, and Thompson were motivated by a desire to perform an immense prank, in the form of Unix and A/B/BCPL/C, on an unsuspecting and all-too-serious professional computing world ( http://www.stokely.com/lighter.side/unix.prank.html). Gates and Allen were motivated by all of the money lying around on desks, in their drawers, and in the drawers worn by the people sitting at said desks, to foist PC/MS-DOS and Windows on the less serious computing public. Kildall was motivated by the challenges of developing multi-pass compilation on systems with minimal microcomputer hardware resources.
Meanwhile, the rest of the computing field was motivated to pursue the next shinier pieces of higher-performance hardware, developing ever-more-bloated programming languages, OSes, services, and applications that continue to slow down even the latest-and-greatest systems. Berners-Lee was motivated to help scientists and engineers at the European Organization for Nuclear Research (CERN - the Conseil Européen pour la Recherche Nucléaire) organize and share their work without having to become expert software developers in their own right. Yang, Filo, Brin, Page, Zuckerberg, et al., were motivated by whatever money could be scrounged from sofas used by couch-surfing, homeless Millennials (redundant syntax fully intended), and from local news outlets' advertising accounts. Selling everyone's, but their own, personally-identifiable information, probably including that of their own mothers, has been a welcome additional cornucopia of revenue to them. Computer science and engineering degrees weren't even offered yet when I attended the heavily science and engineering oriented naval institution where I earned my BS in engineering (70% of degrees awarded were in STEM fields). The closest you could get were math and electrical engineering degrees, taking the very few electives offered in CS and CE disciplines. Granted, the computer I primarily had access to was a secondhand GE-265
Re: Text encoding Babel. Was Re: George Keremedjiev
I found the bad spot and put a SECTORS.BAD file there, and then was OK.

On Sat, 1 Dec 2018, Maciej W. Rozycki wrote:

> Well, ATA drives at that time should have already had the capability to
> remap bad blocks or whole tracks transparently in the firmware, although

Not even IDE. Seagate ST4096 (ST506/412 MFM) 80MB formatted, which was still considered a good size by those of us who weren't wealthy.

> Of course the ability to remap bad storage areas transparently is not an
> excuse for the OS not to handle them gracefully; back then it was not yet
> the era when a hard drive with a bad block or a dozen was considered
> broken, as it usually is nowadays.

Yes, they still came with a list of known bad blocks, usually taped to the drive. THIS one wasn't on the manufacturer's list, and neither SpeedStor nor SpinRite could find it! There were other ways to lock out a block besides filling it with a garbage file, but that was easiest. And, I did try to tell the Microsoft people that the OS "should recover gracefully from hardware errors". In those words.

I had a font editor that wouldn't tolerate 3.1, and quite a few XTs (no A20), so I continued to keep Win 3.0 on a bunch of machines.

> Did 3.1 support running in the real mode though (as opposed to switching
> to the real mode for DOS tasks only)? I honestly do not remember anymore,
> and ISTR it was removed at one point. I am sure 3.0 did.

I believe that it did. I don't remember WHAT the program didn't like about 3.1, or if there was a real reason, not just an arbitrary limit. I don't think that the Cordata's refusal to run on 286 was based on a real reason.
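For reference, the FAT-level version of "locking out a block" discussed in this thread comes down to writing a reserved marker value into the affected cluster's FAT entry (0xFFF7 on FAT16), so allocators and defragmenters skip it. A toy Python sketch, with a small in-memory list standing in for the on-disk table:

```python
# Sketch: marking a cluster "bad" in a FAT16 table.  The in-memory list
# stands in for the on-disk FAT; real tools (format, chkdsk) would read
# and rewrite the actual FAT sectors.

FAT16_BAD = 0xFFF7   # reserved "bad cluster" marker
FAT16_EOC = 0xFFFF   # end-of-chain marker

def mark_cluster_bad(fat, cluster):
    """Mark one cluster unusable so nothing ever allocates it again."""
    fat[cluster] = FAT16_BAD

def free_clusters(fat):
    """Clusters from 2 up holding 0x0000 are free; bad ones never appear."""
    return [c for c, v in enumerate(fat) if c >= 2 and v == 0x0000]

fat = [0x0000] * 16               # tiny toy FAT
fat[0], fat[1] = 0xFFF8, 0xFFFF   # reserved entries 0 and 1
mark_cluster_bad(fat, 5)
assert 5 not in free_clusters(fat)
```

Unlike the garbage-file trick, the bad-cluster marker survives deletion and defragmentation, which is exactly why it was the canonical method.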
But, the Win 3.1 installation program(s) balked at anything without A20 and a tiny bit of RAM above 10h. I didn't have a problem with having a few dedicated machines (an XT with Cordata interface, an AT with Eiconscript card for PostScript and HP PCL, an AT with Win 3.0 for the font editor, a machine for disk duplication (no-notch disks), order entry, accounting, and lots of machines with lots of different floppy drive types.) I also tested every release of my programs on many variants of the platform (after I discovered the hard way that the 286 had a longer pre-fetch buffer than the 8088!)
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, 30 Nov 2018, Fred Cisin via cctalk wrote:

> I found the bad spot and put a SECTORS.BAD file there, and then was OK.
> The Microsoft Beta program wanted cheerleaders, and ABSOLUTELY didn't want
> any negative feedback nor bug reports, and insisted that the OS had no
> responsibility to recover from nor survive hardware problems, and that
> therefore it was not their problem. I told them that they would soon have
> to do a recall (THAT was EXACTLY what happened with DOS 6.2x). They did
> not invite me to participate in any more Betas.

Well, ATA drives at that time should have already had the capability to remap bad blocks or whole tracks transparently in the firmware, although obviously it took some time for the industry to notice that and catch up with support for the relevant protocol requests in the software tools. It took many years, after all, for PC BIOS vendors to notice that ATA drives generally do report a supported C/H/S geometry (be it real or simulated; I only ever came across one early ATA HDD whose C/H/S geometry was real, all the rest were ZBR), so there is no need for the user to enter it manually for a hard drive to work.

Of course the ability to remap bad storage areas transparently is not an excuse for the OS not to handle them gracefully; back then it was not yet the era when a hard drive with a bad block or a dozen was considered broken, as it usually is nowadays.

> I had a font editor that wouldn't tolerate 3.1, and quite a few XTs (no
> A20), so I continued to keep Win 3.0 on a bunch of machines.

Did 3.1 support running in the real mode though (as opposed to switching to the real mode for DOS tasks only)? I honestly do not remember anymore, and ISTR it was removed at one point. I am sure 3.0 did.

 Maciej
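The C/H/S geometry that those old BIOSes wanted typed in relates to linear block numbers by simple arithmetic; a sketch of the standard conversion (the geometry constants below are illustrative, not from any particular drive):

```python
# Standard C/H/S <-> LBA arithmetic behind the geometry an old BIOS
# wanted entered by hand.  Sector numbers are 1-based by convention;
# cylinders and heads are 0-based.

def chs_to_lba(c, h, s, heads, sectors):
    return (c * heads + h) * sectors + (s - 1)

def lba_to_chs(lba, heads, sectors):
    c, rem = divmod(lba, heads * sectors)
    h, s = divmod(rem, sectors)
    return c, h, s + 1

heads, sectors = 16, 63              # illustrative geometry
lba = chs_to_lba(2, 3, 4, heads, sectors)
assert lba_to_chs(lba, heads, sectors) == (2, 3, 4)  # round-trips
```

A ZBR drive has no single true sectors-per-track value, which is why the geometry it reports is simulated: any (heads, sectors) pair that covers the capacity works, as long as both sides use the same one.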
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sat, 1 Dec 2018, Maciej W. Rozycki via cctalk wrote:

> Be assured there were enough IBM PC clones running DOS around from 1989
> onwards for this stuff to matter, and hardly anyone switched to MS Windows
> before version 95 (running Windows 3.0 with the ubiquitous HGC-compatible
> graphics adapters was sort of fun anyway, and I am not sure if Windows 3.1
> even supported it; maybe with extra drivers).

Depending on which question you are asking, . . . Windows 3.1 definitely did support Hercules video. We had about 3 dozen such machines (386SX) in the school student homework lab. It also supported CGA, but initially didn't come with the driver, so it would work if you upgraded from 3.0 to 3.1, or otherwise used the 3.0 CGA driver.

In August 1991, I went to a Microsoft conference in Seattle. Although it was the anniversary of the 5150, Bill Gates was making appearances on the east coast, instead of being there. They asked our opinion of the NEW flying ["dry rot" disintegrating] window logo, and couldn't believe that we did NOT love it. I found out about, and got a copy of, a CD-ROM "International" Windows 3.0, with many languages, including Chinese! I loved being able to install from CD, instead of boxes of floppies, and was glad that they were at least trying to expand to the rest of the world.

They introduced Windows 3.1. But the borrowed Toshiba laptop that I had with me had 1MB of contiguous RAM, but not A20 support, and 3.1 "NEEDED" 64K above 1MB for HIMEM.SYS, which "SOLVES the problem of not enough RAM". 3.1 also was the first product to force SMARTDRV.SYS. As soon as I got home, I contacted the Win3.1 Beta program to tell them that write-caching without a way to turn it off was a BIG problem. There was a bad spot on the hard drive that I was installing it to that neither SpinRite nor SpeedStor could find, but it consistently crashed the 3.1 installation. But, with the forced write caching, there was NO possible way to recover.
(Without write-caching, you just rename the file that failed, and manually install another copy of that one file.) I found the bad spot and put a SECTORS.BAD file there, and then was OK.

The Microsoft Beta program wanted cheerleaders, and ABSOLUTELY didn't want any negative feedback nor bug reports, and insisted that the OS had no responsibility to recover from nor survive hardware problems, and that therefore it was not their problem. I told them that they would soon have to do a recall (THAT was EXACTLY what happened with DOS 6.2x). They did not invite me to participate in any more Betas.

I had a font editor that wouldn't tolerate 3.1, and quite a few XTs (no A20), so I continued to keep Win 3.0 on a bunch of machines.
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sun, 25 Nov 2018, Liam Proven via cctalk wrote:

> > > For example, right now, I am in my office in Křižíkova. I can't
> > > type that name correctly without Unicode characters, because the ANSI
> > > character set doesn't contain enough letters for Czech.
> >
> > Intriguing. Is there an old MS-DOS Code Page (or comparable technique)
> > that does encompass the necessary characters?
>
> Don't know. But I suspect there weren't many PCs here before the
> Velvet Revolution in 1989. Democracy came around the time of Windows
> 3.0 so there may not have been much of a commercial drive.

Be assured there were enough IBM PC clones running DOS around from 1989 onwards for this stuff to matter, and hardly anyone switched to MS Windows before version 95 (running Windows 3.0 with the ubiquitous HGC-compatible graphics adapters was sort of fun anyway, and I am not sure if Windows 3.1 even supported it; maybe with extra drivers).

Anyway, MS-DOS 5.0 onwards had a complete set of code pages for various regions of the world. For Czechia, Hungary, Lithuania, Poland, and other European countries located towards the east and using languages written in a Latin script, code page 852 was provided. For France, Germany, Spain, the Nordic countries, etc., page 850 was provided. There were other pages included as well, beyond IBM's original page 437, including Greek and Cyrillic ones, but I don't know the details. It's quite likely Wikipedia has them. Of course the HGC didn't support text mode character set switching; however, ISA VGA clones started trickling in at one point too. I still have my ISA Trident TVGA 8900C adapter from 1993 working in one of my machines, though I have since switched to Linux.

NB my last name is also correctly spelled Różycki rather than Rozycki, and the two letters with the diacritics are completely different letters, with sounds that bear no resemblance to the corresponding ones without, i.e. these are not merely accents, which we don't have in Polish at all. (Polish complicates this further in that `ó' sounds the same as `u', and `ż' sounds the same as `rz', which is in turn different from the case where the two letters are written separately; yet the alternatives are not interchangeable, being either invalid or changing the meaning of a word, and many native Polish speakers get them wrong anyway.)

 FWIW, Maciej
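The code-page point can be checked directly with the DOS codecs that ship with Python: CP852 ("Latin-2" DOS) round-trips the Czech and Polish names from this thread, while CP437, the original IBM PC page, cannot encode them at all.

```python
# CP852 covers the Czech and Polish letters discussed above;
# CP437 simply has no code points for them.

czech = "Křižíkova"
polish = "Różycki"

# Round-trip through CP852 preserves both names exactly.
assert czech.encode("cp852").decode("cp852") == czech
assert polish.encode("cp852").decode("cp852") == polish

# CP437 has no 'ř', so encoding the Czech name fails.
try:
    czech.encode("cp437")
except UnicodeEncodeError:
    pass  # expected: 'ř' is not in CP437
else:
    raise AssertionError("expected CP437 to fail on Czech text")
```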
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/30/2018 03:57 PM, Sean Conner via cctalk wrote:

> There are several problems with this. One, how many bits do you set aside
> per character? 8? 16? There are potentially an open ended set of stylings
> that one might use.

I acknowledge that the idea I shared was incomplete and likely has shortcomings. But I do think that it demonstrates a concept, which is what I was after.

> Second problem---where do you store such bits? Not to imply this is a bad
> idea, just that there are issues that need to be resolved with how things
> are done today (how does this interact with UTF-8 for instance? Or
> UCS-4?).

Ideally, I'd like to see UTF-8 / UTF-16 code points (?) for the different styles of a letter. Not every letter (character ~> byte / double) needs the styling. So I suspect that it would be better to judiciously place code points in the UTF-8 / UTF-16 space.

Sadly, when I try to search for "this", the letters aren't found in "푡ℎ푖푠 푖푠 푎 푠푡푟푖푛푔" or "혁헵헶혀 헶혀 헮 헰헼헺헺헲헻혁". That's something that I think should work. Also, storage of these letters can work just like it does in this email. ;-)

-- Grant. . . . unix || die
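The search failure described above happens because the "styled" letters are distinct code points from Unicode's Mathematical Alphanumeric Symbols block, not styled versions of a-z. Unicode compatibility normalization (NFKC) folds them back to plain letters, though at the cost of discarding the styling, which is exactly the trade-off this thread is wrestling with. A quick Python check:

```python
# A literal search for "this" fails against mathematical-italic letters,
# because they are separate code points; NFKC normalization folds them
# back to plain ASCII (and throws the styling away in the process).

import unicodedata

styled = "\U0001D461\u210E\U0001D456\U0001D460"  # mathematical-italic "this"
assert "this" not in styled                       # literal search fails
folded = unicodedata.normalize("NFKC", styled)
assert folded == "this"                           # NFKC recovers the letters
```

(Note the second styled string in the post uses sans-serif bold Hangul-range... no, mathematical sans-serif bold letters; NFKC folds those the same way.)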
Re: Text encoding Babel. Was Re: George Keremedjiev
It was thus said that the Great Keelan Lightfoot via cctalk once stated:

> > I see no reason that we can't have new control codes to convey new
> > concepts if they are needed.
>
> I disagree with this; from a usability standpoint, control codes are
> problematic. Either the user needs to memorize them, or software needs
> to inject them at the appropriate times. There's technical problems
> too; when it comes to playing back a stream of characters, control
> characters mean that it is impossible to just start listening. It is
> difficult to fast forward and rewind in a file, because the only way
> to determine the current state is to replay the file up to that point.

[ and further down the message ... ]

> I'm going to lavish on the unicode for this example, so those of you
> properly unequipped may not see this example:
>
> foo := 푡ℎ푖푠 푖푠 푎 푠푡푟푖푛푔 혁헵헶혀 헶혀 헮 헰헼헺헺헲헻혁
> printf(푡ℎ푒 푠푡푟푖푛푔 푖푠 ① 푖푠푛푡 푡ℎ푎푡 푒푥푐푖푡푖푛푔, foo)
> if 혁헵헶혀 헶혀 헮 헽헼헼헿헹혆 헽헹헮헰헲헱 헰헼헺헺헲헻혁 foo ==
> 푡ℎ푖푠 푖푠 푎푙푠표 푎 푠푡푟푖푛푔, 푏푢푡 푛표푡 푡ℎ푒 푠푎푚푒
> 표푛푒 { 혁헵헶혀 헶혀 헮헹혀헼 헮 헰헼헺헺헲헻혁
> ...
>
> An atrocious example, but a good demonstration of my point. If I had a
> toggle switch on my keyboard to switch between code, comment and
> string, it would have been much simpler to construct too!

Somehow, the compiler will have to know that "푡ℎ푖푠 푖푠 푎 푠푡푟푖푛푔" is a string while "혁헵헶혀 헶혀 헮 헰헼헺헺헲헻혁" is a comment to be ignored. You lamented the lack of a toggle switch for the two, but existing languages, like C, already have them: '"' is the "toggle" for strings, while '/*' and '*/' are the toggles for comments (and now '//' if you are using C99). It's still something you have to "type" (or "toggle" or "switch" or somehow indicate the mode).

The other issue is how such information is stored, and there, I only see two solutions---in-band and out-of-band. In-band would be included with the text.
Something along the lines of (where each marker is introduced by the ASCII ESC character, 27, and this is an example only):

foo := _this is a string\ ^this is a comment\ printf(_the string is [1p isn't that exciting\,foo)

But this has a problem you noted above---it's a lot harder to seek through the file to arbitrary positions. Grant Taylor stated another way of doing this:

> What if there were (functionally) additional bits that indicated various
> other (what I was calling) stylings?
>
> I think that something along those lines could help avoid a concern I
> have. Namely, how do I search for an A, whatever "style" it's in. I
> think I could hypothetically search for bytes ~> words (characters)
> containing the bit pattern 01x1 in the appropriate positions (assuming
> that the preceding don't-cares are set appropriately) and find any
> format of A, upper case, lower case, bold, italic, underline, strike
> through, etc.

There are several problems with this. One, how many bits do you set aside per character? 8? 16? There are potentially an open ended set of stylings that one might use. Second problem---where do you store such bits? Not to imply this is a bad idea, just that there are issues that need to be resolved with how things are done today (how does this interact with UTF-8 for instance? Or UCS-4?).

Then there's out-of-band storage, which stores such information outside the text (an example---I'm not saying this is the only way to store such information out-of-band):

foo := this is a string this is a comment printf(the string is 1 isn't that exciting,foo)

---

string 8-23
string 50-63
string 65-84
replacement 64
comment 25-41

This has its own problems---namely, how do you keep the two together? It will either be a separate file, which could get separated, or part of the text file, but then you run into the problem of reading Microsoft Word files circa 1986 with today's tools.

-spc (I like the ideas, but the implementations are harder than it first appears ... )
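The out-of-band layout described above can be sketched concretely: the text stays plain and searchable, and the roles live in a side table of (kind, start, end) ranges. The offsets and role names below are illustrative, following the style of the example given:

```python
# Out-of-band annotation: plain text plus a side table of ranges.
# Searching the text needs no knowledge of the annotations at all.

text = 'foo := this is a string this is a comment'
annotations = [
    ("string",  7, 23),   # covers 'this is a string'
    ("comment", 24, 41),  # covers 'this is a comment'
]

def spans(text, annotations, kind):
    """Return the text covered by every annotation of the given kind."""
    return [text[a:b] for k, a, b in annotations if k == kind]

# Plain-text search still works, because the text carries no markers:
assert "this is" in text
assert spans(text, annotations, "comment") == ["this is a comment"]
```

The downside is exactly the one -spc names: the ranges are offsets into one specific version of the text, so any edit to the text invalidates the table unless both are updated together.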
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/30/2018 02:33 PM, Jim Manley via cctalk wrote: There's enough slack in the approved offerings that electives can be weighted more toward the technical direction (e.g., user interface and experience) or the arts direction (e.g., psychology and history). The idea was to close the severely-growing gap between those who know everything about computing and those who need to know enough, but not everything, to be truly effective in the information-dominant world we've been careening toward without nearly enough preparation of future generations. I kept thinking to myself that many of the people that are considered pioneers in computers were actually something else by trade and learned how to use computers and / or created what they needed for the computer to be able to do their primary job. -- Grant. . . . unix || die
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/30/2018 11:34 AM, Keelan Lightfoot via cctalk wrote: Thanks! :-) Both. In the beginning we were content, because the keyboard was well suited to the capabilities of the technology available at the time it was invented. We didn't see a better way, because when compared to using a pen and paper (for writing) or using toggle switches (to control a computer), a keyboard was a significant improvement. It's the explosive growth and universal adoption of computers that has locked us in to the keyboard as the standard. *sigh* Steve G. from Security Now's comments about passwords not going away come to mind and seem apropos for keyboards. There are other, likely better, things out there. But keyboards themselves aren't going to go away. I disagree with this; from a usability standpoint, control codes are problematic. Either the user needs to memorize them, or software needs to inject them at the appropriate times. Okay. There's technical problems too; when it comes to playing back a stream of characters, control characters mean that it is impossible to just start listening. It is difficult to fast forward and rewind in a file, because the only way to determine the current state is to replay the file up to that point. Now I'm wondering about something akin to the differences in upper case and lower case. Functionally the same code, just a different value in the 6th bit. What if there were (functionally) additional bits that indicated various other (what I was calling) stylings? I think that something along those lines could help avoid a concern I have. Namely, how do I search for an A, whatever "style" it's in. I think I could hypothetically search for bytes ~> words (characters) containing the bit pattern 01x1 in the appropriate positions (assuming that the preceding don't-cares are set appropriately) and find any format of A, upper case, lower case, bold, italic, underline, strike through, etc.
The other thing that the additional bits / flags could do is allow the bytes (words / characters) to be read mid-stream. Do you mean modal control codes? As in "everything after here is bold" and "the bold stops here"? Yes. That's what I was thinking when I wrote that. We've gone backwards sadly. For a brief while, this kind of rich user interface stuff was provided by the OS. A text box, regardless of the application, would use the OS's text box control, and would have a universal interface for rich text. Indeed. But the growth of the web has resulted in an atavism. We're back to plain text, and using markup to style our text. I mostly agree. But I do wonder how true that actually is, at least on a technical level. I think the text input box can be enhanced to allow more than just plain text. If I want bold text in Slack, I have to use markup. Facebook Messages and YouTube comments also support markup, but the syntax is slightly different between them. *sigh* Back in 1991, if I wanted bold text in any application that supported rich text on my SE/30, I hit command-B and I got bold text. Sure, there are Javascript rich text editors that can be bolted on, but they all have their own UI concepts, and they're all a trainwreck. I believe that we can do better. In addition to crusty old computers, I also enjoy the company of three crusty old Linotypes. In fact, that's what got me thinking about this stuff in the first place. The Linotype keyboard has 90 keys, which directly map to the 90 glyphs a Linotype can "render". The keyboard is laid out in three equal-sized sections: lowercase letters on the left, uppercase on the right, with numbers and punctuation in the middle. Push the button, and what's marked on the button is what ultimately ends up on the page. Each Linotype mat (matrix; letter mold) has two positions, which can be selected by flipping a little lever when they're being assembled into a line.
The two positions are almost always used to select between two versions of a font; roman/bold or roman/italic are the most common pairings. Intriguing. I have a vague mental image of what you're talking about after watching Linotype: The Film (http://www.linotypefilm.com). I found it quite entertaining and informative. But what it means is that you can walk up to a machine with a half-typed line in the assembler and immediately determine its state. Any mats set in the bold position are in a physically different position in the assembler. The position of the switch tells you if you're typing in bold or roman. When you push the 'A' key, you know an uppercase 'A' in bold will be added to the line. Additionally, the position of that switch can be verified without taking your eyes off of the copy. There is no black magic, no spooky action at a distance. The capabilities of the machine are immediately apparent. I was not aware of the physically different positions. But either I don't remember, pick up on, or they
Re: Text encoding Babel. Was Re: George Keremedjiev
> Back on topic, the tools exist, but they are often seen as toys and > not serious software > development tools. Are we at the point where the compiler for a visual > programming > language is written in the visual programming language? > > - Keelan > Hi Keelan, I was going to mention this further back in the thread when visual programming was first mentioned, but for those not aware, there has been a shift in emphasis in teaching computing principles to newbies who have no idea what a bit, byte, assembler, compiler, interpreter, etc., are. UC Berkeley's "The Beauty and Joy of Computing" (and a follow-on "The Beauty and Joy of Data", offered at some institutions) curricula are increasingly being taught (starting in high school advanced placement computer science, as well as in freshman coursework in universities) to convey fundamental computing concepts: https://bjc.berkeley.edu The associated courses are taught using a visual programming environment called Snap!, where the (now browser-based, thank goodness) ease-of-use of Scratch (drag-and-drop interface, visual metaphors for loops, conditionals, etc., as well as easy animation tools) is combined with the power of Scheme (first class procedures, first class lists, first class objects, and first class continuations). https://snap.berkeley.edu Some universities have begun offering Bachelor of Arts degrees in CS, in addition to BSCSs, where about half of the BACS coursework is technically-oriented, and the remainder is oriented to more traditional arts offerings. These curricula and Snap! form a bridge so that students who ordinarily would never even consider studying CS can become knowledgeable enough to truly comprehend and appreciate computing's possibilities and limitations in its role in civilization (or at least what's left of it). 
There's enough slack in the approved offerings that electives can be weighted more toward the technical direction (e.g., user interface and experience) or the arts direction (e.g., psychology and history). The idea was to close the rapidly widening gap between those who know everything about computing and those who need to know enough, but not everything, to be truly effective in the information-dominant world we've been careening toward without nearly enough preparation of future generations. I haven't worked with Snap! enough yet to know for sure whether it can be used to develop itself, but I strongly suspect that is the case (it's actually implemented in JavaScript using an HTML5 canvas due to its browser-based nature). It wouldn't be suitable for doing systems level development, unless optimized C code (or equivalent) could be emitted, but it could certainly be used to demonstrate the logic principles involved in any level of software development that most people are ever likely to need to understand. There's mention of Snap! programs being convertible to mainstream programming languages such as Python, JavaScript, C, etc., but I haven't traced to ground in documentation how that's supposed to happen, yet. We may be part-way there because Google's Blockly spin-off of Scratch can already emit five scripting languages (JavaScript, Python, PHP, Lua, and Dart), and it uses a modular approach where emission of code in additional languages could reportedly be added. That magic word, "optimized", is the key to whether the code is fundamentally correct and would need oodles of hand-rewriting to improve efficiency, or there are ways to automate at least some of the optimization. Snap! can be run off-line in a browser, as well as on the on-line primary and mirror sites, and standalone applications can be generated. 
Scratch has been extended to provide an easy way to control and sense physical environments via typical robotics components, but I haven't looked to see if Snap! has inherited those extensions. For any doubters, note that Pacman was ported to Scratch years ago, complete with the authentic sounds (including the "shrivel and disappear-in-death" clip), so ... ;^) All the Best, Jim
Re: Text encoding Babel. Was Re: George Keremedjiev
> Welcome. :-) Thanks! > Do you think that we stopped enhancing the user input experience more > because we were content with what we had or because we didn't see a > better way to do what we wanted to do? Both. In the beginning we were content, because the keyboard was well suited to the capabilities of the technology available at the time it was invented. We didn't see a better way, because when compared to using a pen and paper (for writing) or using toggle switches (to control a computer), a keyboard was a significant improvement. It's the explosive growth and universal adoption of computers that has locked us in to the keyboard as the standard. > I agree that markup languages are a kludge. But I don't know that they > require plain text to describe higher level concepts. > > I see no reason that we can't have new control codes to convey new > concepts if they are needed. I disagree with this; from a usability standpoint, control codes are problematic. Either the user needs to memorize them, or software needs to inject them at the appropriate times. There are technical problems too; when it comes to playing back a stream of characters, control characters mean that it is impossible to just start listening. It is difficult to fast forward and rewind in a file, because the only way to determine the current state is to replay the file up to that point. > Aside: ASCII did what it needed to do at the time. Times are different > now. We may need more / new / different control codes. > > By control codes, I'm meaning a specific binary sequence that means a > specific thing. I think it needs to be standardized to be compatible > with other things -or- it needs to be considered local and proprietary > to an application. Do you mean modal control codes? As in "everything after here is bold" and "the bold stops here"? > I actually wonder how much need there is for /all/ of those utilities. 
> I expect that things should have streamlined and simplified, at least > some, in the last 30 years. We've gone backwards sadly. For a brief while, this kind of rich user interface stuff was provided by the OS. A text box, regardless of the application, would use the OS's text box control, and would have a universal interface for rich text. But the growth of the web has resulted in an atavism. We're back to plain text, and using markup to style our text. If I want bold text in Slack, I have to use markup. Facebook Messages and YouTube comments also support markup, but the syntax is slightly different between them. Back in 1991, if I wanted bold text in any application that supported rich text on my SE/30, I hit command-B and I got bold text. Sure, there are JavaScript rich text editors that can be bolted on, but they all have their own UI concepts, and they're all a trainwreck. > What would you like to do or see done differently? Even if it turns out > to be worse, it would still be something different and likely worth > trying at least once. In addition to crusty old computers, I also enjoy the company of three crusty old Linotypes. In fact, that's what got me thinking about this stuff in the first place. The Linotype keyboard has 90 keys, which directly map to the 90 glyphs a Linotype can "render". The keyboard is laid out in three equal-sized sections: lowercase letters on the left, uppercase on the right, with numbers and punctuation in the middle. Push the button, and what's marked on the button is what ultimately ends up on the page. Each Linotype mat (matrix; letter mold) has two positions, which can be selected by flipping a little lever when they're being assembled into a line. The two positions are almost always used to select between two versions of a font; roman/bold or roman/italic are the most common pairings. But what it means is that you can walk up to a machine with a half-typed line in the assembler and immediately determine its state. 
Any mats set in the bold position are in a physically different position in the assembler. The position of the switch tells you if you're typing in bold or roman. When you push the 'A' key, you know an uppercase 'A' in bold will be added to the line. Additionally, the position of that switch can be verified without taking your eyes off of the copy. There is no black magic, no spooky action at a distance. The capabilities of the machine are immediately apparent. > I don't think of bold or italic or underline as second class concepts. > I tend to think of the following attributes that can be applied to text: > > · bold > [snip] > > I don't think that normal is superior to the other four (five) in any > way. I do think that normal does occur VASTLY more frequently than > any combination of the others. As such normal is what things default to > as an optimization. IMHO that optimization does not relegate the other > styles to second class. I agree. I think that they're normal enough that they should exist as their own code points in Unicode. Our
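As it happens, Unicode already encodes a limited version of this wish: the Mathematical Alphanumeric Symbols block (U+1D400 onward) assigns distinct code points to bold, italic, and other letter styles, and NFKC normalization folds them back to plain letters for searching. A short Python sketch (the `to_math_bold` helper is my own illustration):

```python
import unicodedata

BOLD_A = 0x1D400  # U+1D400 MATHEMATICAL BOLD CAPITAL A; A-Z are contiguous

def to_math_bold(ch: str) -> str:
    """Map A-Z onto the Unicode mathematical bold capitals; pass others through."""
    if 'A' <= ch <= 'Z':
        return chr(BOLD_A + ord(ch) - ord('A'))
    return ch

bold = ''.join(to_math_bold(c) for c in 'HELLO')
print(bold)                                 # 𝐇𝐄𝐋𝐋𝐎
print(unicodedata.normalize('NFKC', bold))  # HELLO
```

So a search that normalizes first finds "any format of A", exactly the property wanted earlier in the thread, though these code points are officially intended for mathematical notation, not general styling.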
Re: Text encoding Babel. Was Re: George Keremedjiev
Some computing economics history: I'm an engineer and scientist by both education and experience, and one major difference between the disciplines is that engineers are required to pass coursework and demonstrate proficiency in economics. That's because we need to deliver things that actually do what customers think they paid for within strict budgets and schedules, or we go hungry. Scientists, on the other hand, if they can accurately predict what it will cost to prove a theory, aren't practicing science, because they have to already know the outcome and are taking no risk. A theoretically "superior" encoding may not see practical use by a significant number of people because of legacy inertia that often makes no sense, but is rooted in cultural, sociological, emotional, and other factors, including economics. Dvorak computer keyboards are allegedly far more efficient speed/accuracy-wise than QWERTY computer keyboards, so they should rule the computing world, but they don't. Keyboards that reduce the risk of repetitive stress injuries (e.g., carpal tunnel syndrome) should dominate the market for very sensible health reasons, but they don't, either. Legacy inertia is a beyotch to overcome, especially when international-level manufacturers and investors have a strong interest in making lots of money from the status quo. Logic and reasoning are simply nowhere near enough to create the conditions necessary for widespread adoption - sometimes it's just good luck in timing (or bad luck, as the case may be). ASCII was developed in an age when Teletypes and similar devices were the only textual I/O options, with fixed-width/size/style typefaces (font family is an attribute of a typeface - there's no such thing as a "font"). By the late 1950s, there were around 250 computer manufacturers, and none of their products were interoperable in any form. 
Until the IBM 360 was released in 1965, IBM had 14 product _lines_ that were incompatible with each other, despite having 20,000+ very capable scientists and engineers on their payroll. You can't blame the ASCII developers for lack of foresight when no one in their right mind back then would have ever predicted we could have upwards of a trillion bytes of storage in our pockets (e.g., the Samsung Note 9), much less multi-megapixel touch displays with millions of colors, with worldwide-reaching cellular/Internet access with milliseconds of round-trip response, etc. Someone thinking that they're going to make oodles of money from some supposedly new-and-improved proprietary encoding "standard" that discards five-plus decades of legacy intellectual and economic investment is pursuing a fool's errand. Even companies with resources at the level of Apple, Google, Microsoft, etc., aren't that arrogant, and they've demonstrated some pretty heavy-duty chutzpah over time. BTW, you won't be able to patent what apparently amounts to a lookup table, and even if you copyright it, it will be a simple matter of developing functionally-equivalent code that performs a translation on-the-fly. See also the clever schemes where DVD encryption keys that had been left on an unprotected server accessible via the Internet were transformed into prime numbers that didn't infringe on the copyrights associated with the keys. True standards are open nowadays - the days of proprietary "standards" are a couple of decades behind us - even Microsoft has been publishing the binary structure of their Office document file formats. The specification for Word, which includes everything going back to v 1.0, is humongous, and even they were having fits trying to maintain the total spec, which is reportedly why they went with XML to create the .docx, .xlsx, .pptx, etc., formats. 
That also happened to make it possible to placate governments (not to mention customers) that are looking for any hint of anti-competitive behavior, and thus also made it easier for projects such as OpenOffice and LibreOffice to flourish. Typographical bigots, who are more interested in style than content, were safely fenced off in the back rooms of publishing houses and printing plants until Apple released the hounds on an unsuspecting public. I'm actually surprised that the style purists haven't forced Smell-o-Vision technology on The Rest of Us to ensure that the musty smell of old books is part of every reading "experience" (I can't stand the current common use of that word). At least I have the software chops to transform the visual trash that passes for "style" these days into something pleasing to _my_ eyes (see what I did there with "severely-flawed" ASCII? Here's how you can do /italics/ and !bold! BTW.). Nothing frosts me more than reading text that can't be resized and auto-reflowed, especially on mobile devices with extremely limited display real estate. I'm fully able-bodied and I'm perturbed by such bad design, so, I'm pretty sure that pages that prevent pinch-zooming, and that don't allow for direct
Re: Text encoding Babel. Was Re: George Keremedjiev
On Wed, 28 Nov 2018 at 09:27, Paul Koning via cctalk wrote: > I learned it about 15 years ago (OpenAPL, running on a Solaris workstation > with a modified Xterm that handled the APL characters). Nice. It made a > handy tool for some cryptanalysis programs I needed to write. > I am interested in this cryptanalysis program... > I wonder if current APL implementations use the Unicode characters for APL, > that would make things easy. > I can confirm that both NARS 2000 and Dyalog APL both use the Unicode APL characters. Regards, Christian -- Christian M. Gauger-Cosgrove STCKON08DS0 Contact information available upon request.
Re: Text encoding Babel. Was Re: George Keremedjiev
> On Nov 27, 2018, at 9:23 PM, Fred Cisin via cctalk > wrote: > >>> I have long wondered if there are computer languages that aren't rooted >>> in English / ASCII. I feel like it's rather pompous to assume that all >>> programming languages are rooted in English / ASCII. I would hope that >>> there are programming languages that are more specific to the region of >>> the world they were developed in. As such, I would expect that they >>> would be stored in something other than ASCII. > > On Tue, 27 Nov 2018, William Donzelli via cctalk wrote: >> APL. > > APL requires adding additional characters. That was a major obstacle to > acceptance, both in terms of keyboard and type ball (my use preceded CRT), > but also asking the user/programmer to learn new characters. I loved APL! I learned it about 15 years ago (OpenAPL, running on a Solaris workstation with a modified Xterm that handled the APL characters). Nice. It made a handy tool for some cryptanalysis programs I needed to write. I wonder if current APL implementations use the Unicode characters for APL; that would make things easy. > I love the use of an arrow for assignment. ... One of the strangest programming languages I've used is POP-2, which we used in an AI course (Expert Systems) at the University of Illinois, in 1976. Taught by a visiting prof from the University of Edinburgh, I think Donald Michie but I may have the name confused. Like APL, POP-2 had the same associativity for all operators. Unlike APL, the designers decided that the majority should win so assignment would be left-associative like everything else -- rather than APL's rule that all the other operators are right-associative like assignment. So you'd end up with statements like: n + 1 -> n More at https://en.wikipedia.org/wiki/POP-2 paul
Re: Text encoding Babel. Was Re: George Keremedjiev
On Tue, 27 Nov 2018 at 20:47, Grant Taylor via cctalk wrote: > > I don't think that HTML can reproduce fixed page layout like PostScript > and PDF can. It can make a close approximation. But I don't think HTML > can get there. Nor do I think it should. There is a wider panoply of options to consider. For instance, Display PostScript, and come to that, arguably, NeWS. Also, modern document-specific markups. I work in DocBook XML, which I dislike intensely. There's also, at another extreme, AsciiDoc (and Markdown (in various "flavours")), reStructuredText, and similar "lightweight" MLs: http://hyperpolyglot.org/lightweight-markup But there are, of course, rivals. DITA is also widely used. And of course there are things like LyX/LaTeX/TeX, which some find readable. I am not one of them. But I get paid to do DocBook, I don't get paid to do TeX. Neal Stephenson's highly enjoyable novel /Seveneves/ contains some interesting speculations on the future of the Roman alphabet and what close contact with Cyrillic over a period will do to it. Aside: [[ > I'm not personally aware of any cases where ASCII limits programming > languages. But my ignorance does not preclude that situation from existing. APL and ColorForth, as others have pointed out. > I have long wondered if there are computer languages that aren't rooted > in English / ASCII. https://en.wikipedia.org/wiki/Qalb_(programming_language) More generally: https://en.wikipedia.org/wiki/Non-English-based_programming_languages Personally I am more interested in non-*textual* programming languages. A trivial candidate is Scratch: https://scratch.mit.edu/ But ones that entirely subvert the model of using linear files containing characters that are sequentially interpreted are more interesting to me. I blogged about one family I just discovered last week: https://liam-on-linux.livejournal.com/60054.html The videos are more or less _necessary_ here, because trying to describe this in text will fail _badly_. 
Well worth a couple of hours of anyone's time. ]] Anyway. To return to text encodings. Again I wish to refer to a novel; to Kim Stanley Robinson's "Mars trilogy", /Red Mars/, /Green Mars/ and /Blue Mars/. Or as a friend called them, "RGB Mars" or even "Technicolor Mars". A character presents an argument that if you try to summarise many things on a scale -- e.g. for text encodings, from simplicity and readability, to complexity and capability -- you can't encapsulate any sophisticated system. He urges a 4-cornered system, using the example of the "four humours": phlegm, bile, choler and sang. The opposed corners of the diagram are as important as the sides of the square; characteristics form the corners, but the intersections between them are what defines us. So. There is more than one scale here. At one extreme, we could have the simplest possible text encoding. Something like Morse code or Braille, which omits almost all "syntax" -- almost no punctuation, no carriage returns or anything like that, which are _metadata_, they are information about how to display the content, not content themselves. Not even case is encoded: no capitals, no minuscule letters. But of course a number of alphabets don't have that distinction, and it's not essential in the Roman alphabet. Slightly richer, but littered with historical baggage from its origins in teletypes: ASCII. Much richer, but still not rich enough for all the Roman-alphabet-using-languages: ANSI. Insanely rich, but still not rich enough for all the written languages: Unicode. (What plane? What encoding? What version, even?) At the other extreme, markup languages that either weren't really intended for humans but often are written by them -- e.g. the SGML/XML family -- or are only usable by relatively few humans -- e.g. the TeX family -- or that are almost never used by humans, e.g. PostScript, or HP PCL. And what I find a fairly happy medium -- AsciiDoc, say. 
Perfectly readable by untrained people as plain ASCII, can be written with mere hours of study, if that, but also can be processed and rendered into something much prettier. The richer the encoding, the harder it is for *humans* to read, and the more complex the software to handle it needs to be. So, yes, ASCII is perhaps too minimal. ANSI is just a superset. But I'd argue that there _should_ be a separation between at least 2, maybe 3 levels, and arguably more. #1 Plain text encoding. Ideally able to handle all the characters in all forms of the Latin alphabet, and single-byte based. Drop ASCII legacy baggage such as backspace, bell, etc. #2 Richer text, with simple markup, but human-readable and human-writable without needing much skill or knowledge. Along the lines of Markdown or *traditional* /email/ _formatting_ perhaps. #3 Formatted text, with embedded control codes. The Oberon OS does this. #4 Full 1980s word-processor-style document, with control codes, formatting, font and page layout features, etc. #5
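Level #2 of the scheme above is cheap to prototype. Here is a minimal Python sketch (the regex conventions are my own ad-hoc choices, not any standard) that renders *bold*, /italic/ and _underline_ email-style markup using ECMA-48 SGR escape sequences:

```python
import re

# Traditional email emphasis markers mapped to SGR on/off sequences
SGR = {'*': ('\x1b[1m', '\x1b[0m'),   # bold
       '/': ('\x1b[3m', '\x1b[0m'),   # italic
       '_': ('\x1b[4m', '\x1b[0m')}   # underline

def render(text: str) -> str:
    """Render *bold*, /italic/ and _underline_ markup as terminal escapes."""
    for mark, (on, off) in SGR.items():
        pattern = re.escape(mark) + r'(\w+)' + re.escape(mark)
        text = re.sub(pattern, on + r'\1' + off, text)
    return text

print(render('this is *important* and /subtle/'))
```

The point is how little machinery level #2 needs compared to level #3 or #4: a handful of visible, typeable marker characters, readable as plain ASCII even when not rendered.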
Re: Text encoding Babel. Was Re: George Keremedjiev
On Wed, 28 Nov 2018 at 08:05, Fred Cisin via cctalk wrote: > > He also created the Canon Cat. > > His idea of a user interface included that the program should KNOW > (assume) what the user wanted to do. One of my heroes. I've never used a Cat or his other software UIs, but the demos I've seen are enough to make me wonder at how much we have lost already, and secondarily, if it would be possible to code up a Raskin-style editor in Emacs. It's about the only editor I know that's smart enough and programmable enough. Unfortunately, I also find it horrible to use and don't know how to do this. -- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
Re: Text encoding Babel. Was Re: George Keremedjiev
Why not a language even more self-documenting than COBOL, wherein the main body is text, and special markers to identify the CODE that corresponds? On Wed, 28 Nov 2018, Sean Conner wrote: In the book _Programmers at Work_ there's a picture of a program Jef Raskin [1] wrote that basically embeds BASIC into a word processor document. [1] He started the Macintosh project at Apple. It was later taken over by Steve Jobs and taken in a different direction. He also created the Canon Cat. His idea of a user interface included that the program should KNOW (assume) what the user wanted to do. I showed him WHY the OS shouldn't go ahead (without asking for confirmation!) and format a disk that it couldn't read. I do not know whether that change got made before commercial release. (The Cat, incidentally, was SS 512 bytes per sector, with 10 sectors per track. Sometimes described as 256K (because use was primarily for imaging 256K of RAM), but sometimes [more accurately] as 384K) Before his death, I almost ended up with his electric van (Subaru 600 based). There was not enough computational capability in it for that to be on-topic here.
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/27/2018 9:11 PM, Sean Conner via cctalk wrote: But I can still load and read circa-1968-plain-text files without issue, on a computer that didn't even exist at the time, using tools that didn't exist at the time. The same can't be said for a circa-1988-Microsoft-word file. It requires either the software of the time, or specialized software that understands the format. But where do I find the 1968 plain text files? Right now I am looking for free online books on computers and computer science books in the 1971 to 1977 year range. A fictional example: "HAL 9000 programming" AI BOOTSTRAPPING WITH A LISP 1st edition. Useful knowledge for back then. "HAL 9000 programming" HOW the AI BOOTSTRAPS windows 1000 in HOT JAVA 2001 edition. Not so useful for historic knowledge. Looking to write a simple integer language, as I have no floating point yet on my 1973-1974-ish paper computer design. And yes it is 18 bits and TTL. Right now I am programming in C for what little quick and dirty software I have written and digging around for ideas. It would be nice if bitsavers could have the old 1st edition books. The latest may sell but old knowledge is being lost. Ben.
Re: Text encoding Babel. Was Re: George Keremedjiev
It was thus said that the Great Fred Cisin via cctalk once stated: > > >>I like the C comment example; Why do I need to call out a comment with > >>a special sequence of letters? Why can't a comment exist as a comment? > > Why not a language even more self-documenting than COBOL, wherein the main > body is text, and special markers to identify the CODE that corresponds? In the book _Programmers at Work_ there's a picture of a program Jef Raskin [1] wrote that basically embeds BASIC into a word processor document. -spc [1] He started the Macintosh project at Apple. It was later taken over by Steve Jobs and taken in a different direction.
Re: Text encoding Babel. Was Re: George Keremedjiev
It was thus said that the Great Keelan Lightfoot via cctalk once stated: > I'm a bit dense for weighing in on this as my first post, but what the heck. > > Our problem isn't ASCII or Unicode, our problem is how we use computers. > > Going back in time a bit, the first keyboards only recorded letters > and spaces, even line breaks required manual intervention. As things > developed, we upgraded our input capabilities a little bit (return > keys! delete keys! arrow keys!), but then, some time before graphical > displays came along, we stopped upgrading. We stopped increasing the > capabilities of our input, and instead focused on kludges to make them > do more. We created markup languages, modifier keys, and page > description languages, all because our input devices and display > devices lacked the ability to comprehend anything more than letters. > Now we're in a position where we have computers with rich displays > bolted to a keyboard that has remained unchanged for 150 years. Do you have anything in particular in mind? > Unpopular opinion time: Markup languages are a kludge, relying on > plain text to describe higher level concepts. TeX has held us back. > It's a crutch so religiously embraced by the people that make our > software that the concept of markup has come to be accepted "the way". > I worked with some university students recently, who wasted a > ridiculous amount of time learning to use LaTeX to document their > projects. Many of them didn't even know that page layout software > existed, they thought there was this broad valley in capabilities with > TeX on one side, and Microsoft Word on the other. They didn't realize > that there is a whole world of purpose built tools in between. Rather > than working on developing and furthering our input capabilities, > we've been focused on keeping them the same. Markup languages aren't > the solution. They are a clumsy bridge between 150 year old input > technology and modern display capabilities. 
> > Bold or italic or underlined text shouldn't be a second class concept, > they have meaning that can be lost when text is conveyed in > circa-1868-plain-text. But I can still load and read circa-1968-plain-text files without issue, on a computer that didn't even exist at the time, using tools that didn't exist at the time. The same can't be said for a circa-1988-Microsoft-word file. It requires either the software of the time, or specialized software that understands the format. > I've read many letters that predate the > invention of the typewriter, emphasis is often conveyed using > underlines or darkened letters. We've drawn this arbitrary line in the > sand, where only letters that can be typed on a typewriter are "text", > Everything else is fluff that has been arbitrarily decided to convey > no meaning. I think it's a safe argument to make that the primary > reason we've painted ourselves into this unexpressive corner is > because of a dogged insistence that we cling to the keyboard. There were conventions developed for typewriters to get around this. Underlining text indicated italicized text (if the typewriter didn't have the capability---some did). In fact, typewriters have more flexibility than computers do even today. Within the restriction of a typewriter (only characters and spaces) you could use the back-space key (which did not erase the previous character) and re-type the same character to get a bold effect. You could back-space and hit the underscore to get underlined text. You could back-space and hit the ` key to get a grave accent, and the ' to get an acute accent. With a bit more fiddling with the back-space and adjusting the paper via the platen, you could get umlauts (either via the . or ' keys). I think the original intent of the BS control character in ASCII was to facilitate this behavior, but alas, nothing ever did. Shame, it's a neat concept. 
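The typewriter overstrike conventions described above can be emulated in a few lines with ASCII BS. The encoding sketched here is the same one nroff output uses (character, BS, character for bold; underscore, BS, character for underline), though the function names are mine:

```python
BS = '\b'  # ASCII BS (0x08): back up without erasing, as on a typewriter

def overstrike_bold(text: str) -> str:
    """Strike each character twice in place, typewriter-style."""
    return ''.join(c + BS + c for c in text)

def overstrike_underline(text: str) -> str:
    """Underscore, back-space, then the character itself."""
    return ''.join('_' + BS + c for c in text)

print(repr(overstrike_bold('Hi')))       # 'H\x08Hi\x08i'
print(repr(overstrike_underline('Hi')))  # '_\x08H_\x08i'
```

Piped through a pager that understands overstriking, the output displays as bold and underlined text; on a printing terminal it would literally strike the characters twice.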
> I like the C comment example; Why do I need to call out a comment with > a special sequence of letters? Why can't a comment exist as a comment? The smart-ass answer is "because the compiler only looks at a stream of text and needs a special marker" but I get the deeper question---is a plain text file the only way to program? No. There are other ways. There are many attempts at so-called "visual languages" but none of them have been used to any real extent. Yes, there are languages like Visual Basic or Smalltalk, but even with those, you still type text for the computer to run. The only truly alternative programming language I know of is Excel. Seriously. That's about the closest thing you get to a comment existing as a comment without special markers, because you don't include those as part of the program (specifically, you will exclude those cells from the computation lest you get an error). > Why is a comment a second class concept? When I take notes in the > margin, I don't explicitly need to call them out as notes. This > extends to
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/27/18 6:23 PM, Fred Cisin via cctalk wrote: > I love the use of an arrow for assignment. In teaching, a student's > FIRST encounter with programming can be daunting. Use of an equal sign > immediately runs up against the long in-grained concept of commutative > equality. You would be surprised how many first time students try to > say 3 = X . Then, of course,
> N = 1
> N = N + 1
> is a mathematical "proof by induction" that all numbers are equal! > (Don't let a mathematician see that, or the universe will cease to > exist, and be replaced by something even more inexplicable!) It's worth noting that in 1963 ASCII, hex 5E was the up-arrow (now the circumflex) and hex 5F was the left-arrow (now underline). It's also worth noting that in the original CDC 6-bit display code, there were symbols not only for the left-arrow, but also for not-equals, logical OR and AND, up- and down-arrows, equivalence, logical NOT, less-than-or-equal, and greater-than-or-equal--pretty much the original Algol 60 special characters. --Chuck
Re: Text encoding Babel. Was Re: George Keremedjiev
It was thus said that the Great Grant Taylor via cctalk once stated: > On 11/27/2018 04:43 PM, Keelan Lightfoot via cctalk wrote: > > > >Unpopular opinion time: Markup languages are a kludge, relying on plain > >text to describe higher level concepts. > > I agree that markup languages are a kludge. But I don't know that they > require plain text to describe higher level concepts. > > I see no reason that we can't have new control codes to convey new > concepts if they are needed. > > Aside: ASCII did what it needed to do at the time. Times are different > now. We may need more / new / different control codes. > > By control codes, I'm meaning a specific binary sequence that means a > specific thing. I think it needs to be standardized to be compatible > with other things -or- it needs to be considered local and proprietary > to an application. [ snip ] > I don't think of bold or italic or underline as second class concepts. > I tend to think of the following attributes that can be applied to text: > > · bold > · italic > · overline > · strike through > · underline > · superscript exclusive or subscript > · uppercase exclusive or lowercase > · opposing case > · normal (none of the above) But there are defined control codes for that (or most of that list anyway). It's not ANSI, but an ISO standard. Let's see ... ^[[1m bold ^[[3m italic ^[[53m overline ^[[9m strike through ^[[4m underline ^[[0m normal The superscript/subscript could be done via another font ^[[11m ... ^[[19m Maybe even the opposing case case ... um ... yeah. By the way, ^[ is a single character representing the ASCII ESC character (27). > I see no reason that the keyboard can't have keys / glyphs added to it. > > I'm personally contemplating adding additional keys (via an add on > keyboard) that are programmed to produce additional symbols. 
I > frequently use the following symbols and wish I had keys for easier > access to them: ≈, ·, ¢, ©, °, …, —, ≥, ∞, ‽, ≤, µ, > ≠, Ω, ½, ¼, ⅓, ¶, ±, ®, §, ¾, ™, ⅔, ¿, ⊕. Years ago I came across an IBM Model M keyboard that had the APL character set on the keyboard, along with the normal characters one finds. I would have bought it on the spot if it weren't for a friend of mine who saw it 10 seconds before I did. I did recently get another IBM Model M keyboard (an SSK model) that had additional labels on the keys: http://boston.conman.org/2018/10/31.2 The nice thing about the IBM Model M is the keycaps are easy to replace. > I will concede that many computers and / or programming languages do > behave based on text. But I am fairly confident that there are some > programming languages (I don't know about computers) that work > differently. Specifically, simple objects are included as part of the > language and then more complex objects are built using the simpler > objects. Dia and (what I understand of) Minecraft come to mind. You might be thinking of Smalltalk. -spc
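The SGR sequences listed above can be generated mechanically. A small Python sketch (mine, not from the thread) that builds the ECMA-48 / ISO/IEC 6429 escape sequences quoted in the message; whether each attribute actually renders is entirely up to the terminal -- 53 (overline) in particular is rarely honoured:

```python
# Build the ^[[<n>m sequences from the message (ECMA-48 SGR).
# ESC is the single character 27 that "^[" denotes above.
ESC = "\x1b"

def sgr(n: int) -> str:
    """Return the SGR escape sequence for parameter n."""
    return f"{ESC}[{n}m"

# Parameter numbers exactly as listed in the message.
styles = {"bold": 1, "italic": 3, "underline": 4,
          "strike through": 9, "overline": 53}

for name, n in styles.items():
    # Wrap a sample word in the attribute, then reset with SGR 0 ("normal").
    print(f"{sgr(n)}{name}{sgr(0)}")
```

On a capable terminal each line prints its own name in the named style.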
Re: Text encoding Babel. Was Re: George Keremedjiev
I have long wondered if there are computer languages that aren't rooted in English / ASCII. I feel like it's rather pompous to assume that all programming languages are rooted in English / ASCII. I would hope that there are programming languages that are more specific to the region of the world they were developed in. As such, I would expect that they would be stored in something other than ASCII. On Tue, 27 Nov 2018, William Donzelli via cctalk wrote: APL. APL requires adding additional characters. That was a major obstacle to acceptance, both in terms of keyboard and type ball (my use preceded CRT) and in asking the user/programmer to learn new characters. I loved APL! I love the use of an arrow for assignment. In teaching, a student's FIRST encounter with programming can be daunting. Use of an equal sign immediately runs up against the long-ingrained concept of commutative equality. You would be surprised how many first time students try to say 3 = X . Then, of course, N = 1 N = N + 1 is a mathematical "proof by induction" that all numbers are equal! (Don't let a mathematician see that, or the universe will cease to exist, and be replaced by something even more inexplicable!) Even the archaic keyword "LET" in BASIC helped clarify that. We tend to be dismissive of such problems, declaring that students "need to LEARN the right way". I remember a cartoon in a publication that might have been Interface Age, where an archeologist looking at hieroglyphics says that it looks like a subset of APL. But, I think that the comment was more in regards to programming by non-English speaking programmers. While FORTRAN, COBOL, BASIC can be almost trivially adapted to Spanish, Italian, German, etc., what about Chinese? Japanese? Yes, there IS a Chinese COBOL! But, THOSE programmers essentially have to learn English before they can program! Surely a Chinese or Japanese based programming language could be developed. -- Grumpy Ol' Fred ci...@xenosoft.com
Re: Text encoding Babel. Was Re: George Keremedjiev
On 2018-11-27 8:33 PM, Grant Taylor via cctalk wrote: > ... >> Bold or italic or underlined text shouldn't be a second class concept, >> they have meaning that can be lost when text is conveyed in >> circa-1868-plain-text. I've read many letters that predate the >> invention of the typewriter, emphasis is often conveyed using >> underlines or darkened letters. > > I don't think of bold or italic or underline as second class concepts. I > tend to think of the following attributes that can be applied to text: > > · bold > · italic > · overline > · strike through > · underline > · superscript exclusive or subscript > · uppercase exclusive or lowercase > · opposing case > · normal (none of the above) > This covers only a small fraction of the Latin-centric typographic palette - much of which has existed for 500 years in print (non-Latin much older). Computerisation has only impoverished that palette, and this is how it happens: Checklists instead of research. Work with typographers when trying to represent typography in a computer. The late Hermann Zapf was Knuth's close friend. That's the kind of expertise you need on your team. --Toby > I don't think that normal is superior to the other four (five) in any > way. I do think that normal does occur VASTLY more frequently than the > any combination of the others. As such normal is what things default to > as an optimization. IMHO that optimization does not relegate the other > styles to second class. > ...
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/27/2018 04:43 PM, Keelan Lightfoot via cctalk wrote: I'm a bit dense for weighing in on this as my first post, but what the heck. Welcome. :-) Our problem isn't ASCII or Unicode, our problem is how we use computers. Okay. Going back in time a bit, the first keyboards only recorded letters and spaces, even line breaks required manual intervention. As things developed, we upgraded our input capabilities a little bit (return keys! delete keys! arrow keys!), but then, some time before graphical displays came along, we stopped upgrading. We stopped increasing the capabilities of our input, and instead focused on kludges to make them do more. Do you think that we stopped enhancing the user input experience more because we were content with what we had or because we didn't see a better way to do what we wanted to do? We created markup languages, modifier keys, and page description languages, all because our input devices and display devices lacked the ability to comprehend anything more than letters. Now we're in a position where we have computers with rich displays bolted to a keyboard that has remained unchanged for 150 years. Hum Unpopular opinion time: Markup languages are a kludge, relying on plain text to describe higher level concepts. I agree that markup languages are a kludge. But I don't know that they require plain text to describe higher level concepts. I see no reason that we can't have new control codes to convey new concepts if they are needed. Aside: ASCII did what it needed to do at the time. Times are different now. We may need more / new / different control codes. By control codes, I'm meaning a specific binary sequence that means a specific thing. I think it needs to be standardized to be compatible with other things -or- it needs to be considered local and proprietary to an application. TeX has held us back. It's a crutch so religiously embraced by the people that make our software that the concept of markup has come to be accepted "the way". 
I worked with some university students recently, who wasted a ridiculous amount of time learning to use LaTeX to document their projects. Many of them didn't even know that page layout software existed, they thought there was this broad valley in capabilities with TeX on one side, and Microsoft Word on the other. They didn't realize that there is a whole world of purpose built tools in between. I actually wonder how much need there is for /all/ of those utilities. I expect that things should have streamlined and simplified, at least some, in the last 30 years. Rather than working on developing and furthering our input capabilities, we've been focused on keeping them the same. Markup languages aren't the solution. They are a clumsy bridge between 150 year old input technology and modern display capabilities. What would you like to do or see done differently? Even if it turns out to be worse, it would still be something different and likely worth trying at least once. Bold or italic or underlined text shouldn't be a second class concept, they have meaning that can be lost when text is conveyed in circa-1868-plain-text. I've read many letters that predate the invention of the typewriter, emphasis is often conveyed using underlines or darkened letters. I don't think of bold or italic or underline as second class concepts. I tend to think of the following attributes that can be applied to text: · bold · italic · overline · strike through · underline · superscript exclusive or subscript · uppercase exclusive or lowercase · opposing case · normal (none of the above) I don't think that normal is superior to the other four (five) in any way. I do think that normal does occur VASTLY more frequently than the any combination of the others. As such normal is what things default to as an optimization. IMHO that optimization does not relegate the other styles to second class. 
We've drawn this arbitrary line in the sand, where only letters that can be typed on a typewriter are "text", Everything else is fluff that has been arbitrarily decided to convey no meaning. I don't agree that the decision was made (by most people). At least not consciously. I will say that some people probably decided what a minimum viable product is when selling typewriters, and consciously chose to omit the other options. I think it's a safe argument to make that the primary reason we've painted ourselves into this unexpressive corner is because of a dogged insistence that we cling to the keyboard. I see no reason that the keyboard can't have keys / glyphs added to it. I'm personally contemplating adding additional keys (via an add on keyboard) that are programmed to produce additional symbols. I frequently use the following symbols and wish I had keys for easier access to them: ≈, ·, ¢, ©, °, …, —, ≥, ∞, ‽, ≤, µ, ≠, Ω, ½, ¼, ⅓, ¶, ±, ®, §, ¾, ™, ⅔, ¿, ⊕.
Re: Text encoding Babel. Was Re: George Keremedjiev
I'm a bit dense for weighing in on this as my first post, but what the heck. Our problem isn't ASCII or Unicode, our problem is how we use computers. Going back in time a bit, the first keyboards only recorded letters and spaces, even line breaks required manual intervention. As things developed, we upgraded our input capabilities a little bit (return keys! delete keys! arrow keys!), but then, some time before graphical displays came along, we stopped upgrading. We stopped increasing the capabilities of our input, and instead focused on kludges to make them do more. We created markup languages, modifier keys, and page description languages, all because our input devices and display devices lacked the ability to comprehend anything more than letters. Now we're in a position where we have computers with rich displays bolted to a keyboard that has remained unchanged for 150 years. Unpopular opinion time: Markup languages are a kludge, relying on plain text to describe higher level concepts. TeX has held us back. It's a crutch so religiously embraced by the people that make our software that the concept of markup has come to be accepted "the way". I worked with some university students recently, who wasted a ridiculous amount of time learning to use LaTeX to document their projects. Many of them didn't even know that page layout software existed, they thought there was this broad valley in capabilities with TeX on one side, and Microsoft Word on the other. They didn't realize that there is a whole world of purpose built tools in between. Rather than working on developing and furthering our input capabilities, we've been focused on keeping them the same. Markup languages aren't the solution. They are a clumsy bridge between 150 year old input technology and modern display capabilities. Bold or italic or underlined text shouldn't be a second class concept, they have meaning that can be lost when text is conveyed in circa-1868-plain-text. 
I've read many letters that predate the invention of the typewriter; emphasis is often conveyed using underlines or darkened letters. We've drawn this arbitrary line in the sand, where only letters that can be typed on a typewriter are "text"; everything else is fluff that has been arbitrarily decided to convey no meaning. I think it's a safe argument to make that the primary reason we've painted ourselves into this unexpressive corner is because of a dogged insistence that we cling to the keyboard. I like the C comment example; Why do I need to call out a comment with a special sequence of letters? Why can't a comment exist as a comment? Why is a comment a second class concept? When I take notes in the margin, I don't explicitly need to call them out as notes. This extends to strings: why do I need to use quotes? I know it's a string; why can't the computer remember that too? Why do I have to use the capabilities of a typewriter to describe that to the computer? There seems to be confusion that computers are inherently text based. They are only that way because we program them and use them that way, and because we've done it the same way since the day of the teletype, and it's _how it's done._ "Classic" Macs are a great example of breaking this pattern. There was no way to force the computer into a text mode of operating, it didn't exist. Right down to the core the operating system was graphical. When you click an icon, the computer doesn't issue a text command, it doesn't call a function by name, it merely alters the flow of some binary stuff flowing through the CPU in response to some other bits changing. Yes, the program describing that was written in text, but that text is not what the computer is interpreting. I'm getting a bit philosophical, so I'll shut up now, but it's an interesting discussion. - Keelan
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/27/2018 12:47 PM, Grant Taylor via cctalk wrote: ASCII is a common way of encoding characters and control codes in the same binary pattern. File formats are what collections of ASCII characters / control codes mean / do. It also was designed for hard copy. Overstrikes don't work well on a CRT screen. Ben.
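Ben's overstrike point deserves a concrete illustration. On a printing terminal, character + backspace + character lands both glyphs in the same cell, which is how nroff-era software produced bold and underline on hard copy. A small sketch (my own, as an illustration):

```python
# Hardcopy-era overstriking: BS (0x08) backs the carriage up one cell,
# so the next glyph strikes over the previous one.
BS = "\b"

def overstrike_bold(text: str) -> str:
    """Print each glyph twice in the same cell (nroff-style bold)."""
    return "".join(c + BS + c for c in text)

def overstrike_underline(text: str) -> str:
    """Strike an underscore, back up, then print the glyph."""
    return "".join("_" + BS + c for c in text)

# On paper these look bold/underlined; on a CRT the glyphs just
# replace each other, which is exactly Ben's complaint.
print(repr(overstrike_bold("hi")))
print(repr(overstrike_underline("hi")))
```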
Re: Text encoding Babel. Was Re: George Keremedjiev
> I have long wondered if there are computer languages that aren't rooted > in English / ASCII. I feel like it's rather pompous to assume that all > programming languages are rooted in English / ASCII. I would hope that > there are programming languages that are more specific to the region of > the world they were developed in. As such, I would expect that they > would be stored in something other than ASCII. APL. -- Will
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/27/2018 03:05 AM, Guy Dunphy wrote: It was a core of the underlying philosophy, that html would NOT allow any kind of fixed formatting. The reasoning was that it could be displayed on any kind of system, so had to be free-format and quite abstract. That's one of the reasons that I like HTML as much as I do. Which is great, until you actually want to represent a real printed page, or book. Like Postscript can. Thus html was doomed to be inadequate for capture of printed works. I feel like trying to accurately represent fixed page layout in HTML is a questionable idea. I would think that it would be better to use a different type of file. That was a disaster. There wasn't any real reason it could not be both. Just an academic's insistence on enforcing his ideology. Then of course, over time html has morphed to include SOME forms of absolute layout, because there was a real demand for that. But the result is a hodge-podge. I don't think that HTML can reproduce fixed page layout like PostScript and PDF can. It can make a close approximation. But I don't think HTML can get there. Nor do I think it should. Yes, it should be capable of that. But not enforce 'only that way'. I question if people are choosing to use HTML to store documentation because it's so popular and then getting upset when they want to do things that HTML is not meant to do. Or, in some cases, is actually meant /not/ to do. Use the tool for the job. Don't alter the wrong tool for your particular job. IMHO true page layout doesn't belong in HTML. Loosely laying out the same content in approximately the same layout is okay. By 'html' I mean the kludge of html-css-js. The three-cat herd. (Ignoring all the _other_ web cats.) Now it's way too late to fix it properly with patches. I don't agree with that. HTML (and XML) has markup that can be used, and changed, to define how the HTML is meant to be interpreted. 
The fact that people don't do so correctly is mostly independent of the fact that it has the ability. I say mostly because there is some small amount of wiggle room for discussion of does the functionality actually work or not. I meant there's no point trying to determine why they were so deluded, and failed to recognise that maybe some users (Ed) would want to just type two spaces. I /do/ believe that there /is/ a point in trying to understand why someone did what they did. now 'we' (the world) are stuck with it for legacy compatibility reasons. Our need to be able to read it does not translate to our need to continue to use it. Any extensions have to be retro-compatible. I disagree. I see zero reason why we couldn't come up with something new and completely different. Granted, there should be ways to translate from one to the other. Much like how ASCII and EBCDIC are still in use today. What I'm talking about is not that. It's about how to create a coding scheme that serves ALL the needs we are now aware of. (Just one of which is for old ASCII files to still make sense.) This involves both re-definition of some of the ASCII control codes, AND defining sequential structure standards. For eg UTF-8 is a sequential structure. So are all the html and css codings, all programming languages, etc. There's a continuum of encoding...structure...syntax. The ASCII standard didn't really consider that continuum. I don't think that ASCII was even trying to answer / solve the problems that you're talking about. ASCII was a solution for a different problem for a different time. There is no reason we can't move on to something else. Which exceptions would those be? (That weren't built on top of ASCII!) It is subject to the meaning of "back to the roots" and not worth taking more time. I assume you're thinking that ASCII serves just fine for program source code? I'm not personally aware of any cases where ASCII limits programming languages. 
But my ignorance does not preclude that situation from existing. I do believe that there are a number of niche programming languages (if you will) that store things as binary data (I'm thinking PLCs and the likes) but occasionally have said data represented (as a hexadecimal dump) in ASCII. But the fact that ASCII can or can't easily display the data is immaterial to the system being programmed. I have long wondered if there are computer languages that aren't rooted in English / ASCII. I feel like it's rather pompous to assume that all programming languages are rooted in English / ASCII. I would hope that there are programming languages that are more specific to the region of the world they were developed in. As such, I would expect that they would be stored in something other than ASCII. Could the sequence of bytes be displayed as ASCII? Sure. Would it make much sense? Not likely. This is a bandwagon/normalcy bias effect. "Everyone does it that way and always has, so it must be
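The "binary data represented as a hexadecimal dump in ASCII" idea above is worth making concrete: the bytes themselves were never ASCII, yet the *rendering* of them is pure ASCII text. A minimal hex-dump sketch (my own; the sample bytes are purely illustrative, not a real PLC image):

```python
# Render arbitrary binary data as an ASCII hex dump.
def hexdump(data: bytes, width: int = 8) -> str:
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexpart = " ".join(f"{b:02X}" for b in chunk)
        # Printable ASCII is 0x20..0x7E; anything else shows as '.'.
        text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        lines.append(f"{off:04X}  {hexpart:<{width * 3}} {text}")
    return "\n".join(lines)

print(hexdump(b"\x01\x02ABC\xff"))
```

The dump is readable on anything that speaks ASCII, even though the underlying program never was.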
Re: Text encoding Babel. Was Re: George Keremedjiev
On Tue, Nov 27, 2018 at 01:21:52AM +1100, Guy Dunphy via cctalk wrote: [...] > Oh yes, tell me about the html 'there is no such thing as hard formatting and > you can't have any even when you want it' concept. Thank you Tim Berners Lee. Sure you can! Pick one of: a) If you're not using HTML features, don't bother wrapping the text in an HTML document. Just serve up a bog standard text/plain document with all of your favourite ASCII art and hard formatting as you please. b) Go old-school and use HTML <pre>. c) Go lah-di-dah new-school and use the CSS white-space: property to fine-tune the exact formatting behaviour you desire.
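The difference between options (a)/(b) and the HTML default can be modelled in a few lines. This is a deliberately simplified toy model (my own, not a real CSS engine): `normal` collapses whitespace runs the way HTML text does by default, while `pre` keeps them verbatim, which is why `<pre>` and `white-space:` preserve hard formatting:

```python
import re

def render(text: str, white_space: str = "normal") -> str:
    """Toy model of CSS white-space handling (grossly simplified)."""
    if white_space == "pre":
        return text  # every space and newline kept, as in <pre>
    # Default HTML behaviour: runs of whitespace collapse to one space.
    return re.sub(r"\s+", " ", text).strip()

art = "o  o\n \\/"
print(render(art, "pre"))     # ASCII art survives
print(render(art, "normal"))  # ASCII art is flattened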
Re: Text encoding Babel. Was Re: George Keremedjiev
On Mon, 26 Nov 2018 at 15:21, Guy Dunphy via cctalk wrote: > Defects in the ASCII code table. This was a great improvement at the time, > but fails to implement several utterly essential concepts. The lack of these > concepts in the character coding scheme underlying virtually all information > processing since the 1960s, was unfortunate. Just one (of many) bad > consequences has been the proliferation of 'patch-up' text coding schemes > such as proprietary document formats (MS Word for eg), postscript, pdf, html > (and its even more nutty academia-gone-mad variants like XML), UTF-8, unicode > and so on. This is fascinating stuff and I am very interested to see how it comes out, but I think there is a problem here which I wanted to highlight. The thing is this. You seem to be discussing what you perceive as _general_ defects in ASCII, but they are I think not _general_ defects. They are specific to your purpose, and I don't know what that is exactly, but I have a feeling it is not a general overall universal goal. Just consider what "A.S.C.I.I." stands for. [1] it's American. Yes it has lots of issues internationally, but it does the job well for American English. As a native English speaker I rue the absence of £, and the fact that Americans are so unfamiliar with the symbol that they even appropriated its name for the unrelated #, which already had a perfectly good name of its own. But ASCII is American and Americans don't use £. Fine. [2] The "I.I." bit. Historical accidents aside, vestigial traces of specific obsolete hardware implementations, it's _not a markup language_. Its function is unrelated to those of HTML or XML or anything like that. It's for "information interchange". That means from computer or program to other computer or program. It's an encoding and that's all. We needed a standard one. We got it. It has flaws, many flaws, but it worked. No it doesn't contain æ and å and ä and ø and ö. That's a problem for Scandinavians. 
It doesn't contain š and č and ṡ and ý (among others) and that's a problem for Roman-alphabet-using Slavs. Even broadening the discussion to 8-bit ANSI... It does have a very poor way of encoding é and à and so on, which indicates the relative importance of Latin-language users in the Americas, compared to Slavs and so on. But markup languages, formatting, control signalling, all that sort of stuff is a separate discussion to encoding standards. Attempt to bring them into encoding systems and the problem explodes in complexity and becomes insoluble. Additionally, it makes a bit of a mockery of OSes focussed on raw text streams, such as Unix, and whereas I am no great lover of Unix, it does provide me with a job, and fewer headaches than Windows. So, overall, all I wanted to say was: identify the problem domain specifically and how to separate that from other, *overlapping* domains before attacking ASCII for weaknesses that are not actually weaknesses at all but indeed strengths for a lot of its use-cases. Saying that, I'd really like to read more about this project. It looks like it peripherally intersects with one of my own big ones. -- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
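The £ sign Liam mentions is a tidy demonstration of the encoding layers under discussion: absent from 7-bit ASCII, a single byte in 8-bit Latin-1, two bytes in UTF-8. A quick check (my own illustration):

```python
# The pound sign across three encodings discussed in the thread.
pound = "\u00a3"  # £

try:
    pound.encode("ascii")
except UnicodeEncodeError:
    print("no code for £ in 7-bit ASCII")

print(pound.encode("latin-1"))  # one byte in ISO 8859-1
print(pound.encode("utf-8"))    # two bytes in UTF-8
```

Latin-1 gives `b'\xa3'` and UTF-8 gives `b'\xc2\xa3'`; the ASCII attempt raises, since no 7-bit code point exists for it.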
Re: Text encoding Babel. Was Re: George Keremedjiev
On Mon, 26 Nov 2018 at 23:39, Christian Gauger-Cosgrove wrote: > > On Mon, 26 Nov 2018 at 03:44, Liam Proven via cctalk > wrote: > > If it's in Roman, Cyrillic, or Greek, they're alphabets, so it's a letter. > > > Correct, Latin, Greek, and Cyrillic are alphabets, so each > letter/character can be a consonant or vowel. > > > I can't read Arabic or Hebrew but I believe they're alphabets too. > > > Hebrew, Arabic, Syriac, Punic, Aramaic, Ugaritic, et cetera are > abjads, meaning that each character represents a consonant sound, > vowel sounds are either derived from context and knowledge of the > language, or can be added in via diacritics. > > Devanagari and Thai (and Tibetan, Khmer, Sundanese, Balinese...) are > abugidas, where each character is a consonant-vowel pair, with the > "base" character being one particular vowel sound, and alternates > being indicated by modifications (example in Devanagari: "क" is "ka", > while "कि" is "ki"; another example using Canadian Aboriginal > Syllabics "ᕓ" is "vai" whereas "ᕗ" is "vu"). > > > I don't know anything about any Asian scripts except a tiny bit of > > Japanese and Chinese, and they get called different things, but > > "character" is probably most common. > > > Japanese actually uses three different scripts. Chinese characters > (the kanji script of Japanese, and the hanja script of Korean) are > logograms. > > Japanese also has two syllabic scripts, katakana and hiragana where > each character represents a specific consonant vowel pair. > > Korean hangul (or if you happen to be from the DPRK, chosŏn'gŭl) is a > mix of alphabet and syllabary, where individual characters consist of > sub parts stacked in a specific pattern. Stealing Wikipedia's example, > "kkulbeol" is written as "꿀벌", not the individual parts "ㄲㅜㄹㅂㅓㄹ". 
> > > And now for even more fun, Egyptian hieroglyphics and cuneiform (which > started with Sumerian, and then used by the Assyrians/Babylonians and > others) are a delightful mix of logographic, syllabic and alphabetic > characters. Because while China loathes you, Babylon has a truly deep > hatred of you and wishes to revel in your suffering. Um. Yes. Thank you for that. Very informative, interesting, and I did actually know most of it already but maybe others didn't. The thing is that it's not actually very germane to the question I was addressing, which was "what do you call the individual units in different scripts?" I.e. "letter" vs "glyph" vs "character" vs "ideogram" vs "grapheme", etc... :-) -- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
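Christian's Hangul example can be checked mechanically: under Unicode normalization form NFD, each precomposed syllable decomposes into its conjoining jamo parts, so the "stacked sub parts" he describes are recoverable from the composed form. A quick sketch (mine):

```python
import unicodedata

word = "꿀벌"  # two precomposed Hangul syllables, "kkulbeol"

# NFD splits each syllable into its conjoining jamo (initial/vowel/final).
jamo = unicodedata.normalize("NFD", word)
print(len(word), "syllables ->", len(jamo), "jamo")

# NFC recomposes them back into the original syllables.
print(unicodedata.normalize("NFC", jamo) == word)
```

The two syllables decompose into six jamo and recompose losslessly, which is why Unicode can treat Hangul as both an alphabet and a syllabary.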
Re: Text encoding Babel. Was Re: George Keremedjiev
Oh yes, tell me about the html 'there is no such thing as hard formatting and you can't have any even when you want it' concept. Thank you Tim Berners Lee. I've not delved too deeply into the lack of hard formatting in HTML. The HTML <pre> . . . </pre> tag helps a bit. Before I found THAT, I was having serious difficulties with too much of what I tried to do with HTML. Obvious examples include ASCII art, but also program source code. I should NOT have to create a "table" for that, nor have difficulty having a string literal in code that contains varying numbers of space characters! For some reason, a few decades ago, I had substantial difficulty finding out about the existence of the <pre> tag, and at that time, did not find the tag, nor CSS. Now, it seems to be pretty easy to find. -- Grumpy Ol' Fred ci...@xenosoft.com
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/26/18 7:21 AM, Guy Dunphy wrote: I was speaking poetically. Perhaps "the mail software he uses was written by morons" is clearer. ;-) Oh yes, tell me about the html 'there is no such thing as hard formatting and you can't have any even when you want it' concept. Thank you Tim Berners Lee. I've not delved too deeply into the lack of hard formatting in HTML. I've also always considered HTML to be what you want displayed, with minimal information about how you want it displayed. IMHO CSS helps significantly with the latter part. http://everist.org/NobLog/20130904_Retarded_ideas_in_comp_sci.htm http://everist.org/NobLog/20140427_p-term_is_retarded.htm Intriguing. $readingList++. Except that 'non-breaking space' is mostly about inhibiting line wrap at that word gap. I wouldn't have thought "mostly" or "inhibiting line wrap". I view the non-breaking space as a way to glue two parts of text together and treat them as one unit, particularly for display and partially for selection. Granted, much of the breaking is done when the text cannot continue (in its natural direction), frequently needing to start anew on the next line. But anyway, there's little point trying to psychoanalyze the writers of that software. Probably involved pointy-headed bosses. I like to understand why things have been done the way they were. Hopefully I can learn from the reasons. Of course not. It was for American English only. This is one of the major points of failure in the history of information processing. Looking backwards, (I think) I can understand why you say that. But based on my (possibly limited) understanding of the time, I think that ASCII was one of the primordial building blocks that was necessary. It was a standard (one of many emerging standards of the time) that allowed computers from different manufacturers to interoperate and represent characters with the same binary pattern. 
Something that we now (mostly) take for granted and something that could not be assured at the time or before. Containing extended Unicode character sets via UTF-8, doesn't make it a non-hard-formatted medium. In ASCII a space is a space, and multi-spaces DON'T collapse. White space collapse is a feature of html, and whether an email is html or not is determined by the sending utility. Having read the rest of your email and now replying, I feel that we may be talking about two different things. One being ASCII's standard definition of how to represent different letters / glyphs in a consistent binary pattern. The other being how information is stored in an (un)structured sequence of ASCII characters. As you see, this IS NOT HTML, since those extra spaces and your diagram below would have collapsed if it was html. Also saving it as text and opening in a plain text ed or hex editor absolutely reveals what it is. I feel it is important to acknowledge your point and to state that I'm moving on. Hmm... the problem is it's intended to be serious, but is still far from exposure-ready. So if I talk about it now, I risk having specific terms I've coined in the doco (including the project name) getting meme-jammed or trademarked by others. The plan is to release it all in one go, eventually. Definitely will be years before that happens, if ever. Fair enough. However, here's a cut-n-paste (in plain text) of a section of the Introduction (html with diags.) ACK -- Almost always, a first attempt at some unfamiliar, complex task produces a less than optimal result. Only with the knowledge gained from actually doing a new thing, can one look back and see the mistakes made. It usually takes at least one more cycle of doing it over from scratch to produce something that is optimal for the needs of the situation. Sometimes, especially where deep and subtle conceptual innovations are involved, it takes many iterations. 
Part way through the first large (for me at the time) project that I worked on, I decided that the project (and likely others) needed three versions before being production ready:

1) First whack at solving the problem. LOTS about the problem is learned, including the true requirements and the unknown dependencies along the way. This will not be the final shipping version. - Think of this as the Alpha release.

2) This is a complete re-write of the project based on what was learned in #1. - Think of this as the Beta release.

3) This is less of a re-write and more of a bug fix for version 2. - Think of this as the shipping release.

Human development of computing science (including information coding schemes) has been effectively a 'first time effort', since we kept on developing new stuff built on top of earlier work. We almost never went back to the roots and rebuilt everything, applying insights gained from the many mistakes made. With few notable (partial) exceptions, I largely agree. In reviewing the evolution of
A modest side project : redefining text encoding (Was: Text encoding Babel. Was Re: George Keremedjiev
On Tue, 27 Nov 2018, Guy Dunphy via cctalk wrote: Hmm... the problem is it's intended to be serious, but is still far from exposure-ready. So if I talk about it now, I risk having specific terms I've coined in the doco (including the project name) getting meme-jammed or trademarked by others. The plan is to release it all in one go, eventually. Definitely will be years before that happens, if ever. However, here's a cut-n-paste (in plain text) of a section of the Introduction (html with diags.) Without pushing too hard to get you to reveal more than you are comfortable with, I really like what you wrote, and hope that someday we can participate in some aspects. I would like to see some acknowledgement that some things are truly flaws in the original design, whereas some others are ideas for further expansion and enhancement. It's probably not going to be possible to objectively differentiate which are which. And, as typified by Intel X86 V Motorola 68000, incremental kludges permit compatability and trivial ease of migration, whereas a design from scratch permits correcting aspects that would otherwise be stuck, at the expense of massive software re-creation.
Re: Text encoding Babel. Was Re: George Keremedjiev
On Mon, 26 Nov 2018 at 03:44, Liam Proven via cctalk wrote: > If it's in Roman, Cyrillic, or Greek, they're alphabets, so it's a letter. > Correct, Latin, Greek, and Cyrillic are alphabets, so each letter/character can be a consonant or vowel. > I can't read Arabic or Hebrew but I believe they're alphabets too. > Hebrew, Arabic, Syriac, Punic, Aramaic, Ugaritic, et cetera are abjads, meaning that each character represents a consonant sound; vowel sounds are either derived from context and knowledge of the language, or can be added in via diacritics. Devanagari and Thai (and Tibetan, Khmer, Sundanese, Balinese...) are abugidas, where each character is a consonant-vowel pair, with the "base" character being one particular vowel sound, and alternates being indicated by modifications (example in Devanagari: "क" is "ka", while "कि" is "ki"; another example using Canadian Aboriginal Syllabics "ᕓ" is "vai" whereas "ᕗ" is "vu"). > I don't know anything about any Asian scripts except a tiny bit of > Japanese and Chinese, and they get called different things, but > "character" is probably most common. > Japanese actually uses three different scripts. Chinese characters (the kanji script of Japanese, and the hanja script of Korean) are logograms. Japanese also has two syllabic scripts, katakana and hiragana, where each character represents a specific consonant-vowel pair. Korean hangul (or if you happen to be from the DPRK, chosŏn'gŭl) is a mix of alphabet and syllabary, where individual characters consist of sub-parts stacked in a specific pattern. Stealing Wikipedia's example, "kkulbeol" is written as "꿀벌", not the individual parts "ㄲㅜㄹㅂㅓㄹ". And now for even more fun, Egyptian hieroglyphics and cuneiform (which started with Sumerian, and then used by the Assyrians/Babylonians and others) are a delightful mix of logographic, syllabic and alphabetic characters. Because while China loathes you, Babylon has a truly deep hatred of you and wishes to revel in your suffering. 
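As an aside, Python's standard `unicodedata` module makes these script categories concrete. A small sketch (the particular character choices are mine, taken from the examples above):

```python
import unicodedata

# An abugida in practice: Devanagari "ki" is the consonant letter KA plus a
# dependent vowel sign, i.e. two code points rendered as one glyph.
for ch in "\u0915\u093F":  # "कि"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0915 DEVANAGARI LETTER KA
# U+093F DEVANAGARI VOWEL SIGN I

# Hangul syllables decompose into their stacked jamo parts under NFD:
kkul = "\uAFC0"  # the first syllable of "kkulbeol"
jamo = unicodedata.normalize("NFD", kkul)
print(len(jamo))  # 3 conjoining jamo for the one precomposed syllable
```

(Note that NFD yields the *conjoining* jamo block, not the standalone compatibility jamo "ㄲㅜㄹ" quoted above; they are distinct code points for the same letters.)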
Regards, Christian -- Christian M. Gauger-Cosgrove STCKON08DS0 Contact information available upon request.
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/26/2018 9:26 AM, Charles Anthony via cctalk wrote: On Mon, Nov 26, 2018 at 4:28 AM Peter Corlett via cctalk < cctalk@classiccmp.org> wrote: On Sun, Nov 25, 2018 at 07:59:13PM -0800, Fred Cisin via cctalk wrote: [...] Alas, "current" computers use 8, 16, 32. They totally fail to understand the intrinsic benefits of 9, 12, 18, 24, and 36 bits. Oh go on then, I'm curious. What are the benefits? Is it just that there are useful prime factors for bit-packing hacks? And if so, why not 30? As I understand it, 36 bits was used as it could represent a signed 10 digit decimal number in binary; the Friden 10 digit calculator was the "gold standard" of banking and financial institutions, so to compete in that market, your computer had to be able to match the arithmetic standards. -- Charles I say 20 bits needs to be used more often. Did anything really use all the control codes in ASCII? Back then you got what the TTY printed. Did anyone ever come up with a character set for ALGOL? Ben.
Re: Windows Accessibility Settings. RE: George Keremedjiev
On 11/26/2018 02:53 PM, Dave Wade via cctalk wrote: Just in case anyone isn't aware, and who gets duplicate characters input because they have some un-steadiness, and are using a Windows/10 PC (I think 7 as well) there are some options in the "Ease of Access" settings "Filter Keys" settings => "bounce keys" that may help with your typing. These set a configurable delay that will ignore repeated keypresses for a very short period of time. The default is 0.5 of a second, but it's configurable. You need to enable "Filter Keys" to see the "Bounce Keys" option. There is also a "slow keys" option. I've found that there are a number of features that land under accessibility / ease of access settings that can make the computer quite a bit nicer. So, if you've ever thought that "I don't need anything under 'Accessibility' or 'Ease of Access' settings." you may be missing out. Go check. I'm extensively using these assistants for a number of things, not the least of which is I'm lazy and I want my iPhone to auto-correct scsi to SCSI, or pppoe to PPPoE, or shruggie to ¯\_(ツ)_/¯, or ... to …, or … or … or I hope this helps, and I am sorry if you knew this already and it doesn't I think it's always good to share neat ~> helpful features with others. Especially if it's done in the positive sense of "this is really cool" and not negative "oh, you need some help, go look here." -- Grant. . . . unix || die
Windows Accessibility Settings. RE: George Keremedjiev
Just in case anyone isn't aware, and who gets duplicate characters input because they have some un-steadiness, and are using a Windows/10 PC (I think 7 as well) there are some options in the "Ease of Access" settings "Filter Keys" settings => "bounce keys" that may help with your typing. These set a configurable delay that will ignore repeated keypresses for a very short period of time. The default is 0.5 of a second, but it's configurable. You need to enable "Filter Keys" to see the "Bounce Keys" option. There is also a "slow keys" option. I hope this helps, and I am sorry if you knew this already and it doesn't. Dave G4UGM > -Original Message- > From: cctalk On Behalf Of ED SHARPE via > cctalk > Sent: 26 November 2018 18:30 > To: lpro...@gmail.com; cctalk@classiccmp.org > Subject: Re: George Keremedjiev > > i use email i use and suggest you use a delete key. no loss no gain... > > > In a message dated 11/26/2018 11:16:07 AM US Mountain Standard Time, > lpro...@gmail.com writes: > > > On Mon, 26 Nov 2018 at 17:54, ED SHARPE < couryho...@aol.com> wrote: > > > > pay attention it us,probaby my hand which adds,Xtra spaces as stated > > before, please feel free to use the delete key > > Are you saying that you have motor control problems, such as Parkinson's > Disease or something? If so, I am really sorry -- but you have never said that > before, to my recollection. > > But you have never commented to anyone who has asked why you don't > switch to a proper local email client, which would fix the quoting and so on. > Do you not have access to your own computer, or something? If so I am sure > someone could give you a machine, if that would help... > > -- > Liam Proven - Profile: https://about.me/liamproven > Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com > Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven > UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
Re: George Keremedjiev
i use email i use and suggest you use a delete key. no loss no gain... In a message dated 11/26/2018 11:16:07 AM US Mountain Standard Time, lpro...@gmail.com writes: On Mon, 26 Nov 2018 at 17:54, ED SHARPE < couryho...@aol.com> wrote: > > pay attention it us,probaby my hand which adds,Xtra spaces as stated before, > please feel free to use the delete key Are you saying that you have motor control problems, such as Parkinson's Disease or something? If so, I am really sorry -- but you have never said that before, to my recollection. But you have never commented to anyone who has asked why you don't switch to a proper local email client, which would fix the quoting and so on. Do you not have access to your own computer, or something? If so I am sure someone could give you a machine, if that would help... -- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
Re: Text encoding Babel. Was Re: George Keremedjiev
On Mon, Nov 26, 2018 at 4:28 AM Peter Corlett via cctalk < cctalk@classiccmp.org> wrote: > On Sun, Nov 25, 2018 at 07:59:13PM -0800, Fred Cisin via cctalk wrote: > [...] > > Alas, "current" computers use 8, 16, 32. They totally fail to understand > the > > intrinsic benefits of 9, 12, 18, 24, and 36 bits. > > Oh go on then, I'm curious. What are the benefits? Is it just that there > are > useful prime factors for bit-packing hacks? And if so, why not 30? > > As I understand it, 36 bits was used as it could represent a signed 10 digit decimal number in binary; the Friden 10 digit calculator was the "gold standard" of banking and financial institutions, so to compete in that market, your computer had to be able to match the arithmetic standards. -- Charles -- X-Clacks-Overhead: GNU Terry Pratchett
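Charles's arithmetic checks out; a one-line sketch of the claim (variable names are mine):

```python
# A signed 10-digit decimal must represent magnitudes up to 9,999,999,999.
# A 36-bit word gives one sign bit plus 35 magnitude bits.
largest_10_digit = 10**10 - 1      # 9_999_999_999
magnitude_bits = 35                # 36 bits minus the sign bit
assert largest_10_digit < 2**magnitude_bits   # 2**35 = 34_359_738_368, fits
# A 32-bit word would not do: 2**31 - 1 is only 2_147_483_647,
# far short of 9_999_999_999.
print(f"2**35 = {2**35:,} > {largest_10_digit:,}")
```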
Re: Text encoding Babel. Was Re: George Keremedjiev
At 10:52 PM 25/11/2018 -0700, you wrote: >> Then adds a plain ASCII space 0x20 just to be sure. > >I don't think it's adding a plain ASCII space 0x20 just to be sure. >Looking at the source of the message, I see =C2=A0, which is the UTF-8 >representation followed by the space. My MUA that understands UTF-8 >shows that "=C2=A0 " translates to " ". Further, "=C2=A0 =C2=A0" >translates to " ". I was speaking poetically. Perhaps "the mail software he uses was written by morons" is clearer. >Some of the reading that I did indicates that many things, HTML >included, use white space compaction (by default), which means that >multiple white space characters are reduced to a single white space >character. Oh yes, tell me about the html 'there is no such thing as hard formatting and you can't have any even when you want it' concept. Thank you Tim Berners Lee. http://everist.org/NobLog/20130904_Retarded_ideas_in_comp_sci.htm http://everist.org/NobLog/20140427_p-term_is_retarded.htm > So, when Ed wants multiple white spaces, his MUA has to do >something to state that two consecutive spaces can't be compacted. >Hence the non-breaking space. Except that 'non-breaking space' is mostly about inhibiting line wrap at that word gap. But anyway, there's little point trying to psychoanalyze the writers of that software. Probably involved pointy-headed bosses. >As stated in another reply, I don't think ASCII was ever trying to be >the Babel fish. (Thank you Douglas Adams.) Of course not. It was for American English only. This is one of the major points of failure in the history of information processing. >> Takeaway: Ed, one space is enough. I don't know how you got the idea >> people might miss seeing a single space, and so you need to type two or >> more. > >I wondered if it wasn't a typo or keyboard sensitivity issue. I >remember I had to really slow down the double click speed for my grandpa >(R.I.P.) so that he could use the mouse. 
Maybe some users actuate keys >slowly enough that the computer thinks that it's repeated keys. ¯\_(ツ)_/¯ Well now he's flaunting it in his latest posts. Never mind. :) >> And since plain ASCII is hard-formatted, extra spaces are NOT ignored >> and make for wider spacing between words. > >It seems as if you made an assumption. Just because the underlying >character set is ASCII (per RFC 821 & 822, et al) does not mean that the >data that they are carrying is also ASCII. As is evident by the >Content-Type: header stating the character set of UTF-8. Containing extended Unicode character sets via UTF-8, doesn't make it a non-hard-formatted medium. In ASCII a space is a space, and multi-spaces DON'T collapse. White space collapse is a feature of html, and whether an email is html or not is determined by the sending utility. >Especially when textual white space compression does exactly that, >ignore extra white spaces. > >> Which looks    very odd, even if your mail utility didn't try to >> do something 'special' with your unusual user input. As you see, this IS NOT HTML, since those extra spaces and your diagram below would have collapsed if it was html. Also saving it as text and opening in a plain text ed or hex editor absolutely reveals what it is. >I frequently use multiple spaces with ASCII diagrams. > >+--+ >| This | >| is | >| a | >| box | >+--+ >> Btw, I changed the subject line, because this is a wider topic. I've been >> meaning to start a conversation about the original evolution of ASCII, >> and various extensions. Related to a side project of mine. > >I'm curious to know more about your side project. Hmm... the problem is it's intended to be serious, but is still far from exposure-ready. So if I talk about it now, I risk having specific terms I've coined in the doco (including the project name) getting meme-jammed or trademarked by others. The plan is to release it all in one go, eventually. Definitely will be years before that happens, if ever. 
However, here's a cut-n-paste (in plain text) of a section of the Introduction (html with diags.) -- Almost always, a first attempt at some unfamiliar, complex task produces a less than optimal result. Only with the knowledge gained from actually doing a new thing, can one look back and see the mistakes made. It usually takes at least one more cycle of doing it over from scratch to produce something that is optimal for the needs of the situation. Sometimes, especially where deep and subtle conceptual innovations are involved, it takes many iterations. Human development of computing science (including information coding schemes) has been effectively a 'first time effort', since we kept on developing new stuff built on top of earlier work. We almost never went back to the roots and rebuilt everything, applying insights gained from the many mistakes made. In reviewing the evolution of information coding schemes since very early stages
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sun, Nov 25, 2018 at 03:06:29PM -0800, Chuck Guzis via cctalk wrote: [...] > I routinely get Turkish and Greek spam in my mailbox--and I've gotten > Cyrillic-alphabet stuff as well. I had started to get slightly paranoid about the fact that there was a sudden increase in Dutch-language spam and wondered how they had figured out my physical location. On reflection, it's probably just that I now receive enough legitimate-ish email that the Bayesian filter has adjusted and no longer assumes that the correct response to Dutch text is "dat kan niet" ("that's not possible").
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sun, Nov 25, 2018 at 07:59:13PM -0800, Fred Cisin via cctalk wrote: [...] > Alas, "current" computers use 8, 16, 32. They totally fail to understand the > intrinsic benefits of 9, 12, 18, 24, and 36 bits. Oh go on then, I'm curious. What are the benefits? Is it just that there are useful prime factors for bit-packing hacks? And if so, why not 30?
RE: George Keremedjiev
> > #4 _You_ appear to have some "very old mail program" (to use your own > phrase) because it is screwing up your posting _and_ screwing up double > spaces. > There are no "double spaces" but for some reason he has a space and a UTF-8 non-breaking space next to each other. Most odd... It looks like the mail client is the AOL webmail client. Headers say:- Message-Id: <1674dba424c-1ec3-5...@webjas-vad199.srv.aolmail.net> X-MB-Message-Source: WebUI X-MB-Message-Type: User X-Mailer: JAS DWEB Dave > So it is you causing the problems here, I'm sorry to say. > > -- > Liam Proven - Profile: https://about.me/liamproven > Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com > Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven > UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
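For anyone wanting to do the same forensics, the header fields Dave quotes can be pulled out with Python's standard `email` module. The raw message below is a made-up fragment echoing those header names, not the actual post:

```python
import email

# Hypothetical raw message; header names mirror the ones quoted above.
raw = (
    "Message-Id: <example@webjas-vad199.srv.aolmail.net>\r\n"
    "X-MB-Message-Source: WebUI\r\n"
    "X-Mailer: JAS DWEB\r\n"
    'Content-Type: text/plain; charset="utf-8"\r\n'
    "\r\n"
    "body text\r\n"
)
msg = email.message_from_string(raw)
print(msg["X-Mailer"])            # the client that generated the mail
print(msg.get_content_charset())  # the declared character set, "utf-8"
```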
Re: George Keremedjiev
On Mon, 26 Nov 2018 at 12:17, ED SHARPE via cctalk wrote: > > seems only the very old mail programs do not adapt to all character sets? Maybe so, Ed, but it's basic good manners to both (a) not make your emails unnecessarily difficult for others to read, and (b) respect the etiquette of the forum that you're posting in. You do neither. So, for instance, in your message to which I am replying, you: #1 top-post, against general mailing-list etiquette #2 fail to capitalise the sentence, against basic English rules #3 insert unnecessary double-spaces into "the very", "old mail", "programs do", and "adapt to". *And* #4 _You_ appear to have some "very old mail program" (to use your own phrase) because it is screwing up your posting _and_ screwing up double spaces. So it is you causing the problems here, I'm sorry to say. -- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
Re: Text encoding Babel. Was Re: George Keremedjiev
On Mon, 26 Nov 2018 at 01:00, Grant Taylor via cctalk wrote: > > If they are not seen as separate letters, then do their meanings > change? Or is the different accent more for pronunciation? No, mainly, it changes alphabetical order and it makes asking questions tricky. I see š as an s-with-a-haček and if I forget the haček, I may pronounce it as an s; š = ``sh'' in English. ``č'' = "ch" in English. But that isn't how Czechs think. It's as impossible to misread or mispronounce Š as S as it would be nonsense to mispronounce ``T'' as ``M'' in English, so people find it very hard to guess what I mean. To me, the diacritic modifies a letter, and in a word with 4 or 5 diacritics, they pile up in my head, I overload and may drop one or 2 of them. That renders the word as babel in Czech. (I chose T/M because, incredibly to me, hand-written T in Russian is written as M. Mind you, handwritten almost everything in Russian becomes mMmmmMMmm. I can read printed Cyrillic but I find handwritten stuff impossible.) > I assume that they have different meanings (if that applies to letters) > and are used as different as "A" and "q". Yes. > > Czech is like that. Š and Č and Ž and many more that my Mac can't > > readily type are _extra letters_ which come after the unmodified form > > in the alphabet. > > ~twitch~ Yep. The Scandinavians have just 3 extras. Czech has about a dozen. https://en.wikipedia.org/wiki/Czech_orthography 42 letters (!). > I don't even know how to properly describe something that visually looks > like letters (glyphs?) to me, but may be an imprecise simplification on > my part. If it's in Roman, Cyrillic, or Greek, they're alphabets, so it's a letter. I can't read Arabic or Hebrew but I believe they're alphabets too. I don't know anything about any Asian scripts except a tiny bit of Japanese and Chinese, and they get called different things, but "character" is probably most common. 
> I had to zoom my font to see enough detail in Křižíkova, but it does > look like things came through just like you describe. (They even made > it through my shell script that I use to re-flow text in replies.) Good! -- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
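Liam's point that the haček is, at the encoding level, separable from the base letter can be sketched with Unicode normalization (Python stdlib; the example is mine, not from the thread):

```python
import unicodedata

# Czech "š" is one letter, but under NFD it decomposes into a base "s"
# plus U+030C COMBINING CARON (the haček).
hacek_s = "\u0161"  # š
decomposed = unicodedata.normalize("NFD", hacek_s)
for c in decomposed:
    print(f"U+{ord(c):04X} {unicodedata.name(c)}")

# The two spellings are canonically equivalent but not byte-identical,
# which is exactly why text comparison needs normalization:
assert hacek_s != "s\u030C"
assert unicodedata.normalize("NFC", "s\u030C") == hacek_s
```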
Re: Text encoding Babel. Was Re: George Keremedjiev
Not to beat a dead horse, but I ran across "   " in a text file when read via a web browser this evening and wanted to share my findings as they seemed timely. On 11/22/18 5:55 PM, Guy Dunphy via cctalk wrote: Anyway, I was wondering how Ed's emails (and sometimes others elsewhere) acquired that odd corruption. IMHO it's not corruption as much as it is incompatibility. Answer: Ed's email util … interpret the user typing space twice in succession, as meaning "I really, really want there to be a space here, no matter what." So it inserts a 'no-break space' unicode character, which of course requires a 2-byte UTF-8 encoding. What I'm not sure of is how the 0xC2 0xA0 translates to 0xC3 0xA2 that is the  character. I think that the 0xC2 0xA0 pair is treated as two independent characters. Thus 0xC2 is "Â", and 0xA0 is a non-breaking space. I don't know what happens to the non-breaking space, but the  and the space (0x20) that is after 0xC2 0xA0 (three byte sequence being 0xC2 0xA0 0x20) is included and becomes " " which is what we see in reply text. (Encoded as 0xC3 0x83 0x20.) So, arguably, improperly processed / translated text that results in 0xC3 0x83 0x20 / " " should have been a non-breaking space followed by a space. This jives with both Ed's email and the document that I was reading that prompted this email. Then adds a plain ASCII space 0x20 just to be sure. I don't think it's adding a plain ASCII space 0x20 just to be sure. Looking at the source of the message, I see =C2=A0, which is the UTF-8 representation followed by the space. My MUA that understands UTF-8 shows that "=C2=A0 " translates to " ". Further, "=C2=A0 =C2=A0" translates to " ". Some of the reading that I did indicates that many things, HTML included, use white space compaction (by default), which means that multiple white space characters are reduced to a single white space character. 
So, when Ed wants multiple white spaces, his MUA has to do something to state that two consecutive spaces can't be compacted. Hence the non-breaking space. =C2=A0 quite literally translates to a space character that can't be compacted. Thus "=C2=A0 =C2=A0" is really " " or " ". Multiple successive spaces will need to be a mixture of space and non-breaking space characters. So, the plain ASCII space 0x20 after (or before) =C2=A0 is not there just to be sure. Personally I find it more interesting than annoying. Just another example of the gradual chaotic devolution of ASCII, into a Babel of incompatible encodings. Not that ASCII was all that great in the first place. As stated in another reply, I don't think ASCII was ever trying to be the Babel fish. (Thank you Douglas Adams.) Takeaway: Ed, one space is enough. I don't know how you got the idea people might miss seeing a single space, and so you need to type two or more. I wondered if it wasn't a typo or keyboard sensitivity issue. I remember I had to really slow down the double click speed for my grandpa (R.I.P.) so that he could use the mouse. Maybe some users actuate keys slowly enough that the computer thinks that it's repeated keys. ¯\_(ツ)_/¯ But it isn't so. The normal convention in plain text is one space character between each word. The operative word is "convention", as in commonly accepted but not always the case behavior. ;-) And since plain ASCII is hard-formatted, extra spaces are NOT ignored and make for wider spacing between words. It seems as if you made an assumption. Just because the underlying character set is ASCII (per RFC 821 & 822, et al) does not mean that the data that they are carrying is also ASCII. As is evident by the Content-Type: header stating the character set of UTF-8. Especially when textual white space compression does exactly that, ignores extra white spaces. Which looks    very odd, even if your mail utility didn't try to do something 'special' with your unusual user input. 
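The byte analysis above can be checked mechanically with Python's standard `quopri` module. The "Â" characters seen in replies are what those same bytes look like when misread as Latin-1 instead of UTF-8 (my reconstruction of the failure mode, not a trace of the actual mail path):

```python
import quopri

# "=C2=A0" is quoted-printable for the two-byte UTF-8 encoding of
# U+00A0 NO-BREAK SPACE.
raw = b"=C2=A0 =C2=A0"
decoded = quopri.decodestring(raw)
assert decoded == b"\xc2\xa0 \xc2\xa0"
assert decoded.decode("utf-8") == "\u00a0 \u00a0"  # NBSP, space, NBSP

# The stray A-circumflex appears when the same bytes are mis-read as
# Latin-1: 0xC2 becomes "Â" and 0xA0 becomes a bare NBSP.
mojibake = decoded.decode("latin-1")
assert mojibake == "\u00c2\u00a0 \u00c2\u00a0"  # "Â", NBSP, space, "Â", NBSP
```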
I frequently use multiple spaces with ASCII diagrams.

+------+
| This |
| is   |
| a    |
| box  |
+------+

That will not look like I intended it with white space compression. Btw, I changed the subject line, because this is a wider topic. I've been meaning to start a conversation about the original evolution of ASCII, and various extensions. Related to a side project of mine. I'm curious to know more about your side project. -- Grant. . . . unix || die
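The white-space compaction discussed above, and the reason a non-breaking space survives it, can be sketched in a few lines (the collapse rule below is a simplification of what HTML renderers actually do):

```python
import re

# HTML-style collapsing folds runs of ordinary white space into one space,
# which is what flattens ASCII diagrams. U+00A0 NO-BREAK SPACE is not in the
# collapsed set, so it survives: hence the MUA trick of alternating NBSP and
# space to preserve runs of blanks.
def collapse(text: str) -> str:
    return re.sub(r"[ \t\r\n]+", " ", text)

assert collapse("two  spaces") == "two spaces"              # run is folded
assert collapse("two\u00a0 \u00a0spaces") == "two\u00a0 \u00a0spaces"  # survives
```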
Re: Text encoding Babel. Was Re: George Keremedjiev
Therefore, for use with current computers, 32 bits would be needed. Some games can be played with mixing sizes by doing things like setting high bit, for 128 7 bit characters plus 32768 15 bit characters, and 2147483648 31 bit characters. On Sun, 25 Nov 2018, ben via cctalk wrote: REAL COMPUTERS USE 18 BITS... RUNS BEN. Alas, "current" computers use 8, 16, 32. They totally fail to understand the intrinsic benefits of 9, 12, 18, 24, and 36 bits.
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/25/2018 6:34 PM, Fred Cisin via cctalk wrote: On Mon, 26 Nov 2018, Tomasz Rola via cctalk wrote: To supply this train of thought with some numbers: - my copy of Common Lisp HyperSpec claims 978 symbols (i.e. words) on its alphabetical index; many words have modifiers (a.k.a. keyword options, with default values) which increases the number at least twofold, IMHO, if one agrees that each combo should be counted as different word, to which I would say yes - I have read somewhere that a Japanese pupil after graduating from elementary school is supposed to know 1000 kanjis by heart (there is a standardised set, I have a book) Would those "modifiers of words" qualify as ADJECTIVES? The Japanese phonetic alphabets, Katakana and Hiragana, have 46 letters each, almost twice that with diacritics. I have heard that Japanese Kanji has more than 50,000 words/characters (for which 16 bits would fit, but be a little risky). But, in practice, 1100 to 2000 words comprise most of common usage. Wikipedia says that as of 2010, the student requirement is 2136. Japanese Kanji and Chinese have substantial overlap, but there is no way that you could squeeze both into 16 bits, without leaving out important stuff. Therefore, for use with current computers, 32 bits would be needed. Some games can be played with mixing sizes by doing things like setting high bit, for 128 7 bit characters plus 32768 15 bit characters, and 2147483648 31 bit characters. REAL COMPUTERS USE 18 BITS... RUNS BEN.
Re: Text encoding Babel. Was Re: George Keremedjiev
ASL is quite different than English... you can sign in English or you can sign in ASL. The ASL has a different sentence structure. When I was first learning about the Deaf Teletype revolution (We have a collection of a diverse group of TTY both mechanical and CRT and portable and ... I would correspond via email with a young person that sold us some ttys and wondered why it was almost a different sentence structure, almost like Yoda but if you look at both closely not really the exact same. Hard to explain... but English and ASL utilize 2 different Sentence structuring ... or so it appears to me. If you learn ASL and Signing well there is a good need for excellent interpreters out there. And yes, always looking for ANYTHING related to the history of TTY and other assistive communications devices. Ed# In a message dated 11/25/2018 5:46:55 PM US Mountain Standard Time, cctalk@classiccmp.org writes: There are still MANY schools arguing about whether to accept ASL (American Sign Language, as used by Deaf people). I would think that therefore, BSL (British Sign Language) should qualify
Re: e-mail, character sets, encodings (was Re: George Keremedjiev)
On 2018-11-25 7:45 PM, Bill Gunshannon via cctalk wrote:
> It's not a mailing list problem. It's not even a mail problem. It's a
>
> Mail User Agent problem. It is a display problem. It is up to the
>
> users mail program to display the email as it was sent. Unless the

Did you really double space this email like a high school essay? Don't see that every day. --T
Re: Text encoding Babel. Was Re: George Keremedjiev
On Mon, 26 Nov 2018, Tomasz Rola via cctalk wrote: To supply this train of thought with some numbers: - my copy of Common Lisp HyperSpec claims 978 symbols (i.e. words) on its alphabetical index; many words have modifiers (a.k.a. keyword options, with default values) which increases the number at least twofold, IMHO, if one agrees that each combo should be counted as different word, to which I would say yes - I have read somewhere that a Japanese pupil after graduating from elementary school is supposed to know 1000 kanjis by heart (there is a standardised set, I have a book) Would those "modifiers of words" qualify as ADJECTIVES? The Japanese phonetic alphabets, Katakana and Hiragana, have 46 letters each, almost twice that with diacritics. I have heard that Japanese Kanji has more than 50,000 words/characters (for which 16 bits would fit, but be a little risky). But, in practice, 1100 to 2000 words comprise most of common usage. Wikipedia says that as of 2010, the student requirement is 2136. Japanese Kanji and Chinese have substantial overlap, but there is no way that you could squeeze both into 16 bits, without leaving out important stuff. Therefore, for use with current computers, 32 bits would be needed. Some games can be played with mixing sizes by doing things like setting high bit, for 128 7 bit characters plus 32768 15 bit characters, and 2147483648 31 bit characters.
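Fred's "set the high bit" game can be made concrete. Note that his quoted counts (128 + 32768 + 2147483648) gloss over the framing: making the code self-delimiting costs one flag bit per extension, so the sketch below (my framing, not Fred's) yields 7-, 14-, and 30-bit characters instead:

```python
# A self-delimiting variant of the high-bit mixed-width scheme:
#   0xxxxxxx                      -> 7-bit character, 1 byte
#   10xxxxxx xxxxxxxx             -> 14-bit character, 2 bytes
#   11xxxxxx + three more bytes   -> 30-bit character, 4 bytes
def encode_char(cp: int) -> bytes:
    if cp < 1 << 7:
        return bytes([cp])
    if cp < 1 << 14:
        return bytes([0x80 | (cp >> 8), cp & 0xFF])
    if cp < 1 << 30:
        return bytes([0xC0 | (cp >> 24), (cp >> 16) & 0xFF,
                      (cp >> 8) & 0xFF, cp & 0xFF])
    raise ValueError("code point out of range")

def decode(buf: bytes) -> list:
    out, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:                                  # 1-byte form
            out.append(b); i += 1
        elif b < 0xC0:                                # 2-byte form
            out.append(((b & 0x3F) << 8) | buf[i + 1]); i += 2
        else:                                         # 4-byte form
            out.append(((b & 0x3F) << 24) | (buf[i + 1] << 16)
                       | (buf[i + 2] << 8) | buf[i + 3]); i += 4
    return out

sample = [0x41, 0x3041, 0x2A6B2]  # 'A', a hiragana, a rare CJK ideograph
assert decode(b"".join(map(encode_char, sample))) == sample
```

This is essentially the design pressure that produced UTF-8, which spends even more framing bits so that a decoder can also resynchronize mid-stream.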
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sun, Nov 25, 2018 at 04:46:50PM -0800, Fred Cisin via cctalk wrote: [...] > Is FORTRAN considered modern enough? [...] > What about APL? Although its structure is fairly straight-forward, > it does, indeed, have a unique character set. To supply this train of thought with some numbers: - my copy of Common Lisp HyperSpec claims 978 symbols (i.e. words) on its alphabetical index; many words have modifiers (a.k.a. keyword options, with default values) which increases the number at least twofold, IMHO, if one agrees that each combo should be counted as different word, to which I would say yes - I have read somewhere that Japanese pupil after graduating from elementary school is supposed to know 1000 kanjis by heart (there is a standardised set, I have a book) -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_r...@bigfoot.com **
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sun, 25 Nov 2018, Frank McConnell via cctalk wrote: I have been told that in the 1960s taking a course in FORTRAN programming fulfilled the foreign language requirement at UC Berkeley. Not currently, and I have some doubt about then. But, there are conflicting statements. One section requires that it be a MODERN language, but with specific exceptions for ASL and "classical languages, such as Latin and Greek". Is FORTRAN considered modern enough? There are still MANY schools arguing about whether to accept ASL (American Sign Language, as used by Deaf people). I would think that therefore, BSL (British Sign Language) should qualify. What about APL? Although its structure is fairly straight-forward, it does, indeed, have a unique character set.
Re: e-mail, character sets, encodings (was Re: George Keremedjiev)
It's not a mailing list problem. It's not even a mail problem. It's a Mail User Agent problem. It is a display problem. It is up to the user's mail program to display the email as it was sent. Unless the user doesn't want to see anything in character sets other than their favorite. Nothing along the way should change anything in an email message. The endpoint should receive whatever the beginning point sent out and either handle it or not. But it is the endpoint's responsibility to try to display it accurately. I often include in emails (and USENET posts) characters that are not part of ASCII or the English alphabet. I certainly don't want someone in between to modify what I send. bill On 11/25/18 7:00 PM, ED SHARPE via cctalk wrote: > Hi Frank and others- > Yea it is only here we have the problem. or at least this is the > only listserv that does not like it. > > I wondered if something could be handled at the listserv end or not > but I have little knowledge of listservs alas... > > Sad when people spent more time on characters rather than George the > museum archivist that passed away. > > > George worked his ass off to achieve what he did. > > Google him and read about his early days. You will be surprised and > you might find yourself thankful for how easy you had it. > > I did not know him all that well but I did provide his PDP-8 classic > with the plexis when He was first starting up It was a beauty and in the > 200 serial number range as I remember. We kept #18 classic Plexi for > SMECC > > I had not planned on selling it as always handy to have a #2 for an > offsite display and you do not have to disturb the in-house display but > George seems so focused and intense on making a museum too so who > could say no to that? I wish I had traveled to see his effort up > close. > > Project this week is to find someone with a UNIVAC 422 or the > predecessor UNIVAC Digital trainer. 
I can NOT BELIEVE I am fortunate enough > to be the only one with a UNIVAC 422! > > That is all for now... I think I hear a half of turkey and leftover > dressing in the refrig wailing to be consumed. > > Ed# www.smecc.org > > > > In a message dated 11/25/2018 4:32:34 PM US Mountain Standard Time, > cctalk@classiccmp.org writes: > > > Most mail servers sending inbound messages to the list include the encoding > > scheme in the header. The mailer program should process and translate the > email message body accordingly...in theory anyway. The set up and testing > of a sampling of encoding variations would reveal which interpreters were > missing in our particular list's relay process. Someone could create tests > with the most common 20 or so encoding schemes and a character set dump and > document the results etc. Anyone have the time for that? I don't really > think asking persons to fix their email program is the solution, it's a > mailing list fix/enhancement. I bet there is documentation on such a > procedure; I can't imagine we are the first to encounter this problem. It's > fixable > B
>> >> And really it's not just about the mail program, it's about the host >> operating system and the hardware on which it runs and which you are using >> to view e-mail. Heavy-metal characters are likely to look funny on a >> terminal built to display US-ASCII like an HP 2645. Your chances get >> better if the software has enough understanding of various Roman-language >> text encodings and you are using an HP 2622 with HP-ROMAN8 character >> support and the connection between your host and terminal is >> eight-bit-clean. But then you get something that uses Cyrillic and now >> you're looking at having another HP 2645 set up to do Russian. And hoping >> your host software knows how to deal with those character sets and >> encodings too! >> >> -Frank McConnell >> >> On Nov 25, 2018, at 9:55, ED SHARPE via cctalk wrote: >>> seems only the very old mail programs do not adapt to all character >> sets? >>> >>> In a message dated 11/25/2018 6:19:52 AM US Mountain Standard Time, >> cctalk@classiccmp.org writes: >>> >>>
Re: Text encoding Babel. Was Re: George Keremedjiev
We have a tendency to be remarkably ethnocentric. When you apply for a job, do you send them a copy of your RESUME? There is an exit on 280 for "La Canada" road. For most European languages (I did say MOST), an 8 bit extended ASCII could be adequate. "Recently" (1981), I was disappointed in IBM's character extensions for the 5150. We got smiley faces, but not even pound-sterling nor Yen! 16 bits would presumably be adequate for designing a character set for most phonetic alphabets. (I did say MOST). When I got my Epson HC-20's (like the HX-20, but including Katakana), and my Epson RC-20 (wristwatch, Z80-like, with RAM, ROM, and serial port), I started to try to learn a little Japanese. I didn't get very far, but I did at least learn the sounds of Katakana, and could sound out words written in it (a LOT of computer materials use Katakana for non-Japanese words, such as "monitor"). But full inclusion of pictographic languages (Kanji, etc.) would require more than 16 bits. -- Grumpy Ol' Fred ci...@xenosoft.com
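Fred's 16-bit estimate can be checked directly: the Katakana spelling of "monitor" that he mentions fits comfortably in 16 bits per character, while rarer CJK ideographs spill past U+FFFF. A short Python check (the specific example characters are my own picks):

```python
# Katakana spelling of "monitor" -- every code point fits in 16 bits
katakana_monitor = "モニター"
assert all(ord(ch) <= 0xFFFF for ch in katakana_monitor)

# A CJK Extension B ideograph lies beyond the 16-bit range
rare_ideograph = chr(0x20B9F)
assert ord(rare_ideograph) > 0xFFFF

print([hex(ord(c)) for c in katakana_monitor], hex(ord(rare_ideograph)))
```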
Re: e-mail, character sets, encodings (was Re: George Keremedjiev)
Hi Frank and others- Yea it is only here we have the problem. or at least this is the only listserv that does not like it. I wondered if something could be handled at the listserv end or not but I have little knowledge of listservs alas... Sad when people spent more time on characters rather than George the museum archivist that passed away. George worked his ass off to achieve what he did. Google him and read about his early days. You will be surprised and you might find yourself thankful for how easy you had it. I did not know him all that well but I did provide his PDP-8 classic with the plexis when He was first starting up It was a beauty and in the 200 serial number range as I remember. We kept #18 classic Plexi for SMECC I had not planned on selling it as always handy to have a #2 for an offsite display and you do not have to disturb the in-house display but George seems so focused and intense on making a museum too so who could say no to that? I wish I had traveled to see his effort up close. Project this week is to find someone with a UNIVAC 422 or the predecessor UNIVAC Digital trainer. I can NOT BELIEVE I am fortunate enough to be the only one with a UNIVAC 422! That is all for now... I think I hear a half of turkey and leftover dressing in the refrig wailing to be consumed. Ed# www.smecc.org In a message dated 11/25/2018 4:32:34 PM US Mountain Standard Time, cctalk@classiccmp.org writes: Most mail servers sending inbound messages to the list include the encoding scheme in the header. The mailer program should process and translate the email message body accordingly...in theory anyway. The set up and testing of a sampling of encoding variations would reveal which interpreters were missing in our particular list's relay process. Someone could create tests with the most common 20 or so encoding schemes and a character set dump and document the results etc. Anyone have the time for that? 
I dont really think asking persons to fix their email program is the solution, it's a mailing list fix/enhancement. I bet there is documentation on such a procedure I can't imagine we are the first to encounter this problem. It's fixable B On Sun, Nov 25, 2018, 3:24 PM Frank McConnell via cctalk < cctalk@classiccmp.org wrote: > Very old mail programs indeed have no understanding whatsoever of > character sets or encoding. They simply display data from the e-mail file > on stdout or equivalent. If you are lucky, the character set and encoding > in the e-mail match the character set and encoding used by your terminal. > > The early-to-mid-1990s MIME work was in some part about allowing e-mail to > indicate its character set and encoding, because at that point in time > there were many character sets and multiple encodings. Before that, you > had to figure them out from your correspondent's e-mail address and the > mess on your screen or printout. > > And really it's not just about the mail program, it's about the host > operating system and the hardware on which it runs and which you are using > to view e-mail. Heavy-metal characters are likely to look funny on a > terminal built to display US-ASCII like an HP 2645. Your chances get > better if the software has enough understanding of various Roman-language > text encodings and you are using an HP 2622 with HP-ROMAN8 character > support and the connection between your host and terminal is > eight-bit-clean. But then you get something that uses Cyrillic and now > you're looking at having another HP 2645 set up to do Russian. And hoping > your host software knows how to deal with those character sets and > encodings too! > > -Frank McConnell > > On Nov 25, 2018, at 9:55, ED SHARPE via cctalk wrote: > > > > seems only the very old mail programs do not adapt to all character > sets? 
> > > > > > In a message dated 11/25/2018 6:19:52 AM US Mountain Standard Time, > cctalk@classiccmp.org writes: > > > > > > > > > >> On Nov 21, 2018, at 4:46 PM, Bill Gunshannon via cctalk < > cctalk@classiccmp.org> wrote: > >> > >> > >>> On 11/21/18 5:19 PM, Fred Cisin via cctalk wrote: > >>> Ed, > >>> It is YOUR mail program that is doing the extraneous insertions, and > >>> then not showing them to you when you view your own messages. > >>> > >>> ALL of us see either extraneous characters, or extraneous spaces in > >>> everything that you send! > >>> I use PINE in a shell account, and they show up as a whole bunch of > >>> inappropriate spaces. > >>> > >>> Seriously, YOUR mail program is inserting extraneous stuff. > >>> Everybody? but you sees it. > >>> > >> > >> I don't. I didn't see it until someone replied with a > >> > >> copy of the offending text included. > >> > >> > >> bill > >> > > same here. i didnt see them until some replies included the text. > > > > kelly > > > >
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/25/18 3:53 PM, Liam Proven wrote: It's been enlightening! :-) Some I was ready for. E.g. In French or Spanish, both of which I can speak to some extent, letters like á or ó are not seen as separate letters: French would call them a-acute, an a with an acute accent. Ç is a c with a cedilla. Etc. If they are not seen as separate letters, then do their meanings change? Or is the different accent more for pronunciation? But in Swedish/Norwegian/Danish -- I speak basic Norwegian and rudimentary Swedish -- ø and å and ä and so on are not a or o with accents on: they are _different letters_ that come at the end of the alphabet. I assume that they have different meanings (if that applies to letters) and are as different as "A" and "q". Czech is like that. Š and Č and Ž and many more that my Mac can't readily type are _extra letters_ which come after the unmodified form in the alphabet. ~twitch~ I don't even know how to properly describe something that visually looks like letters (glyphs?) to me, but may be an imprecise simplification on my part. Without them, you can't write correct Czech. It's worse than writing English without the letter E. Usually you can guess but not always. Byt means flat, apartment; b y-acute t means the verb "to be". You can probably work that out, but you can't always. A restaurant menu would be hopelessly corrupted as both "raw" and "with cheese" are quite likely. Indeed. Sure, my office street name: Křižíkova K, r haček, i, z haček, i acute, k o v a. I had to zoom my font to see enough detail in Křižíkova, but it does look like things came through just like you describe. (They even made it through my shell script that I use to re-flow text in replies.) A hacek is like an upside down circumflex: ^ Also known as a caron. ACK Oh yes. It's quite a minefield. /me blinks and shakes his head. Czech keyboards have so many extra letters, the *numbers* are on shift combinations! ~chuckle~ Well yes. 
I believe Mr Corlett here rejects all mail from gmail.com -- except mine... ;-) ¯\_(ツ)_/¯ -- Grant. . . . unix || die
Re: Text encoding Babel. Was Re: George Keremedjiev
On Nov 25, 2018, at 15:44, Sean Conner wrote: > I even heard of a high school in Tennessee who said computer languages > fulfill the "foreign language requirements" ... who'da thunk? I have been told that in the 1960s taking a course in FORTRAN programming fulfilled the foreign language requirement at UC Berkeley. -Frank McConnell
Re: e-mail, character sets, encodings (was Re: George Keremedjiev)
On 11/25/18 4:32 PM, Bill Degnan via cctalk wrote: Most mail servers sending inbound messages to the list include the encoding scheme in the header. The mailer program should process and translate the email message body accordingly...in theory anyway. Most email handling programs don't need to bother with what the data is, as they just move the data. This largely includes email list managers. This really only becomes a concern if something is modifying part of the message (data) as it moves through the system. The set up and testing of a sampling of encoding variations would reveal which interpreters were missing in our particular list's relay process. cctalk is using Mailman, and I'm fairly sure that Mailman does handle this properly. Or if there is a bug, it has likely been found & resolved. In the event that a bug is found, I think that it would be best to report it upstream to Mailman so they can fix it, and then install the updates when they are released. Someone could create tests with the most common 20 or so encoding schemes and a character set dump and document the results etc. Anyone have the time for that? I doubt that this is necessary. Based on what I've seen, Mailman is handling the message (data) just fine. It's passing Ed's messages with the UTF-8 =C2=A0 (quoted-printable) encoded parts just fine. I don't really think asking persons to fix their email program is the solution I think that asking an end user to fix their email client is the most viable solution. it's a mailing list fix/enhancement. I disagree. I'm not convinced that this is a problem in email. I question how many people are seeing the symptoms -and- what email client they are using. If someone knowingly chooses to use an email client that doesn't support UTF-8, then ¯\_(ツ)_/¯ That's their choice. I just hope that they are informed in their choice. I bet there is documentation on such a procedure I can't imagine we are the first to encounter this problem. 
It's fixable If you really do think that this is a problem with the mailing list, I'd suggest bringing the problem up on the Mailman mailing list. Mark S. is very responsive and can help people fix problems / configurations in short order. -- Grant. . . . unix || die
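The `=C2=A0` sequence Grant mentions is the quoted-printable encoding of the two UTF-8 bytes of a no-break space (U+00A0), which is exactly the "extraneous spaces" some readers were seeing. Decoding it is a one-liner with Python's standard library:

```python
import quopri

# "=C2=A0" is quoted-printable for the UTF-8 bytes 0xC2 0xA0,
# i.e. U+00A0 NO-BREAK SPACE
raw = b"hello=C2=A0world"
text = quopri.decodestring(raw).decode("utf-8")
print(repr(text))  # 'hello\xa0world'
```

A client that stops after the quoted-printable step, without the UTF-8 decode, is the one that shows stray characters.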
Re: Text encoding Babel. Was Re: George Keremedjiev
It was thus said that the Great Bill Gunshannon via cctalk once stated: > > On 11/25/18 5:42 PM, Grant Taylor via cctalk wrote: > > On 11/23/18 5:52 AM, Peter Corlett via cctalk wrote: > >> Worse than that, it's *American* ignorance and cultural snobbery > >> which also affects various English-speaking countries. > > > > Please do not ascribe such ignorance with such a broad brush, at least > > not without qualifiers that account for people that do try to respect > > other people's cultures. > > > > > Q. What do you call someone who speaks three languages? > > A. Trilingual. > > Q. What do you call someone who speaks two languages? > > A. Bilingual. > > Q. What do you call someone who speaks one language? > > A. American. As an American, a friend of mine from Sweden (who himself speaks at least three languages) considered me multilingual. Of course, my other languages are BASIC, Assembly, C, Forth ... I even heard of a high school in Tennessee who said computer languages fulfill the "foreign language requirements" ... who'da thunk? > OK, it's a joke. (I'm American and speak 4 languages.) -spc (Who speaks English and perhaps a dozen words in German, but plenty of computer languages ... )
Re: e-mail, character sets, encodings (was Re: George Keremedjiev)
Most mail servers sending inbound messages to the list include the encoding scheme in the header. The mailer program should process and translate the email message body accordingly...in theory anyway. The set up and testing of a sampling of encoding variations would reveal which interpreters were missing in our particular list's relay process. Someone could create tests with the most common 20 or so encoding schemes and a character set dump and document the results etc. Anyone have the time for that? I don't really think asking persons to fix their email program is the solution, it's a mailing list fix/enhancement. I bet there is documentation on such a procedure; I can't imagine we are the first to encounter this problem. It's fixable B On Sun, Nov 25, 2018, 3:24 PM Frank McConnell via cctalk < cctalk@classiccmp.org wrote: > Very old mail programs indeed have no understanding whatsoever of > character sets or encoding. They simply display data from the e-mail file > on stdout or equivalent. If you are lucky, the character set and encoding > in the e-mail match the character set and encoding used by your terminal. > > The early-to-mid-1990s MIME work was in some part about allowing e-mail to > indicate its character set and encoding, because at that point in time > there were many character sets and multiple encodings. Before that, you > had to figure them out from your correspondent's e-mail address and the > mess on your screen or printout. > > And really it's not just about the mail program, it's about the host > operating system and the hardware on which it runs and which you are using > to view e-mail. Heavy-metal characters are likely to look funny on a > terminal built to display US-ASCII like an HP 2645. Your chances get > better if the software has enough understanding of various Roman-language > text encodings and you are using an HP 2622 with HP-ROMAN8 character > support and the connection between your host and terminal is > eight-bit-clean. 
But then you get something that uses Cyrillic and now > you're looking at having another HP 2645 set up to do Russian. And hoping > your host software knows how to deal with those character sets and > encodings too! > > -Frank McConnell > > On Nov 25, 2018, at 9:55, ED SHARPE via cctalk wrote: > > > > seems only the very old mail programs do not adapt to all character > sets? > > > > > > In a message dated 11/25/2018 6:19:52 AM US Mountain Standard Time, > cctalk@classiccmp.org writes: > > > > > > > > > >> On Nov 21, 2018, at 4:46 PM, Bill Gunshannon via cctalk < > cctalk@classiccmp.org> wrote: > >> > >> > >>> On 11/21/18 5:19 PM, Fred Cisin via cctalk wrote: > >>> Ed, > >>> It is YOUR mail program that is doing the extraneous insertions, and > >>> then not showing them to you when you view your own messages. > >>> > >>> ALL of us see either extraneous characters, or extraneous spaces in > >>> everything that you send! > >>> I use PINE in a shell account, and they show up as a whole bunch of > >>> inappropriate spaces. > >>> > >>> Seriously, YOUR mail program is inserting extraneous stuff. > >>> Everybody? but you sees it. > >>> > >> > >> I don't. I didn't see it until someone replied with a > >> > >> copy of the offending text included. > >> > >> > >> bill > >> > > same here. i didnt see them until some replies included the text. > > > > kelly > > > >
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/25/18 3:51 PM, Bill Gunshannon via cctalk wrote: Q. What do you call someone who speaks three languages? A. Trilingual. Q. What do you call someone who speaks two languages? A. Bilingual. Q. What do you call someone who speaks one language? A. American. Monolingual. OK, it's a joke. (I'm American and speak 4 languages.) I've heard it before. I know there are a LOT of monolingual people in the world that don't live in the U.S.A. But I'll guess that percentage wise, the U.S.A. is probably up there for monolingual people. -- Grant. . . . unix || die
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/25/18 6:06 PM, Chuck Guzis via cctalk wrote: > On 11/25/18 2:53 PM, Liam Proven via cctalk wrote: >> On Sun, 25 Nov 2018 at 23:42, Grant Taylor via cctalk >> wrote: >> >>> I bet you see all sorts of things that I'm ignorant of. >> It's been enlightening! > I routinely get Turkish and Greek spam in my mailbox--and I've gotten > Cyrillic-alphabet stuff as well. > > Shrug. We all live on the same planet. > I live in the US and while I see less of it now than I used to, at the University I used to get SPAM in Korean, Chinese, Japanese, Cyrillic, Arabic, Hebrew and a couple of times even Amharic. Thus the reason ASCII is no longer the "standard". bill
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/25/18 2:53 PM, Liam Proven via cctalk wrote: > On Sun, 25 Nov 2018 at 23:42, Grant Taylor via cctalk > wrote: > >> I bet you see all sorts of things that I'm ignorant of. > > It's been enlightening! I routinely get Turkish and Greek spam in my mailbox--and I've gotten Cyrillic-alphabet stuff as well. Shrug. We all live on the same planet. --Chuck
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sun, 25 Nov 2018 at 23:42, Grant Taylor via cctalk wrote: > I bet you see all sorts of things that I'm ignorant of. It's been enlightening! Some I was ready for. E.g. In French or Spanish, both of which I can speak to some extent, letters like á or ó are not seen as separate letters: French would call them a-acute, an a with an acute accent. Ç is a c with a cedilla. Etc. But in Swedish/Norwegian/Danish -- I speak basic Norwegian and rudimentary Swedish -- ø and å and ä and so on are not a or o with accents on: they are _different letters_ that come at the end of the alphabet. Czech is like that. Š and Č and Ž and many more that my Mac can't readily type are _extra letters_ which come after the unmodified form in the alphabet. Without them, you can't write correct Czech. It's worse than writing English without the letter E. Usually you can guess but not always. Byt means flat, apartment; b y-acute t means the verb "to be". You can probably work that out, but you can't always. A restaurant menu would be hopelessly corrupted as both "raw" and "with cheese" are quite likely. > > For example, right now, I am in my office in Křižíkova. I can't > > type that name correctly without Unicode characters, because the ANSI > > character set doesn't contain enough letters for Czech. > > Intriguing. Is there an old MS-DOS Code Page (or comparable technique) > that does encompass the necessary characters? Don't know. But I suspect there weren't many PCs here before the Velvet Revolution in 1989. Democracy came around the time of Windows 3.0 so there may not have been much of a commercial drive. > Would you please provide an example? Sure, my office street name: Křižíkova > (I'm curious if my email client > will display things properly.) K, r haček, i, z haček, i acute, k o v a. A hacek is like an upside down circumflex: ^ Also known as a caron. > Oh my. I had no idea that accent characters made such a difference. 
But > I consider that to be my personal ignorance living in the U.S.A. I do > NOT think it's anybody's fault but my own. I'll defend others if someone > tries to say that their native / local regional norm is the problem. Oh yes. It's quite a minefield. Czech keyboards have so many extra letters, the *numbers* are on shift combinations! > I will say that I think everybody has their own individual prerogative > to filter email as they see fit. They just need to know what they are > doing and own the fact that they might be causing unintentional harm. > > P.S. Resending from the correct email address. — A recent Thunderbird > update broke the Correct-Identity add-on. :-( Well yes. I believe Mr Corlett here rejects all mail from gmail.com -- except mine... ;-) -- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
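Liam's description of the haček (caron) maps directly onto Unicode's naming and normalisation: precomposed letters like ř decompose under NFD into a base letter plus a combining caron. A small standard-library check:

```python
import unicodedata

# The Czech letters from "Křižíkova" and their official Unicode names
for ch in "řží":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFD splits the precomposed letter into base letter + combining mark
assert unicodedata.normalize("NFD", "ř") == "r\u030C"  # r + COMBINING CARON
assert unicodedata.name("\u030C") == "COMBINING CARON"
```

This is why "Krizikova" is "usually close enough": stripping the combining marks leaves the base letters behind, at the cost of the distinctions Liam describes.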
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/25/18 5:42 PM, Grant Taylor via cctalk wrote: > On 11/23/18 5:52 AM, Peter Corlett via cctalk wrote: >> Worse than that, it's *American* ignorance and cultural snobbery >> which also affects various English-speaking countries. > > Please do not ascribe such ignorance with such a broad brush, at least > not without qualifiers that account for people that do try to respect > other people's cultures. > > Q. What do you call someone who speaks three languages? A. Trilingual. Q. What do you call someone who speaks two languages? A. Bilingual. Q. What do you call someone who speaks one language? A. American. OK, it's a joke. (I'm American and speak 4 languages.) bill
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/23/18 11:27 AM, Tomasz Rola via cctalk wrote: Well, that was low hanging fruit. But if he indeed turns it off and the problem is not gone, that will be a bit of puzzle. Will require some way to compare mailboxes in search of pattern in missing emails... Which may or may not be obvious... which will lead to more puzzles... oy maybe I should have stayed muted and let others do the job... I'd question modern anti-spam techniques like DMARC and DKIM. I'd suggest checking the mailing list to see if there is any information about bounces. You can probably see crumbs of missing messages in message flow (likely already happening), the References: & In-Reply-To: headers, and the list archive. -- Grant. . . . unix || die
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/23/18 4:12 AM, Liam Proven via cctalk wrote: That's English-language cultural snobbery. I don't think I'd go that far. I'd suspect it's an unfortunate false positive of a spam filtering technique that Guy uses. Does the technique have some negative side effects? Sure. Are said side effects intentional? I doubt it. I'm a native Anglophone but I live in a non-English speaking country, Czechia. I bet you see all sorts of things that I'm ignorant of. For example, right now, I am in my office in Křižíkova. I can't type that name correctly without Unicode characters, because the ANSI character set doesn't contain enough letters for Czech. Intriguing. Is there an old MS-DOS Code Page (or comparable technique) that does encompass the necessary characters? It can cope with some Western European letters needed for Spanish, French etc., but not even enough for the Norwegian letter ``ø''. So I can type the name of the district of Prague I'm in -- Karlín -- and you'll probably see that, but the street name, I am guessing not. Would you please provide an example? (I'm curious if my email client will display things properly.) Feel free to pick any example that you like so that you don't have to reveal information you might want to keep private. "Krizikova" is usually close enough but it's not correct. Those letters are important. E.g. "sýrové" means cheesy, but "syrové" means raw. That's a significant difference. Oh my. I had no idea that accent characters made such a difference. But I consider that to be my personal ignorance living in the U.S.A. I do NOT think it's anybody's fault but my own. I'll defend others if someone tries to say that their native / local regional norm is the problem. It matters to me and I'm not even Czech and don't speak it particularly well... Fair enough. 
So if you tried to mail me something at work -- the address I normally use, for instance for the Alphasmart Dana Wireless on the way to me from Baltimore right now -- and you get a reply saying "package for [streetname] undeliverable" in the subject -- you'd just reject it. That's basically discriminating against people who don't speak your language, and in my book, that's not OK. I will say that I think everybody has their own individual prerogative to filter email as they see fit. They just need to know what they are doing and own the fact that they might be causing unintentional harm. P.S. Resending from the correct email address. — A recent Thunderbird update broke the Correct-Identity add-on. :-( -- Grant. . . . unix || die
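To follow up on Grant's code-page question above: yes, the MS-DOS code page for Central Europe was CP852 (Latin-2), and it does cover Czech, while the Western "ANSI"/Latin-1 family Liam mentions does not. A sketch using Python's built-in codecs:

```python
street = "Křižíkova"

# CP852 (DOS Latin-2) was the usual code page for Czech, Polish, etc.
encoded = street.encode("cp852")
assert encoded.decode("cp852") == street

# Latin-1 (ISO-8859-1, Western European) has no ř or ž at all
try:
    street.encode("latin-1")
except UnicodeEncodeError:
    print("latin-1 cannot encode", street)
```

The catch, of course, is that a CP852 byte stream only displays correctly on a machine that also assumes CP852, which is exactly the pre-MIME guessing game Frank described earlier in the thread.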
Re: Text encoding Babel. Was Re: George Keremedjiev
On 11/23/18 5:52 AM, Peter Corlett via cctalk wrote: Worse than that, it's *American* ignorance and cultural snobbery which also affects various English-speaking countries. Please do not ascribe such ignorance with such a broad brush, at least not without qualifiers that account for people that do try to respect other people's cultures. The pound sign is not in US-ASCII, and the euro sign is not in ISO-8859-1, for example. Well, seeing as how ASCII, the /American/ Standard Code for Information Interchange, is inherently /American/, I don't personally fault it for not having currency symbols for other languages / regions. Instead, I consider ASCII to be a limited standard. Hence why so much effort has gone into other standards to overcome this, and other, limitation(s). I do not know for sure, but I'm confident that other regional character sets similarly lack characters / glyphs from languages outside their own region. I'm sure that there is room for a discussion of why ASCII is used as the underlying character set for network services and the constraints that imposes on international friends and colleagues. Amusingly, peering through my inbox in which I have mail in both Dutch and English, the only one with a UTF-8 subject line is in English. It was probably composed on a Windows box which "helpfully" turned a hyphen into an en-dash. I'm trying to NOT search my mailbox. I'd be more curious about the number of bodies that contain UTF-8 or UTF-16, which can encode more characters / glyphs. It's my understanding that without some quite modern extensions, non-ASCII is shunned in headers, including the Subject: header. P.S. Resending from the correct email address. — A recent Thunderbird update broke the Correct-Identity add-on. :-( -- Grant. . . . unix || die
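The "quite modern extensions" for non-ASCII headers that Grant alludes to are the MIME encoded-words of RFC 2047, which wrap UTF-8 text in an ASCII-safe `=?charset?encoding?...?=` envelope so headers stay 7-bit clean. Python's `email.header` module implements both directions; Liam's sýrové/syrové example makes a handy test subject:

```python
from email.header import Header, decode_header

# Sending side: an RFC 2047 encoded-word is pure ASCII on the wire
subject = Header("sýrové vs. syrové", charset="utf-8").encode()
print(subject)  # an ASCII-only =?utf-8?...?= token

# Receiving side: reverse the envelope
decoded = "".join(
    part.decode(charset) if isinstance(part, bytes) else part
    for part, charset in decode_header(subject)
)
assert decoded == "sýrové vs. syrové"
```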
e-mail, character sets, encodings (was Re: George Keremedjiev)
Very old mail programs indeed have no understanding whatsoever of character sets or encoding. They simply display data from the e-mail file on stdout or equivalent. If you are lucky, the character set and encoding in the e-mail match the character set and encoding used by your terminal.

The early-to-mid-1990s MIME work was in some part about allowing e-mail to indicate its character set and encoding, because at that point in time there were many character sets and multiple encodings. Before that, you had to figure them out from your correspondent's e-mail address and the mess on your screen or printout.

And really it's not just about the mail program, it's about the host operating system and the hardware on which it runs and which you are using to view e-mail. Heavy-metal characters are likely to look funny on a terminal built to display US-ASCII like an HP 2645. Your chances get better if the software has enough understanding of various Roman-language text encodings and you are using an HP 2622 with HP-ROMAN8 character support and the connection between your host and terminal is eight-bit-clean. But then you get something that uses Cyrillic and now you're looking at having another HP 2645 set up to do Russian. And hoping your host software knows how to deal with those character sets and encodings too!

-Frank McConnell

On Nov 25, 2018, at 9:55, ED SHARPE via cctalk wrote:
> seems only the very old mail programs do not adapt to all character
> sets?
>
> In a message dated 11/25/2018 6:19:52 AM US Mountain Standard Time,
> cctalk@classiccmp.org writes:
>
>> On Nov 21, 2018, at 4:46 PM, Bill Gunshannon via cctalk wrote:
>>
>>> On 11/21/18 5:19 PM, Fred Cisin via cctalk wrote:
>>> Ed,
>>> It is YOUR mail program that is doing the extraneous insertions, and
>>> then not showing them to you when you view your own messages.
>>>
>>> ALL of us see either extraneous characters, or extraneous spaces in
>>> everything that you send!
>>> I use PINE in a shell account, and they show up as a whole bunch of
>>> inappropriate spaces.
>>>
>>> Seriously, YOUR mail program is inserting extraneous stuff.
>>> Everybody? but you sees it.
>>
>> I don't. I didn't see it until someone replied with a
>> copy of the offending text included.
>>
>> bill
>
> same here. i didnt see them until some replies included the text.
>
> kelly
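Frank's point that the same raw bytes mean entirely different things under different character sets can be sketched in a few lines of Python (my illustration, not anything from the original mail; the sample bytes are the Russian word "privet" as a KOI8-R correspondent would send it):

```python
# The same six raw bytes, read under two different character sets.
# b"\xd0\xd2\xc9\xd7\xc5\xd4" is "привет" ("hello") encoded in KOI8-R.
raw = b"\xd0\xd2\xc9\xd7\xc5\xd4"

as_koi8 = raw.decode("koi8_r")     # what the sender meant: привет
as_latin1 = raw.decode("latin-1")  # what a Latin-1 terminal shows: ÐÒÉ×ÅÔ

print(as_koi8, "/", as_latin1)
```

Nothing in the bytes themselves says which reading is right; that is exactly the out-of-band declaration the MIME charset parameter later supplied.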
Re: George Keremedjiev
seems only the very old mail programs do not adapt to all character sets? In a message dated 11/25/2018 6:19:52 AM US Mountain Standard Time, cctalk@classiccmp.org writes: > On Nov 21, 2018, at 4:46 PM, Bill Gunshannon via cctalk > wrote: > > >> On 11/21/18 5:19 PM, Fred Cisin via cctalk wrote: >> Ed, >> It is YOUR mail program that is doing the extraneous insertions, and >> then not showing them to you when you view your own messages. >> >> ALL of us see either extraneous characters, or extraneous spaces in >> everything that you send! >> I use PINE in a shell account, and they show up as a whole bunch of >> inappropriate spaces. >> >> Seriously, YOUR mail program is inserting extraneous stuff. >> Everybody? but you sees it. >> > > I don't. I didn't see it until someone replied with a > > copy of the offending text included. > > > bill > same here. i didnt see them until some replies included the text. kelly
Re: Text encoding Babel. Was Re: George Keremedjiev
At 07:27 PM 23/11/2018 +0100, you wrote: >On Fri, Nov 23, 2018 at 07:01:17PM +0100, Liam Proven wrote: >> On Fri, 23 Nov 2018 at 18:54, Tomasz Rola via cctalk >> wrote: >> > >> > Turn off trashing mails with Unicode in Subject and see if this solves >> > a problem? >> >> *Loud laughter in the office* >> >> Well _played_, sir! > >Well, that was low hanging fruit. Yes, I should have pre-empted that one. But glad it gave someone a laugh. >But if he indeed turns it off and >the problem is not gone, that will be a bit of puzzle. It's not related. My cctalk filter runs before the UTF-8 trash filter, and I check the trashbin regularly. >Will require >some way to compare mailboxes in search of pattern in missing >emails... Which may or may not be obvious... which will lead to more >puzzles... oy maybe I should have stayed muted and let others do the >job... Here's one check. See attached screen-cap of cctalk emails. Usually many per day, but only one per day on the 15th & 16th Nov, none at all on the 17th. Did the list actually go silent then? It's possible by random ebb and flow, or maybe everyone was in shock over the awful Paradise fire death toll. Which may be over 1000, unless a lot of people listed as missing do turn up. Guy
Re: George Keremedjiev
> On Nov 21, 2018, at 4:46 PM, Bill Gunshannon via cctalk > wrote: > > >> On 11/21/18 5:19 PM, Fred Cisin via cctalk wrote: >> Ed, >> It is YOUR mail program that is doing the extraneous insertions, and >> then not showing them to you when you view your own messages. >> >> ALL of us see either extraneous characters, or extraneous spaces in >> everything that you send! >> I use PINE in a shell account, and they show up as a whole bunch of >> inappropriate spaces. >> >> Seriously, YOUR mail program is inserting extraneous stuff. >> Everybody? but you sees it. >> > > I don't. I didn't see it until someone replied with a > > copy of the offending text included. > > > bill > same here. i didnt see them until some replies included the text. kelly
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, Nov 23, 2018 at 11:44:23PM +0100, Tomasz Rola wrote:
[...]
> Just my wet phantasies about how such things work or might work. It
> only requires one lousy admin to make it true, or a good one fired and
> never to be heard from again.
>
> Perhaps asking your ISP could give you some clues. Perhaps this is
> even more horrific (micro black holes? aliens tuning in?) and wetter
> than my wettest dreams.

The huge problem with wet phantasies is that they take over and distract the dreamer. The first thing I should have asked: is this problem limited only to mails from cctalk? If yes, then the most probable culprit would be the list's server.

-- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_r...@bigfoot.com **
Re: Text encoding Babel. Was Re: George Keremedjiev
On Sat, Nov 24, 2018 at 08:56:09AM +1100, Guy Dunphy wrote:
> Resend, just in case that screen-cap image attachment fails. It is also here:
> http://everist.org/6F2a/cctalk_rcvd.png
>
> >Will require
> >some way to compare mailboxes in search of pattern in missing
> >emails... Which may or may not be obvious... which will lead to more
> >puzzles... oy maybe I should have stayed muted and let others do the
> >job...
>
> Here's one check. See attached screen-cap of cctalk emails. Usually many per
> day, but only one per day on the 15th & 16th Nov, none at all on the 17th.
> Did the list actually go silent then? It's possible by random ebb and flow,
> or maybe everyone was in shock over the awful Paradise fire death toll.
> Which may be over 1000, unless a lot of people listed as missing do turn up.

Ok, here is a copy-pasted fragment from my mutt's index view, limited to messages from cctalk & cctech (which hopefully shows what I expect). The first column is the message number in my mailbox (they are not consecutive because in between I got messages from other mailing lists and spammers):

3091 O Nov 13 Jon Elson via c ( 10) Re: Font for DEC indicator panels
3092 Nov 13 systems_glitch ( 60) Re: Looking for optical grid mouse pad
3106 O Nov 13 Jason Howe via ( 22) Re: Swap clarification (Was: bill was my
3166 O Nov 14 systems_glitch ( 40) Re: desoldering (was Re: VAX 9440)
3173 O Nov 14 Bill Degnan via ( 48) Re: desoldering (was Re: VAX 9440)
3192 O Nov 14 Ethan Dicks via ( 28) Re: TU58 tape formatter (was Re: rebuildi
3196 O Nov 14 William Sudbrin ( 15) RE: desoldering (was Re: VAX 9440)
3208 O Nov 14 Eric Smith via ( 17) Re: TU58 tape formatter (was Re: rebuildi
3216 O Nov 14 allison via cct ( 70) Re: TU58 tape formatter (was Re: rebuildi
3227 Nov 14 ED SHARPE via c ( 5) The fundamental building block of modern
3229 O Nov 14 Ethan Dicks via ( 17) Re: TU58 tape formatter (was Re: rebuildi
3277 Nov 14 Kevin Bowling v ( 10) HP 88780B density
3388 O Nov 15 Noel Chiappa vi ( 19) Re: Font for DEC indicator panels
3473 O Nov 16 Andrew Luke Nes ( 75) Re: early ANSI C drafts, pre-1989 standar
3816 O Nov 18 Toby Thain via ( 39) Re: Font for DEC indicator panels
3835 O Nov 18 Jerome H. Fine ( 137) Re: RT-11 DY install
3845 O Nov 18 Michael Brutman ( 40) VCF PNW 2019: Exhibitors needed!
3887 O Nov 19 Patrick Finnega ( 6) IBM 3270 Emulation Adapter (ISA)
3889 O Nov 18 jim stephens vi ( 26) Re: IBM 3270 Emulation Adapter (ISA)
3940 O Nov 19 Jim Brain via c ( 10) IND
3944 O Nov 19 Al Kossow via c ( 20) Re: IBM 3270 Emulation Adapter (ISA)
3953 Nov 19 dwight via ccta ( 9) What is windoes doing?
3954 Nov 19 Ethan via cctal ( 11) Re: What is windoes doing?
3965 Nov 19 geneb via cctal ( 27) Re: What is windoes doing?
3989 Nov 19 Bill Degnan via ( 40) Re: What is windoes doing?
3997 Nov 19 Alan Perry via ( 25) Removing PVA from a CRT
3999 Nov 19 Peter Coghlan v ( 17) Re: What is windoes doing?
4041 Nov 19 Alan Perry via ( 50) Re: Removing PVA from a CRT
4046 O Nov 19 jim stephens vi ( 38) Re: IND
4052 Nov 19 Sean Conner via ( 19) IEFBR14 (was Re: IND)
4053 O Nov 19 Sven Schnelle v ( 17) Re: HP-Apollo 9000/425t RAM
4054 O Nov 19 Dennis Boone vi ( 14) Re: IND
4066 Nov 19 dwight via ccta ( 25) Re: What is windoes doing?
4071 O Nov 19 dwight via ccta ( 45) Re: What is windoes doing?
4083 O Nov 19 Al Kossow via c ( 12) Battery warning in Falco terminals
4088 O Nov 19 Al Kossow via c ( 16) Re: Battery warning in Falco terminals
4095 O Nov 19 Eric Smith via ( 15) Re: IEFBR14 (was Re: IND)
4100 Nov 19 Alan Perry via ( 32) Re: Removing PVA from a CRT
4102 Nov 19 Alan Perry via ( 83) Re: Removing PVA from a CRT
4103 O Nov 19 ben via cctalk ( 19) Re: IEFBR14 (was Re: IND)
4113 O Nov 19 Douglas Taylor ( 11) Missing FORRTL
4118 O Nov 19 Jon Elson via c ( 10) Re: IND
4122 O Nov 19 Kevin McQuiggin ( 16) Re: IND

A quick comparison by eye: you seem to miss, for example, msg no 3277 and 4083.

no 3277:

-- From: Kevin Bowling via cctalk
-- To: "General Discussion: On-Topic and Off-Topic Posts"
-- Subject: HP 88780B density

I have a dual density 88780B. Is it possible to upgrade to quad density by acquiring/swapping boards? Or does someone have an 800bpi 9-track on SCSI I can borrow or buy? I have a pair of 1984 pdp11/70 UNIX SysV (R0, R1?) tapes that need to be archived.

Regards, Kevin

and no 4083:

-- From: Al Kossow via cctalk
-- To: "General Discussion: On-Topic and Off-Topic Posts"
-- Subject: Battery warning in Falco terminals

I've been helping the MAME guys simulate a TS-2624, which is a block mode HP emulating terminal. I had bought this a while ago, and never dumped the firmware. Unfortunately there is a large NiCd battery right in the middle of the board that leaked all over. I've taken some pictures which are up under falco on
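The comparison Tomasz is doing by eye can be automated. Here is a rough sketch (mine, with hypothetical file names, not his actual tooling) using Python's standard mailbox module: collect Message-ID headers from a local mbox and from a reference copy of the archive, then diff the sets.

```python
import mailbox

def message_ids(mbox_path):
    """Collect the Message-ID header of every mail in an mbox file."""
    return {m.get("Message-ID") for m in mailbox.mbox(mbox_path)}

def missing_from(local_ids, reference_ids):
    """Return the IDs present in the reference archive but absent locally."""
    return reference_ids - local_ids

# Hypothetical usage (file names are assumptions):
# gone = missing_from(message_ids("my-cctalk.mbox"),
#                     message_ids("list-archive.mbox"))
# print(len(gone), "messages never arrived")
```

Message-IDs are assigned by the sender and survive list processing, so they make a more reliable key than subject lines or dates.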
Re: Text encoding Babel. Was Re: George Keremedjiev
Resend, just in case that screen-cap image attachment fails. It is also here: http://everist.org/6F2a/cctalk_rcvd.png >Will require >some way to compare mailboxes in search of pattern in missing >emails... Which may or may not be obvious... which will lead to more >puzzles... oy maybe I should have stayed muted and let others do the >job... Here's one check. See attached screen-cap of cctalk emails. Usually many per day, but only one per day on the 15th & 16th Nov, none at all on the 17th. Did the list actually go silent then? It's possible by random ebb and flow, or maybe everyone was in shock over the awful Paradise fire death toll. Which may be over 1000, unless a lot of people listed as missing do turn up. Guy
Re: Text encoding Babel. Was Re: George Keremedjiev
At 06:54 PM 23/11/2018 +0100, you wrote: >On Fri, Nov 23, 2018 at 11:55:18AM +1100, Guy Dunphy via cctalk wrote: >[...] >> >> I see them because I'm using an old email client - Eudora 3 (1997.) >> I stick with this specifically _because_ it doesn't understand UTF-8 >> or any other non-ASCII coding, especially in the header, and hence >> simply ignores any executables in the headers or email body. Which >> makes it totally virus proof, unlike Microsoft's intentionally > >Totally say totally. Except it turns out some feel that rejecting UTF-8 is culturally insensitive. I agree they have a point. But for my practical purposes, all the 'UTF-8 in header' messages that end up in my trash folder are all, always, spam. I do check. (And now someone's going to start posting cctalk messages with UTF-8 in Subject, just watch.) >> open-backdoor junk like Outlook. And most other email 'modern >> wonders.' Eudora barely even understands html in emails, and I'm >> fine with that. Also I have it configured to dust-bin any incomimg >> mail containing UTF-8 chars in the Subject header. Avoids a lot of >> time-wasting. >[...] >> >> But first, I'm having a problem with some portion of cctalk posts >> going missing, ie I don't receive all messages. The ratio seems to >> vary day to day. Sometimes no obvious missing, sometimes a lot. >> Still don't know why, or how to fix this. Any suggestions? > >Turn off trashing mails with Unicode in Subject and see if this solves >a problem? Ha, I knew someone would say that. But no, I do check the email trash bin regularly (before emptying it) and so far no cctalk or cctech emails are being diverted to there. My filter for them runs before the UTF-filter (last.) I'm guessing it's an overly picky spam filter somewhere in the network routes into Australia. Guy
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, Nov 23, 2018 at 07:01:17PM +0100, Liam Proven wrote: > On Fri, 23 Nov 2018 at 18:54, Tomasz Rola via cctalk > wrote: > > > > Turn off trashing mails with Unicode in Subject and see if this solves > > a problem? > > *Loud laughter in the office* > > Well _played_, sir! Well, that was low hanging fruit. But if he indeed turns it off and the problem is not gone, that will be a bit of puzzle. Will require some way to compare mailboxes in search of pattern in missing emails... Which may or may not be obvious... which will lead to more puzzles... oy maybe I should have stayed muted and let others do the job... -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_r...@bigfoot.com **
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, 23 Nov 2018 at 18:54, Tomasz Rola via cctalk wrote: > > Turn off trashing mails with Unicode in Subject and see if this solves > a problem? *Loud laughter in the office* Well _played_, sir! -- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, Nov 23, 2018 at 11:55:18AM +1100, Guy Dunphy via cctalk wrote: [...] > > I see them because I'm using an old email client - Eudora 3 (1997.) > I stick with this specifically _because_ it doesn't understand UTF-8 > or any other non-ASCII coding, especially in the header, and hence > simply ignores any executables in the headers or email body. Which > makes it totally virus proof, unlike Microsoft's intentionally Totally say totally. > open-backdoor junk like Outlook. And most other email 'modern > wonders.' Eudora barely even understands html in emails, and I'm > fine with that. Also I have it configured to dust-bin any incomimg > mail containing UTF-8 chars in the Subject header. Avoids a lot of > time-wasting. [...] > > But first, I'm having a problem with some portion of cctalk posts > going missing, ie I don't receive all messages. The ratio seems to > vary day to day. Sometimes no obvious missing, sometimes a lot. > Still don't know why, or how to fix this. Any suggestions? Turn off trashing mails with Unicode in Subject and see if this solves a problem? -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_r...@bigfoot.com **
Re: George Keremedjiev
On Wed, Nov 21, 2018 at 07:20:25PM -0500, ED SHARPE via cctalk wrote: > wrong not everybody sees it this is the only list serve problems... > I suppose modern email programs either do not see or know what to do > with the characters... please consider using the delete key and not > reading things frI'm me if it bothers,you > thanks ed# > > Sent from AOL Mobile Mail To me, the problem is not with your emails (or anybody else's from this list), but the slow invasion performed by offending software. Since you pressed space once, it should be entered as single space, 0x20 in ASCII. If you pressed space twice, it should be entered into email written by you as two 0x20 bytes, and this is what should show on my side. My software receives some extra stuff from you, but not in a consistent manner, i.e. some ASCII spaces are prepended with extra two bytes and some not. I was not conscious about it - thought you had some peculiar space pressing manner or text postprocessor (like fmt) made double spaces in order to fit your lines into 130-characters width (because your lines were not folded at 79 or anywhere close). (In other words, it looks like everybody gets those extra bytes, only some programs choose to not show them, which - for me - is another problem and should be examined in due time). If what you press and what is being sent out to your recipients differs, then this is a problem, with potential security implications (as I learn with some horror, just anything in modern computer can turn against the owner, if he could be called owner at all). A software that mangles your input is not a friend. It should be terminated. Just MHO. -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_r...@bigfoot.com **
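The byte pattern Tomasz describes (an ASCII space with two extra bytes prepended, "some ASCII spaces are prepended with extra two bytes and some not") can be checked mechanically. A small sketch, mine rather than his actual method; the extra pair C2 A0 is the UTF-8 encoding of a no-break space:

```python
def find_padded_spaces(raw: bytes):
    """Offsets where the pair C2 A0 immediately precedes an ASCII 0x20 space."""
    needle = b"\xc2\xa0\x20"
    offsets, start = [], 0
    while (i := raw.find(needle, start)) != -1:
        offsets.append(i)
        start = i + 1
    return offsets

# Two padded spaces, as they would arrive on the wire:
print(find_padded_spaces(b"extra\xc2\xa0 classic\xc2\xa0 8"))  # [5, 15]
```

Running this over the raw (undecoded) message body shows exactly which typed spaces picked up the extra bytes and which did not.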
Re: George Keremedjiev
These ? characters often show up for users like me who read via the e-mailed digests. Kevin Anderson
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, Nov 23, 2018 at 12:12:32PM +0100, Liam Proven via cctalk wrote: > On Fri, 23 Nov 2018 at 01:55, Guy Dunphy via cctalk > wrote: [...] >> Also I have it configured to dust-bin any incomimg mail containing UTF-8 >> chars in the Subject header. Avoids a lot of time-wasting. > That's English-language cultural snobbery. I'm a native Anglophone but I live > in a non-English speaking country, Czechia. Worse than that, it's *American* ignorance and cultural snobbery which also affects various English-speaking countries. The pound sign is not in US-ASCII, and the euro sign is not in ISO-8859-1, for example. Amusingly, peering through my inbox in which I have mail in both Dutch and English, the only one with a UTF-8 subject line is in English. It was probably composed on a Windows box which "helpfully" turned a hyphen into an en-dash.
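The charset gaps mentioned above (no pound sign in US-ASCII, no euro sign in ISO-8859-1) are easy to demonstrate; a minimal Python sketch of my own:

```python
def encodable(ch: str, charset: str) -> bool:
    """True if the character exists in the given character set."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

print(encodable("£", "ascii"))        # False: no pound sign in US-ASCII
print(encodable("£", "iso-8859-1"))   # True: 0xA3 in Latin-1
print(encodable("€", "iso-8859-1"))   # False: the euro postdates Latin-1
print(encodable("€", "iso-8859-15"))  # True: Latin-9 added it at 0xA4
```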
Re: Text encoding Babel. Was Re: George Keremedjiev
On Fri, 23 Nov 2018 at 01:55, Guy Dunphy via cctalk wrote:
> Also I have it configured to
> dust-bin any incomimg mail containing UTF-8 chars in the Subject header.
> Avoids a lot of time-wasting.

That's English-language cultural snobbery. I'm a native Anglophone but I live in a non-English speaking country, Czechia.

For example, right now, I am in my office in Křižíkova. I can't type that name correctly without Unicode characters, because the ANSI character set doesn't contain enough letters for Czech. It can cope with some Western European letters needed for Spanish, French etc., but not even enough for the Norwegian letter ``ø''. So I can type the name of the district of Prague I'm in -- Karlín -- and you'll probably see that, but the street name, I am guessing not. "Krizikova" is usually close enough but it's not correct.

Those letters are important. E.g. "sýrové" means cheesy, but "syrové" means raw. That's a significant difference. It matters to me and I'm not even Czech and don't speak it particularly well...

So if you tried to mail me something at work -- the address I normally use, for instance for the Alphasmart Dana Wireless on the way to me from Baltimore right now -- and you get a reply saying "package for [streetname] undeliverable" in the subject -- you'd just reject it. That's basically discriminating against people who don't speak your language, and in my book, that's not OK.

> Takeaway: Ed, one space is enough.

Look, we haven't even been able to get him to quote correctly, so I suspect changing his typing habits is right out!

-- Liam Proven - Profile: https://about.me/liamproven Email: lpro...@cix.co.uk - Google Mail/Hangouts/Plus: lpro...@gmail.com Twitter/Facebook/Flickr: lproven - Skype/LinkedIn: liamproven UK: +44 7939-087884 - ČR (+ WhatsApp/Telegram/Signal): +420 702 829 053
Re: George Keremedjiev
On Thu, 22 Nov 2018, Robert Feldman wrote:
> BTW, we went through this about 6 months ago. Someone pointed out the
> strange characters in Ed's posts. No change resulted from that, however,
> and I doubt this thread will cause any change.

Yup, Ed is resistant to any form of advice. He could just install a real mail client on his mobile phone instead of using the crappy AOL client. ;-)

Christian
Text encoding Babel. Was Re: George Keremedjiev
At 10:33 PM 21/11/2018 -0500, ED SHARPE wrote:
>if I type an extra space I am sure every one sees it. but the chars not
>everyone sees them.
>what I do figure us the older email programs are not accepting of all charter
>sets? ( dunno if I am using the right term)
>
>Sent from AOL Mobile Mail

Ah ha! Mystery explained. I'm another who sees funny characters where Ed's mails contain "c2 a0". This is the UTF-8 encoding of a 'no-break space' character, which is NOT in the original ASCII set. See https://apps.timwhitlock.info/unicode/inspect/hex/c2/a0

I see them because I'm using an old email client - Eudora 3 (1997.) I stick with this specifically _because_ it doesn't understand UTF-8 or any other non-ASCII coding, especially in the header, and hence simply ignores any executables in the headers or email body. Which makes it totally virus proof, unlike Microsoft's intentionally open-backdoor junk like Outlook. And most other email 'modern wonders.' Eudora barely even understands html in emails, and I'm fine with that. Also I have it configured to dust-bin any incoming mail containing UTF-8 chars in the Subject header. Avoids a lot of time-wasting.

Anyway, I was wondering how Ed's emails (and sometimes others elsewhere) acquired that odd corruption. Answer: Ed's email util (AOL Mobile Mail, and probably various other 'content enhanced' email clients) interprets the user typing space twice in succession as meaning "I really, really want there to be a space here, no matter what." So it inserts a 'no-break space' unicode character, which of course requires a 2-byte UTF-8 encoding. Then adds a plain ASCII space 0x20 just to be sure.

Personally I find it more interesting than annoying. Just another example of the gradual chaotic devolution of ASCII, into a Babel of incompatible encodings. Not that ASCII was all that great in the first place.
It's also interesting that even on cctalk, where you'd think everyone would be aware of the differences between ASCII and later 'extensions', low level coding schemes, and the desirability of sticking to common standards, some are not.

Takeaway: Ed, one space is enough. I don't know how you got the idea people might miss seeing a single space, and so you need to type two or more. But it isn't so. The normal convention in plain text is one space character between each word. And since plain ASCII is hard-formatted, extra spaces are NOT ignored and make for wider spacing between words. Which looks very odd, even if your mail utility didn't try to do something 'special' with your unusual user input.

Btw, I changed the subject line, because this is a wider topic. I've been meaning to start a conversation about the original evolution of ASCII, and various extensions. Related to a side project of mine.

But first, I'm having a problem with some portion of cctalk posts going missing, ie I don't receive all messages. The ratio seems to vary day to day. Sometimes no obvious missing, sometimes a lot. Still don't know why, or how to fix this. Any suggestions?

Guy

>On Wednesday, November 21, 2018 Fred Cisin wrote:
>Ed,
>It is YOUR mail program that is doing the extraneous insertions, and
>then not showing them to you when you view your own messages.
>
>ALL of us see either extraneous characters, or extraneous spaces in
>everything that you send!
>I use PINE in a shell account, and they show up as a whole bunch of
>inappropriate spaces.
>
>Seriously, YOUR mail program is inserting extraneous stuff.
>Everybody? but you sees it.
>
>> who knows?  what mail program are you using that  does that?
>It is YOUR mail program that is "doing that"!!
>
>On Wed, 21 Nov 2018, ED SHARPE via cctalk wrote:
>
>> who knows?  what mail program are you using that  does that?
>>
>> In a message dated 11/21/2018 1:25:08 PM US Mountain Standard Time,
>> cctalk@classiccmp.org writes:
>>
>> At 02:03 PM 11/21/2018, ED SHARPE via cctalk wrote:
>>
>>> Ià soldà him myà extra classic 8à with the plexi covers on it... sn
>>> 200à seriesà weà keptà sn #18
>>
>> Side question: What process is turning non-blanking spaces into ISO-8859-1
>> circumflex-A for you?
>>
>> I see 'Ã' all throughout your emails.
>>
>> - John
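Guy's byte arithmetic checks out, and the client behaviour he infers can be sketched. A minimal sketch: the replacement rule is his inference about AOL Mobile Mail, not confirmed behaviour, but the UTF-8 encoding of the no-break space is exact.

```python
# U+00A0 (no-break space) encodes in UTF-8 as exactly the two bytes C2 A0.
assert "\u00a0".encode("utf-8") == b"\xc2\xa0"

def aol_style_mangle(text: str) -> str:
    """Hypothetical client rule: two typed spaces become NBSP + plain space."""
    return text.replace("  ", "\u00a0 ")

# Ed's "sn 200  series" (two spaces) arrives with the extra C2 A0 bytes:
mangled = aol_style_mangle("sn 200  series")
print(mangled.encode("utf-8"))  # b'sn 200\xc2\xa0 series'
```

An 8-bit, non-UTF-8-aware client like Eudora 3 then renders the C2 A0 pair as two Latin-1 characters ("Â" plus a no-break space), which is precisely the "Ã"/"Â" litter people report seeing.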
Re: George Keremedjiev
>Message: 10 >Date: Wed, 21 Nov 2018 16:17:27 -0500 >From: ED SHARPE >To: jfo...@threedee.com, cctalk@classiccmp.org, cctalk@classiccmp.org >Subject: Re: George Keremedjiev >Message-ID: <16738228ce4-1ebf-2...@webjas-vad240.srv.aolmail.net> >Content-Type: text/plain; charset=utf-8 > >who? knows?? ?what? mail program? are? you using that? ?does that? > > >In a message dated 11/21/2018 1:25:08 PM US Mountain Standard Time, >cctalk@classiccmp.org writes: > >? >At 02:03 PM 11/21/2018, ED SHARPE via cctalk wrote: > >>I? sold? him my? extra classic 8? with the plexi covers on it... sn 200? >>series? we? kept? sn #18 > >Side question: What process is turning non-blanking spaces into ISO-8859-1 >circumflex-A for you? > >I see '?' all throughout your emails. > >- John I get CCTalk in digest form and see the "?" in Ed's posts. Almost all (but strangely not all) of his posts are like that. I might occasionally see a strange extra character in someone else's post, but only rarely and then they usually are some non-English diacritical mark. BTW, we went through this about 6 months ago. Someone pointed out the strange characters in Ed's posts. No change resulted from that, however, and I doubt this thread will cause any change. Bob
Re: George Keremedjiev
- Original Message - From: "geneb via cctalk" To: "ED SHARPE" ; "General Discussion: On-Topic and Off-Topic Posts" Sent: Thursday, November 22, 2018 11:45 AM Subject: Re: George Keremedjiev > On Wed, 21 Nov 2018, ED SHARPE via cctalk wrote: > >> not much adjustments... may be easier if you just bypass my messages? >> >> Sent from AOL Mobile Mail >> > Maybe it's because many of us don't use a point-and-drool interface that > would give the user the chance to skip the message before being forced to > read it. > > Look, I get that you've decided that hundreds of people are wrong and it's > not your fault. How about we work on getting you to stop top posting > instead? ;) > > g. > > And proofreading a bit before pressing 'send'...
Re: George Keremedjiev
On Wed, 21 Nov 2018, ED SHARPE via cctalk wrote: not much adjustments... may be easier if you just bypass my messages? Sent from AOL Mobile Mail Maybe it's because many of us don't use a point-and-drool interface that would give the user the chance to skip the message before being forced to read it. Look, I get that you've decided that hundreds of people are wrong and it's not your fault. How about we work on getting you to stop top posting instead? ;) g. -- Proud owner of F-15C 80-0007 http://www.f15sim.com - The only one of its kind. http://www.diy-cockpits.org/coll - Go Collimated or Go Home. Some people collect things for a hobby. Geeks collect hobbies. ScarletDME - The red hot Data Management Environment A Multi-Value database for the masses, not the classes. http://scarlet.deltasoft.com - Get it _today_!
Re: George Keremedjiev
On Wed, 21 Nov 2018, ED SHARPE via cctalk wrote: who knows? what mail program are you using that does that? In a message dated 11/21/2018 1:25:08 PM US Mountain Standard Time, cctalk@classiccmp.org writes: At 02:03 PM 11/21/2018, ED SHARPE via cctalk wrote: I sold him my extra classic 8 with the plexi covers on it... sn 200 series we kept sn #18 Side question: What process is turning non-blanking spaces into ISO-8859-1 circumflex-A for you? I see 'Â' all throughout your emails. It's not his email client that's the problem, it's yours. It constantly inserts weird characters between words. I see the same problem in Alpine, and I've never seen the issue from any other sender. g. -- Proud owner of F-15C 80-0007 http://www.f15sim.com - The only one of its kind. http://www.diy-cockpits.org/coll - Go Collimated or Go Home. Some people collect things for a hobby. Geeks collect hobbies. ScarletDME - The red hot Data Management Environment A Multi-Value database for the masses, not the classes. http://scarlet.deltasoft.com - Get it _today_!
Re: George Keremedjiev
not much adjustments... may be easier if you just bypass my messages?

Sent from AOL Mobile Mail

On Wednesday, November 21, 2018 Fred Cisin wrote:

On Wed, 21 Nov 2018, ED SHARPE via cctalk wrote:
> wrong not everybody sees it this is the only list serve problems... I
> suppose modern email programs either do not see or know what to do with
> the characters... please consider using the delete key and not reading
> things frI'm me if it bothers,you
> thanks ed#

That is a very good hypothesis. "Modern" (bordering on profanity in this list) email programs might insert characters that we are not intended to notice in support of "features" (also bordering on profanity). When they encounter those special characters, they know to activate that "feature", and suppress their display. But email programs from "LAST MONTH" (prior to the "10 year rule"?) do NOT recognize, respect, nor understand those "modern" "control" characters. ("Modern" companies, such as Microsoft, Apple, AOL, etc. deprecate the use of any software or hardware that is not "current")

Email seems to be being handled like word processor file formats - what happens when you try to load a document from a current version program into a copy of a previous version of the program? You would never know there was an issue if everybody that you associate is using the same current programs.

Q: is line wrap ON or OFF in the program?
Q: is "format: flowed" ON or OFF?

Either/both might insert "non-breaking spaces". These do not seem to be adequately documented in this context - (differentiation between "bug" and "feature").
Re: George Keremedjiev
if I type an extra space I am sure every one sees it. but the chars not everyone sees them. what I do figure us the older email programs are not accepting of all charter sets? ( dunno if I am using the right term) Sent from AOL Mobile Mail On Wednesday, November 21, 2018 Fred Cisin wrote: Ed, It is YOUR mail program that is doing the extraneous insertions, and then not showing them to you when you view your own messages. ALL of us see either extraneous characters, or extraneous spaces in everything that you send! I use PINE in a shell account, and they show up as a whole bunch of inappropriate spaces. Seriously, YOUR mail program is inserting extraneous stuff. Everybody? but you sees it. > who knows? what mail program are you using that does that? It is YOUR mail program that is "doing that"!! On Wed, 21 Nov 2018, ED SHARPE via cctalk wrote: > who knows? what mail program are you using that does that? > > > In a message dated 11/21/2018 1:25:08 PM US Mountain Standard Time, > cctalk@classiccmp.org writes: > > > At 02:03 PM 11/21/2018, ED SHARPE via cctalk wrote: > >> I sold him my extra classic 8 with the plexi covers on it... sn 200 >> series we kept sn #18 > > Side question: What process is turning non-blanking spaces into ISO-8859-1 > circumflex-A for you? > > I see 'Â' all throughout your emails. > > - John
Re: George Keremedjiev
some blank spaces whereas us 2 instead of one is some times bad mr. hand Sent from AOL Mobile Mail On Wednesday, November 21, 2018 Fred Cisin wrote: Ed, It is YOUR mail program that is doing the extraneous insertions, and then not showing them to you when you view your own messages. ALL of us see either extraneous characters, or extraneous spaces in everything that you send! I use PINE in a shell account, and they show up as a whole bunch of inappropriate spaces. Seriously, YOUR mail program is inserting extraneous stuff. Everybody? but you sees it. > who knows? what mail program are you using that does that? It is YOUR mail program that is "doing that"!! On Wed, 21 Nov 2018, ED SHARPE via cctalk wrote: > who knows? what mail program are you using that does that? > > > In a message dated 11/21/2018 1:25:08 PM US Mountain Standard Time, > cctalk@classiccmp.org writes: > > > At 02:03 PM 11/21/2018, ED SHARPE via cctalk wrote: > >> I sold him my extra classic 8 with the plexi covers on it... sn 200 >> series we kept sn #18 > > Side question: What process is turning non-blanking spaces into ISO-8859-1 > circumflex-A for you? > > I see 'Â' all throughout your emails. > > - John
Re: George Keremedjiev
On Wed, 21 Nov 2018, ED SHARPE via cctalk wrote:
> wrong not everybody sees it this is the only list serve problems... I
> suppose modern email programs either do not see or know what to do with
> the characters... please consider using the delete key and not reading
> things frI'm me if it bothers,you
> thanks ed#

That is a very good hypothesis. "Modern" (bordering on profanity in this list) email programs might insert characters that we are not intended to notice in support of "features" (also bordering on profanity). When they encounter those special characters, they know to activate that "feature", and suppress their display. But email programs from "LAST MONTH" (prior to the "10 year rule"?) do NOT recognize, respect, nor understand those "modern" "control" characters. ("Modern" companies, such as Microsoft, Apple, AOL, etc. deprecate the use of any software or hardware that is not "current")

Email seems to be being handled like word processor file formats - what happens when you try to load a document from a current version program into a copy of a previous version of the program? You would never know there was an issue if everybody that you associate is using the same current programs.

Q: is line wrap ON or OFF in the program?
Q: is "format: flowed" ON or OFF?

Either/both might insert "non-breaking spaces". These do not seem to be adequately documented in this context - (differentiation between "bug" and "feature").