Re: [OT] Effect of UTF-8 on 2G connections

2016-06-02 Thread Joakim via Digitalmars-d

On Wednesday, 1 June 2016 at 18:30:25 UTC, Wyatt wrote:

On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:

On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
It's not hard.  I think a lot of us remember when a 14.4 
modem was cutting-edge.


Well, then apparently you're unaware of how bloated web pages 
are nowadays.  It used to take me minutes to download popular 
web pages _back then_ at _top speed_, and those pages were a 
_lot_ smaller.


It's telling that you think the encoding of the text is 
anything but the tiniest fraction of the problem.  You should 
look at where the actual weight of a "modern" web page comes 
from.


I'm well aware that text is a small part of it.  My point is that 
they're not downloading those web pages; they're using mobile apps 
instead, as I explicitly said in a prior post.  My only point in 
mentioning the web bloat to you is that _your perception_ is off 
because you seem to think they're downloading _current_ web pages 
over 2G connections, and comparing it to your downloads of _past_ 
web pages with modems.  Not only did it take minutes for us back 
then, it takes _even longer_ now.


I know the text encoding won't help much with that.  Where it 
will help is the mobile apps they're actually using, not the 
bloated websites they don't use.



Codepages and incompatible encodings were terrible then, too.

Never again.


This only shows you probably don't know the difference between 
an encoding and a code page,


"I suggested a single-byte encoding for most languages, with 
double-byte for the ones which wouldn't fit in a byte. Use some 
kind of header or other metadata to combine strings of 
different languages, _rather than encoding the language into 
every character!_"


Yeah, that?  That's codepages.  And your exact proposal to put 
encodings in the header was ALSO tried around the time that 
Unicode was getting hashed out.  It sucked.  A lot.  (Not as 
bad as storing it in the directory metadata, though.)


You know what's also codepages?  Unicode.  The UCS is a 
standardized set of code pages for each language, often merely 
picking the most popular code page at that time.


I don't doubt that everything I'm suggesting has been tried in 
some form before.  The question is whether that alternate form 
would be better if designed and implemented properly, not whether 
a botched design/implementation has ever been attempted.
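
To make the idea concrete, here's a rough sketch of the kind of 
header scheme I mean (purely hypothetical, nothing standardized: 
the tag values and the single-byte Cyrillic table below are made 
up for illustration).  The string carries a small language/code-page 
tag up front and the payload stays one byte per character:

import std.algorithm : map;
import std.array : array;
import std.stdio : writefln;
import std.utf : byDchar;

enum Lang : ubyte { ascii, cyrillic, devanagari }  // made-up tags

struct TaggedString
{
    Lang lang;        // one-byte "code page" selector for the string
    ubyte[] payload;  // one byte per character within that script
}

void main()
{
    // "привет" stored as (code point - 0x0400) per character, i.e. a
    // made-up single-byte Cyrillic table: 6 payload bytes + 1 header byte.
    auto payload = "привет".byDchar
                           .map!(c => cast(ubyte)(c - 0x0400))
                           .array;
    auto tagged = TaggedString(Lang.cyrillic, payload);

    writefln("header scheme: %s bytes, UTF-8: %s bytes",
             1 + tagged.payload.length, "привет".length);  // 7 vs 12
}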


Well, when you _like_ a ludicrous encoding like UTF-8, not 
sure your opinion matters.


It _is_ kind of ludicrous, isn't it?  But it really is the 
least-bad option for the most text.  Sorry, bub.


I think we can do a lot better.


Maybe.  But no one's done it yet.


That's what people said about mobile devices for a long time, 
until about a decade ago.  It's time we got this right.


The vast majority of software is written for _one_ language, 
the local one.  You may think otherwise because the software 
that sells the most and makes the most money is 
internationalized software like Windows or iOS, because it can 
be resold into many markets.  But as a percentage of lines of 
code written, such international code is almost nothing.


I'm surprised you think this even matters after talking about 
web pages.  The browser is your most common string processing 
situation.  Nothing else even comes close.


No, it's certainly popular software, but at the scale we're 
talking about, i.e. all string processing in all software, it's 
fairly small.  And the vast majority of webapps that handle 
strings passed from a browser are written to only handle one 
language, the local one.


largely ignoring the possibilities of the header scheme I 
suggested.


"Possibilities" that were considered and discarded decades ago 
by people with way better credentials.  The era of single-byte 
encodings is gone, it won't come back, and good riddance to bad 
rubbish.


Lol, credentials. :D If you think that matters at all in the face 
of the blatant stupidity embodied by UTF-8, I don't know what to 
tell you.


I could call that "trolling" by all of you, :) but I'll 
instead call it what it likely is, reactionary thinking, and 
move on.


It's not trolling to call you out for clearly not doing your 
homework.


That's funny, because it's precisely you and others who haven't 
done your homework.  So are you all trolling me?  By your 
definition of trolling, which btw is not the standard one, _you_ 
are the one doing it.



I don't think you understand: _you_ are the special case.


Oh, I understand perfectly.  _We_ (whoever "we" are) can handle 
any sequence of glyphs and combining characters 
(correctly-formed or not) in any language at any time, so we're 
the special case...?


And you're doing so by mostly using a single-byte encoding for 
_your own_ Euro-centric languages, i.e. ASCII, while imposing 
unnecessary double-byte and triple-byte encodings on everyone 
else, despite their outnumbering you 10 to 1.  That is the very 
definition of a special case.
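
Just to put numbers on that, here's what UTF-8 charges per 
character in a few scripts (a quick check in D; .length counts 
UTF-8 code units, walkLength counts code points):

import std.range : walkLength;
import std.stdio : writefln;

void main()
{
    // ASCII stays at 1 byte/char, Cyrillic costs 2, Devanagari costs 3.
    foreach (s; ["hello", "привет", "नमस्ते"])
        writefln("%s: %s code points, %s bytes in UTF-8",
                 s, s.walkLength, s.length);
}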



Yeah, 

Re: [OT] Effect of UTF-8 on 2G connections

2016-06-01 Thread Wyatt via Digitalmars-d

On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:

On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
It's not hard.  I think a lot of us remember when a 14.4 modem 
was cutting-edge.


Well, then apparently you're unaware of how bloated web pages 
are nowadays.  It used to take me minutes to download popular 
web pages _back then_ at _top speed_, and those pages were a 
_lot_ smaller.


It's telling that you think the encoding of the text is anything 
but the tiniest fraction of the problem.  You should look at 
where the actual weight of a "modern" web page comes from.



Codepages and incompatible encodings were terrible then, too.

Never again.


This only shows you probably don't know the difference between 
an encoding and a code page,


"I suggested a single-byte encoding for most languages, with 
double-byte for the ones which wouldn't fit in a byte. Use some 
kind of header or other metadata to combine strings of different 
languages, _rather than encoding the language into every 
character!_"


Yeah, that?  That's codepages.  And your exact proposal to put 
encodings in the header was ALSO tried around the time that 
Unicode was getting hashed out.  It sucked.  A lot.  (Not as bad 
as storing it in the directory metadata, though.)


Well, when you _like_ a ludicrous encoding like UTF-8, not 
sure your opinion matters.


It _is_ kind of ludicrous, isn't it?  But it really is the 
least-bad option for the most text.  Sorry, bub.


I think we can do a lot better.


Maybe.  But no one's done it yet.

The vast majority of software is written for _one_ language, 
the local one.  You may think otherwise because the software 
that sells the most and makes the most money is 
internationalized software like Windows or iOS, because it can 
be resold into many markets.  But as a percentage of lines of 
code written, such international code is almost nothing.


I'm surprised you think this even matters after talking about web 
pages.  The browser is your most common string processing 
situation.  Nothing else even comes close.


largely ignoring the possibilities of the header scheme I 
suggested.


"Possibilities" that were considered and discarded decades ago by 
people with way better credentials.  The era of single-byte 
encodings is gone, it won't come back, and good riddance to bad 
rubbish.


I could call that "trolling" by all of you, :) but I'll instead 
call it what it likely is, reactionary thinking, and move on.


It's not trolling to call you out for clearly not doing your 
homework.



I don't think you understand: _you_ are the special case.


Oh, I understand perfectly.  _We_ (whoever "we" are) can handle 
any sequence of glyphs and combining characters (correctly-formed 
or not) in any language at any time, so we're the special case...?


Yeah, it sounds funny to me, too.

The 5 billion people outside the US and EU are _not the special 
case_.


Fortunately, it works for them too.

The problem is all the rest, and those just below who cannot 
afford it at all, in part because the tech is not as efficient 
as it could be yet.  Ditching UTF-8 will be one way to make it 
more efficient.


All right, now you've found the special case: the one where the 
generic, unambiguous encoding may need to be lowered to something 
else, because for those people it's suboptimal under _current_ 
network constraints.


I fully acknowledge it's a couple billion people and that's 
nothing to sneeze at, but I also see that it's a situation that 
will become less relevant over time.


-Wyatt


Re: [OT] Effect of UTF-8 on 2G connections

2016-06-01 Thread Joakim via Digitalmars-d

On Wednesday, 1 June 2016 at 14:58:47 UTC, Marco Leise wrote:

On Wed, 01 Jun 2016 13:57:27 +
Joakim wrote:

No, I explicitly said not the web in a subsequent post.  The 
ignorance here of what 2G speeds are like is mind-boggling.


I've used 56k and had a phone conversation with my sister while 
she was downloading an 800 MiB file over 2G. You just learn to 
be patient (or you already are when the next major city is 
hundreds of kilometers away) and load only what you need. Your 
point about the costs convinced me more.


I see that max 2G speeds are 100-200 kbit/s.  At that rate, it 
would have taken her more than 10 hours to download such a large 
file; that's nuts.  The worst part is when the download gets 
interrupted and you have to start over again because most 
download managers don't know how to resume, including the stock 
one on Android.
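
For anyone who wants to check that figure, the back-of-the-envelope 
math, assuming she actually held those peak rates the whole time 
(which she wouldn't have):

import std.stdio : writefln;

void main()
{
    enum double fileBits = 800.0 * 2.0 ^^ 20 * 8;  // 800 MiB, about 6.7e9 bits

    // Ideal transfer time at sustained peak 2G rates, ignoring all overhead.
    foreach (kbps; [200.0, 100.0])
        writefln("%3.0f kbit/s: %4.1f hours",
                 kbps, fileBits / (kbps * 1000) / 3600);
    // Roughly 9.3 hours at 200 kbit/s and 18.6 hours at 100 kbit/s.
}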


Also, people in these countries buy packs of around 100-200 MB 
for 30-60 US cents, so they would never download such a large 
file.  They use messaging apps like WhatsApp or WeChat, which 
nobody in the US uses, to avoid onerous SMS charges.


Here is one article spiced up with numbers and figures: 
http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third


Yes, only the middle class, which is at most 10-30% of the 
population in these developing countries, can even afford 2G.  
The way to get costs down even further is to make the tech as 
efficient as possible.  Of course, much of the rest of the 
population is illiterate, so there are bigger problems there.



But even if you could prove with a study that UTF-8 caused a
notable bandwidth cost in real life, it would - I think - be a
matter for regional ISPs to provide special servers and apps
that reduce data volume.


Yes, by ditching UTF-8.


There is also the overhead of
key exchange when establishing a secure connection:
http://stackoverflow.com/a/20306907/4038614
Something every app should do, but will increase bandwidth use.


That's not going to happen; even HTTP/2 ditched that 
requirement.  Also, many of those countries' govts will not 
allow it: google how BlackBerry had to give up their keys for 
"secure" BBM in many countries.  It's not just Canada and the 
US spying on their citizens.



Then there is the overhead of using XML in applications
like WhatsApp, which I presume is quite popular around the
world. I'm just trying to broaden the view a bit here.


I didn't know they used XML.  Googling it now, I see mention that 
they switched to an "internally developed protocol" at some 
point, so I doubt they're using XML now.



This note from the XMPP spec that WhatsApp and Jabber use will make
you cringe: https://tools.ietf.org/html/rfc6120#section-11.6


Haha, no wonder Jabber is dead. :) I jumped on Jabber for my own 
messages a decade ago, as it seemed like an open way out of that 
proprietary messaging mess, but then I read that they're using 
XML and gave up on it.
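
For a sense of why the XML grates: even a minimal XMPP message 
stanza spends most of its bytes on markup rather than the message 
itself.  The stanza below is one I made up for illustration, not 
a real capture; real traffic adds ids, resources, receipts and so 
on, so the ratio is usually worse:

import std.stdio : writefln;

void main()
{
    // Minimal, made-up message stanza: two bytes of actual message.
    enum stanza = `<message to='friend@example.com' type='chat'>` ~
                  `<body>ok</body></message>`;
    enum msg = "ok";

    writefln("payload: %s bytes, on the wire: %s bytes (%.0f%% markup)",
             msg.length, stanza.length,
             100.0 * (stanza.length - msg.length) / stanza.length);
}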


On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:

On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:


No, I explicitly said not the web in a subsequent post.  The 
ignorance here of what 2G speeds are like is mind-boggling.


It's not hard.  I think a lot of us remember when a 14.4 modem 
was cutting-edge.


Well, then apparently you're unaware of how bloated web pages are 
nowadays.  It used to take me minutes to download popular web 
pages _back then_ at _top speed_, and those pages were a _lot_ 
smaller.



Codepages and incompatible encodings were terrible then, too.

Never again.


This only shows you probably don't know the difference between an 
encoding and a code page, which are orthogonal concepts in 
Unicode.  It's not surprising, as Walter and many others 
responding show the same ignorance.  I explained this repeatedly 
in the previous thread, but it depends on understanding the tech, 
and I can't spoon-feed that to everyone.


Well, when you _like_ a ludicrous encoding like UTF-8, not 
sure your opinion matters.


It _is_ kind of ludicrous, isn't it?  But it really is the 
least-bad option for the most text.  Sorry, bub.


I think we can do a lot better.

No. The common string-handling use case is code that is 
unaware which script (not language, btw) your text is in.


Lol, this may be the dumbest argument put forth yet.
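
To spell out what's being claimed there: that the common case is 
generic code like the following, which never has to know which 
script it's handed (a trivial sketch, with made-up sample strings):

import std.array : split;
import std.range : walkLength;
import std.stdio : writefln;

// Script-unaware processing: the same code handles English, Russian,
// Hindi, or a mix, without ever checking which script it was given.
void report(string text)
{
    writefln("%s words, %s code points", text.split.length, text.walkLength);
}

void main()
{
    report("hello world");
    report("привет मित्र");  // Cyrillic and Devanagari mixed in one string
}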


This just makes it feel like you're trolling.  You're not just 
trolling, right?


Are you trolling?  Because I was just calling it like it is.

The vast majority of software is written for _one_ language, the 
local one.  You may think otherwise because the software that 
sells the most and makes the most money is internationalized 
software like Windows or iOS, because it can be resold into many 
markets.  But as a percentage of lines of code written, such 
international code is almost nothing.


I don't think anyone here even understands what a good 
encoding is and what it's for, which is why there's no 

[OT] Effect of UTF-8 on 2G connections

2016-06-01 Thread Marco Leise via Digitalmars-d
On Wed, 01 Jun 2016 13:57:27 +
Joakim wrote:

> No, I explicitly said not the web in a subsequent post.  The 
> ignorance here of what 2G speeds are like is mind-boggling.

I've used 56k and had a phone conversation with my sister
while she was downloading an 800 MiB file over 2G. You just
learn to be patient (or you already are when the next major
city is hundreds of kilometers away) and load only what you
need. Your point about the costs convinced me more.

Here is one article spiced up with numbers and figures:
http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third

But even if you could prove with a study that UTF-8 caused a
notable bandwidth cost in real life, it would - I think - be a
matter for regional ISPs to provide special servers and apps
that reduce data volume. There is also the overhead of
key exchange when establishing a secure connection:
http://stackoverflow.com/a/20306907/4038614
Something every app should do, but will increase bandwidth use.
Then there is the overhead of using XML in applications
like WhatsApp, which I presume is quite popular around the
world. I'm just trying to broaden the view a bit here.
This note from the XMPP spec that WhatsApp and Jabber use will make
you cringe: https://tools.ietf.org/html/rfc6120#section-11.6

-- 
Marco