On 3/12/2012 10:24 AM, Martin Baldan wrote:

that is a description of random data, which, granted, doesn't apply to most
(compressible) data.
that wasn't really the point though.
I thought the original point was that there's a clear-cut limit to how
much redundancy can be eliminated from computing environments, and
that thousand-fold (and beyond) reductions in code size per feature
don't seem realistic. Then the analogy from data compression was used.
I think it's a pretty good analogy, but I don't think there's a
clear-cut limit we can estimate in advance, because meaningful data
and computations are not random to begin with. Indeed, there are
islands of stability where you've cut all the visible cruft and you
need new theoretical insights and new powerful techniques to reduce
the code size further.

this is possible, but it assumes, essentially, that one doesn't run into such a limit.

if one gets to a point where every "fundamental" concept is only ever expressed once, and everything is built from preceding fundamental concepts, then this is a limit, short of dropping fundamental concepts.

for example, I was able to devise a compression scheme which reduced
S-Expressions to only 5% of their original size. now what if I want 3%, or 1%?
this is not an easy problem. it is much easier to get from 10% to 5% than to
get from 5% to 3%.
I don't know, but there may be ways to reduce it much further if you
know more about the sexprs themselves. Or maybe you can abstract away
the very fact that you are using sexprs. For instance, if those sexprs
are a Scheme program for a tic-tac-toe player, you can say "write a
tic-tac-toe player in Scheme" and you capture the essence.

the sexprs were mostly related to scene-graph delta messages (one could compress a Scheme program, but that isn't really what it was needed for).

each expression basically tells about what is going on in the world at that moment (objects appearing and moving around, lights turning on/off, ...). so, basically, a semi-constant message stream.

the specialized compressor was doing better than Deflate, but it was also exploiting a lot more knowledge about the expressions: what the basic types are, how things fit together, ...

theoretically, about the only way to really do much better would be using a static schema (say, where the sender and receiver have a predefined set of message symbols, predefined layout templates, ...). personally though, I really don't like these sorts of compressors (they are very brittle, inflexible, and prone to version issues).
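as a sketch of the static-schema idea (the symbol names here are hypothetical, not the actual message format): sender and receiver share a fixed symbol table agreed on in advance, so each symbol shrinks to a one-byte index, but anything outside the table simply cannot be encoded, which is exactly the brittleness/versioning problem described above.

```python
# Static-schema encoder sketch: a predefined shared symbol table.
# Compact, but any symbol missing from the schema (say, one added in
# a newer sender version) fails to encode at all.

SCHEMA = ["object-new", "object-move", "light-on", "light-off"]  # hypothetical
SYM_TO_ID = {s: i for i, s in enumerate(SCHEMA)}

def encode(symbols):
    try:
        return bytes(SYM_TO_ID[s] for s in symbols)
    except KeyError as e:
        # a receiver with an older schema would hit the same wall
        raise ValueError(f"symbol not in shared schema: {e}") from None

def decode(data):
    return [SCHEMA[b] for b in data]
```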

this is essentially what "write a tic-tac-toe player in Scheme" implies:
both the sender and receiver of the message need to have a common notion of both "tic-tac-toe player" and "Scheme". otherwise, the message can't be decoded.

a more general strategy is basically to build a model "from the ground up", where the sender and receiver have only basic knowledge of basic concepts (the basic compression format), and most everything else is built on the fly based on the data which has been seen thus far (old data is used to build new data, ...).

in LZ77-based algos (Deflate: ZIP/GZ/PNG; LZMA: 7zip; ...), this takes the form of a "sliding window", where any recently seen character sequence is simply reused (via an offset/length run).
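a minimal LZ77-style sketch of that sliding window (toy parameters and token format, nothing like Deflate's actual bit-level encoding): unmatched bytes go out as literals, and any sequence already inside the window is replaced by an (offset, length) run.

```python
# Toy LZ77: reuse recently seen byte sequences via (offset, length) runs.

def lz77_compress(data, window=4096, min_match=3):
    out, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # search the sliding window for the longest match at position i
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            out.append(("run", best_off, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz77_decompress(tokens):
    buf = bytearray()
    for t in tokens:
        if t[0] == "lit":
            buf.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):   # byte-at-a-time copy handles
                buf.append(buf[-off])  # self-overlapping runs ("aaaa...")
    return bytes(buf)
```

note that the match is allowed to run past the current position, which is how a short seed like "abc" can expand into a long repetition with a single run token.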

in my case, it is built from primitive types (lists, symbols, strings, fixnums, flonums, ...).
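in the same spirit, here is a sketch of building the model on the fly from primitives (a hypothetical wire format, not the actual scene-graph codec): the first time a symbol appears it is sent spelled out, and every later occurrence is just an index into the dictionary both sides have been growing in lockstep.

```python
# Adaptive dictionary sketch: old data is used to encode new data,
# so sender and receiver only need to agree on the base format.

def adaptive_encode(symbols):
    table, out = {}, []
    for s in symbols:
        if s in table:
            out.append(("ref", table[s]))   # seen before: send an index
        else:
            table[s] = len(table)
            out.append(("def", s))          # first occurrence: send it whole
    return out

def adaptive_decode(tokens):
    table, out = [], []
    for kind, v in tokens:
        if kind == "def":
            table.append(v)                 # grow the same table as the sender
            out.append(v)
        else:
            out.append(table[v])
    return out
```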

I expect much of future progress in code reduction to come from
automated integration of different systems, languages and paradigms,
and this integration to come from widespread development and usage of
ontologies and reasoners. That way, for instance, you could write a
program in BASIC, and then some reasoner would ask you questions such
as "I see you used a GOTO to build a loop. Is that correct?" or "this
array is called 'clients', do you mean it as in server/client
architecture or in the business sense?". After a few questions like
that, the system would have a highly descriptive model of what your
program is supposed to do and how it is supposed to do it. Then it
would be able to write an equivalent program in any other programming
language. Of course, once you have such a system, there would be much
more powerful user interfaces than some primitive programming
language. Probably you would speak in natural language (or very close)
and use your hands to point at things. I know it sounds like full-on
AI, but I just mean an expert system for programmers.

and, of course, such a system would likely be, itself, absurdly complex...

this is partly the power of information entropy though:
it can't really be created or destroyed, only moved around from one place to another.

so, one can express things simply to a system, and it gives powerful outputs, but likely the system itself is very complex. one can express things to a simple system, but then the act of expression tends to be much more complex. in either case, the complexity is still there; only where it resides differs (it may also often be spread along the time axis as well).

the problem is, however, that one can't generally combine simple expression with a simple system; this tends to not really work.

although many current programs are, arguably, huge, the vast majority of the
code is likely still there for a reason, and is unlikely to be the result of
programmers just endlessly writing the same stuff over and over again, or
resulting from other simple patterns. rather, it is more likely piles of
special case logic and optimizations and similar.

I think one problem is that not "writing the same stuff over and over
again" is easier said than done. To begin with, other people's code
may not even be available (or not under a free license). But even if
it is, it may have used different names, different coding patterns,
different third-party libraries and so on, while still being basically
the same. And this happens even within the same programming language
and environment. Not to speak of all the plethora of competing
platforms, layering schemes, communication protocols, programming
languages, programming paradigms, programming frameworks and so on.
Everyone says "let's do it my way", and then "my system can host
yours", "same here", "let's make a standard", "let's extend the
standard", "let's make a cleaner standard", "now for real", "let's be
realistic and use the available standards", "let's not reinvent the
wheel, we need backwards compatibility", "backwards compatibility is a
drag, let's reinvent the wheel". Half-baked standards become somewhat
popular, and then they have to be supported.

And that's how you get a huge software stack. Redundancy can be
avoided in centralized systems, but in distributed systems with
competing standards that's the normal state. It's not that programmers
are dumb, it's that they can't agree on pretty much anything, and they
can't even keep track of each other's ideas because the community is
so huge.

and this is also partly why making everything smaller (while keeping its features intact) would likely end up looking a fair amount like data compression (it is compressing the code and semantics space).

one might have to deal with all of this, but with a smaller core. the result then is likely to look like a sort of "super chameleon", which reduces the redundancy by allowing itself to mimic nearly anything it comes across (nearly every possible protocol, data-format, API, programming language, ...).

such a beast, however, is not likely to end up being something which could be called "simple" (although it could still be considerably less complex and bulky than the combined result of everything it mimics).

some of this is also what makes my VM sub-project as complex as it is: it deals with a variety of problem cases, and each adds a little complexity, and all this adds up. likewise, some things, such as interfacing (directly) with C code and data, add more complexity than others (simpler and cleaner FFI makes the VM itself much more complex).

but, even as such, my VM is still tiny vs GCC or Mono (it is still more on par with other JavaScript VMs).

fonc mailing list
