"A couple of years ago we started this project called Squeak, which is
not simply an attempt to give the world a free Smalltalk, but an
attempt to give the world a bootstrapping mechanism for something much
better than Smalltalk, and when you fool around with Squeak, please,
please, think of it from that standpoint. Think of how you can
obsolete the damn thing by using its own mechanisms for getting the
next version of itself." - Alan Kay, The Computer Revolution Hasn't
Happened Yet, October 7, 1997, OOPSLA'97 Keynote.
On May 26, 2011, at 1:57 AM, Max OrHai wrote:
Have you looked at Jecel Assumpcao's SiliconSqueak? An awful lot can
be done on the cheap with modern FPGAs, so long as you don't stray
too far from the conventional CPU design space...
On May 26, 2011, at 2:46 AM, Casey Ransberger wrote:
Thanks for recommending Silicon Squeak. Jecel's project is so
awesome! And while I totally can't wait to have one:) I think what I
can do this year will likely be limited to integrating off the shelf
parts. That said, I'm hoping I can create something interesting even
with those constraints. I've bounced email back and forth with
Jecel, and I really like his point of view:)
Allow me to describe a few key features of our SiliconSqueak research
here, as most papers have not been published yet.
SiliconSqueak, like the Xerox Alto, is a microcoded processor. It has
many 32-bit cores, each with a stack and data cache, a minimal 5-stage
instruction pipeline and ring network connections. In microcode we
implement the Squeak bytecodes, so the processor behaves like the
software Squeak Virtual Machine and runs standard Squeak images
bit-identically.
Because we optimized the design for Squeak bytecodes, message sends
are particularly efficient. The use of microcode allows many other
bytecode or virtual machine systems to be implemented easily,
including Lisp, Python and Frank (either as a target assembly/bytecode
or even as the stack-oriented abstract machine itself). The Worlds mechanism
could also be implemented in microcode to achieve a significant speedup.
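To make the microcoded-bytecode idea concrete, here is a minimal sketch in Python of the kind of dispatch loop the microcode implements: fetch a bytecode, select its handler, and run it against an operand stack. The bytecode names and encoding are invented for illustration; they are not SiliconSqueak's actual microcode or Squeak's real bytecode set.

```python
def interpret(bytecodes, literals):
    """Interpret a tiny Squeak-like stack bytecode set (illustrative only)."""
    stack = []
    pc = 0
    while pc < len(bytecodes):
        op, arg = bytecodes[pc]
        pc += 1
        if op == "pushLiteral":          # push literal frame entry onto the stack
            stack.append(literals[arg])
        elif op == "send":               # arg = (selector, argument count)
            selector, nargs = arg
            args = [stack.pop() for _ in range(nargs)]
            receiver = stack.pop()
            # A real VM would do a full method lookup; only #+ is wired up here.
            if selector == "+":
                stack.append(receiver + args[0])
        elif op == "return":             # return top of stack to the caller
            return stack.pop()

# 3 + 4 expressed as bytecodes: push 3, push 4, send #+ with one argument
program = [("pushLiteral", 0), ("pushLiteral", 1),
           ("send", ("+", 1)), ("return", None)]
print(interpret(program, literals=[3, 4]))  # prints 7
```

In hardware, the body of each `if` branch corresponds to a short microcode routine, which is why other bytecode sets can be dropped in by rewriting those routines.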
Although you can emulate almost any system in microcode (as it is
Turing-complete), the further you go from the Squeak bytecode model,
the less efficient it might get. I guess a C compiler targeting the
microcode could be more than 4 times slower than on a processor
optimized to run C.
A typical SiliconSqueak FPGA or ASIC has a number of cores connected
with multiple ring networks to other cores, memory and high-speed
links, ranging from 4 x 3.1 Gbps to 88 x 28 Gbps. I think of it as a
roomful of Altos on a chip. Sending a message to an object (in Squeak
implemented with bytecodes) can be handled in many different ways by
the microcode and underlying hardware ring network. A message can be
sent to any object anywhere. The object can be in the cache, in the local
object heap, in another core or in cores reachable through the
external links to neighboring SiliconSqueak units and routed by the
hardware until it reaches the location of the object. If the object is
external to the interconnected clusters of SiliconSqueaks forming a
supercomputer, it can also be handed off to Smalltalk code running in
the image, that can then send it as IP packets over the internet to a
remote Squeak image (which itself may be running on a manycore
SiliconSqueak or a software-implemented Squeak VM).
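The routing decision described above can be sketched very roughly as a cascade of locality checks. All names and the three-way split are illustrative assumptions, not SiliconSqueak's actual microcode interface:

```python
def route_send(obj_location, local_core, cluster_cores):
    """Decide how a message send reaches its receiver object (sketch)."""
    if obj_location == local_core:
        return "local"     # in this core's cache or local object heap
    if obj_location in cluster_cores:
        return "ring"      # forwarded over the hardware ring network / links
    return "image"         # handed off to Smalltalk code in the image, which
                           # can send IP packets to a remote Squeak image

cluster = {0, 1, 2, 3}                       # cores reachable in hardware
print(route_send(2, local_core=2, cluster_cores=cluster))   # prints local
print(route_send(3, local_core=2, cluster_cores=cluster))   # prints ring
print(route_send(99, local_core=2, cluster_cores=cluster))  # prints image
```

The point of the cascade is that the sender's code is identical in all three cases; only the cost differs.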
A single Squeak image can at runtime distribute its objects among the
memories of all the cores of all SiliconSqueak processors. A message
send to a remote object would run code on the remote core, achieving
parallelism that can be transparent to the Squeak programmer. However,
it can also be utilized for explicit forms of parallelism. By
extending the functionality of the message send behavior different
(parallel) programming models can be accommodated (for example future
sends as in Actors).
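As an analogy for the future sends mentioned above, Python's standard concurrent.futures library shows the programming model: the send returns immediately with a future, and the sender blocks only when it actually needs the answer. This is an illustration of the model, not SiliconSqueak code:

```python
from concurrent.futures import ThreadPoolExecutor

def expensive_answer(n):
    # Stand-in for a method running on a remote core.
    return n * n

with ThreadPoolExecutor() as pool:
    future = pool.submit(expensive_answer, 12)  # non-blocking "future send"
    # ... the sender keeps doing other work here ...
    print(future.result())                      # blocks only now; prints 144
```

On SiliconSqueak the scheduling would be done by the microcode and ring network rather than an operating-system thread pool, but the programmer-visible shape is the same.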
Other auto-tuning algorithms can (transparently) redistribute objects
among the cores to implement other ways to exploit fine- or
coarse-grained parallelism. It is also an option to have multiple
images running on the system, some on a single core, some using
multiple cores. Communicating with external Squeak images on VMs on
Unix or other operating systems will appear, from the programmer's
point of view, as transparent as communication among SiliconSqueak cores.
SiliconSqueak is a power efficient, parallel, reconfigurable
architecture optimized for adaptive compilation. The system includes a
mix of basic and extended processors, where these extensions are
configurable accelerators like the 64 ALU matrix. Other accelerators
like a vector processor, a graphics processor, an MPEG encoder/decoder,
an encrypt/decrypt processor or an FPU can be implemented in hardware.
These accelerators can also be implemented in microcode with the 64
ALU matrix. As the accelerators communicate through the ring
network, they can be daisy-chained to form larger matrices.
In FPGA implementations the ratio of SiliconSqueak cores to
accelerators can change at runtime under the control of the adaptive
compilation. Bytecodes are initially interpreted, can then be
recompiled into microcoded polymorphic inline caches, and a second
recompilation can implement the code on the 64 ALU matrix. A third
level of analysis can identify hotspots and reconfigure the FPGA
hardware accordingly while the code is still running, replacing some
SiliconSqueak cores with 64 ALU matrices or other configurable
accelerators, and vice versa.
In the much faster ASIC implementations you are stuck with the choices
Jecel and I will make, based on the results of running many Squeak
images and letting the adaptive compilation help find the optimal
balance between the number of cores and configurable accelerators.
We ourselves will produce some low-cost FPGA and ASIC systems this
quarter: first a scalable cluster of 8-core SiliconSqueak FPGAs that
you can interconnect with backplanes into supercomputers, then a very
low-cost 16-core ASIC version with 10 Gbps Thunderbolt optical links.
But you can just as easily use one of the many medium- and high-cost
FPGA development boards out there, ranging from $80 (1 core) to
$25,000 (400 cores, 88 x 28 Gbps for a total of 2.7 Tbps per chip).
Most brands of FPGA can accommodate SiliconSqueak easily. An FPGA
implementation is unlikely to fall below $10 per core any time soon,
but ASICs can reach a fraction of a cent per
core. In future WSI (Wafer Scale Integration) the number of
cores and size of cache or memory will be fixed, although a changing
percentage of these cores will not function because of the unavoidable
damage to some parts on the wafer. Depending on the amount of memory
per core, on the size of wafer (8-30 inch) and the process used,
thousands to millions of cores are possible. Eventually, a computer
the size and price of an iPad with a million cores is entirely
feasible.
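As a back-of-envelope check on the wafer-scale numbers, here is the arithmetic with an invented core-plus-cache area and yield figure; neither is a measured SiliconSqueak value:

```python
import math

def wafer_cores(wafer_diameter_mm, core_area_mm2, working_fraction):
    """Usable cores on one wafer: area / core area, scaled by yield."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    return int(wafer_area / core_area_mm2 * working_fraction)

# A 300 mm (12 inch) wafer, assuming 0.1 mm^2 per core-plus-cache and
# 70% of cores surviving wafer defects: roughly half a million cores.
print(wafer_cores(300, 0.1, 0.7))
```

Larger wafers, smaller processes, or less memory per core push the count toward the millions mentioned above; more on-wafer memory per core pulls it down toward the thousands.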
Casey, why not just build it?
Our small, self funded research group designed these SiliconSqueak
hardware systems for the next stages of the research in massively
parallel, message passing, late bound dynamic software systems,
growing further and further away from standard Squeak while retaining
backwards compatibility without letting it restrict us.
We welcome anyone who wants to implement their own bytecode
language systems like Frank, Lisp, etc., and we invite people to
collaborate with us on our own Squeak-based research.
Merik Voswinkel
_______________________________________________
fonc mailing list
[email protected]
http://vpri.org/mailman/listinfo/fonc