"A couple of years ago we started this project called Squeak, which is not simply an attempt to give the world a free Smalltalk, but an attempt to give the world a bootstrapping mechanism for something much better than Smalltalk, and when you fool around with Squeak, please, please, think of it from that standpoint. Think of how you can obsolete the damn thing by using its own mechanisms for getting the next version of itself." - Alan Kay, The Computer Revolution Hasn't Happened Yet, October 7, 1997, OOPSLA'97 Keynote.

On May 26, 2011, at 1:57 AM, Max OrHai wrote:

Have you looked at Jecel Assumpcao's SiliconSqueak? An awful lot can be done on the cheap with modern FPGAs, so long as you don't stray too far from the conventional CPU design space...

On May 26, 2011, at 2:46 AM, Casey Ransberger wrote:

Thanks for recommending Silicon Squeak. Jecel's project is so awesome! And while I totally can't wait to have one:) I think what I can do this year will likely be limited to integrating off the shelf parts. That said, I'm hoping I can create something interesting even with those constraints. I've bounced email back and forth with Jecel, and I really like his point of view:)


Allow me to describe a few key features of our SiliconSqueak research here, as most of the papers have not yet been published.

SiliconSqueak, like the Xerox Alto, is a microcoded processor. It has many 32-bit cores, each with a stack and data cache, a minimal 5-stage instruction pipeline, and ring networks. In microcode we implement the Squeak bytecodes, so the processor behaves bit-identically to the software Squeak Virtual Machine when running standard Squeak images. Because the design is optimized for Squeak bytecodes, message sends are particularly efficient. The use of microcode also allows many other bytecode or virtual machine systems to be implemented easily, including Lisp, Python, and Frank (as a target assembly/bytecode, or even the stack-oriented abstract machine itself). The Worlds mechanism could also be implemented in microcode for a significant speedup. Although you can emulate almost any system in microcode (it is Turing-complete), the further you stray from the Squeak bytecode model, the less efficient it gets. I estimate that C code compiled to this microcode could run more than four times slower than on a processor optimized to run C.
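To give a feel for what "implementing the bytecodes in microcode" means, here is a minimal Python sketch of a dispatch loop where each handler stands in for one microcode routine. The opcodes, handler names, and the tiny three-bytecode program are illustrative assumptions, not the actual Squeak bytecode set.

```python
# Sketch of a bytecode dispatch loop; each handler stands in for one
# microcode routine. Opcodes and handlers are illustrative, not the
# real Squeak bytecode set.

class MicroVM:
    def __init__(self, code, literals):
        self.code = code          # bytecode stream
        self.literals = literals  # literal frame (constants, selectors)
        self.stack = []           # per-core operand stack
        self.ip = 0               # instruction pointer

    def run(self):
        # One iteration = one microcoded bytecode routine.
        dispatch = {
            0x20: self.push_literal,  # push literal[n]
            0xB0: self.send_plus,     # special send: +
            0x7C: self.return_top,    # return top of stack
        }
        while True:
            op = self.code[self.ip]
            self.ip += 1
            result = dispatch[op]()
            if result is not None:
                return result

    def push_literal(self):
        n = self.code[self.ip]
        self.ip += 1
        self.stack.append(self.literals[n])

    def send_plus(self):
        # In hardware this would be a microcoded message send,
        # checking an inline cache before a full method lookup.
        b = self.stack.pop()
        a = self.stack.pop()
        self.stack.append(a + b)

    def return_top(self):
        return self.stack.pop()

# "3 + 4": push literal 0, push literal 1, send +, return top
vm = MicroVM(bytes([0x20, 0, 0x20, 1, 0xB0, 0x7C]), [3, 4])
```

The point of the sketch is the structure: the dispatch table is fixed in microcode, so swapping it out retargets the same cores to a different bytecode set (Lisp, Python, Frank) without touching the hardware.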

A typical SiliconSqueak FPGA or ASIC has a number of cores connected by multiple ring networks to other cores, memory, and high-speed links, ranging from 4 × 3.1 Gbps to 88 × 28 Gbps. I think of it as a roomful of Altos on a chip.

Sending a message to an object (implemented in Squeak with bytecodes) can be handled in many different ways by the microcode and the underlying hardware ring network. A message can be sent to any object anywhere. The receiver can be in the cache, in the local object heap, in another core, or in cores reachable through the external links to neighboring SiliconSqueak units, with the send routed by the hardware until it reaches the location of the object. If the object is external to the interconnected clusters of SiliconSqueaks that form a supercomputer, the send can also be handed off to Smalltalk code running in the image, which then forwards it as IP packets over the internet to a remote Squeak image (itself possibly running on a manycore SiliconSqueak or a software Squeak VM). A single Squeak image can, at runtime, distribute its objects among the memories of all cores of all SiliconSqueak processors. A message send to a remote object runs code on the remote core, achieving parallelism that can be transparent to the Squeak programmer. However, it can also be used for explicit forms of parallelism. By extending the message send behavior, different (parallel) programming models can be accommodated (for example, future sends as in Actors). Auto-tuning algorithms can (transparently) redistribute objects among the cores to exploit fine- or coarse-grained parallelism in other ways. It is also an option to run multiple images on the system, some on a single core, some using multiple cores. Communicating with external Squeak images on VMs under Unix or other operating systems will, from the programmer's point of view, appear as transparent as communicating among SiliconSqueak cores.
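The escalation path a send can take (cache, local heap, ring network, hand-off to the image for IP delivery) can be sketched as a simple routing policy. This is a hypothetical Python model of the decision, not the actual microcode; the function name, the directory structure, and the tier labels are my assumptions.

```python
# Hypothetical routing policy for a message send. The tiers mirror the
# text: data cache, local object heap, hardware ring network to a
# sibling core, and finally a hand-off to Smalltalk code in the image,
# which forwards the send as IP packets to a remote Squeak image.

def route_send(oop, local_cache, local_heap, ring_directory):
    """Return (tier, destination) for the object pointer `oop`."""
    if oop in local_cache:
        return ("cache", None)          # fastest path, handled in-core
    if oop in local_heap:
        return ("heap", None)           # local memory, no network hop
    core = ring_directory.get(oop)      # hardware routing over the ring
    if core is not None:
        return ("ring", core)
    # Object lives outside the interconnected clusters: hand off to the
    # image, which sends it over the internet to a remote Squeak image.
    return ("image", "ip")

# Example: object "a" is cached, "d" is unknown locally.
print(route_send("a", {"a"}, set(), {}))   # handled from the cache
print(route_send("d", set(), set(), {}))   # escalated to the image
```

Because every tier presents the same "send a message to an object" interface, the programmer need not know which branch was taken; that is the transparency claimed above.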

SiliconSqueak is a power-efficient, parallel, reconfigurable architecture optimized for adaptive compilation. The system includes a mix of basic and extended processors, where the extensions are configurable accelerators like the 64 ALU matrix. Other accelerators, such as a vector processor, a graphics processor, an MPEG encoder/decoder, an encrypt/decrypt processor, or an FPU, can be implemented in hardware. These accelerators can also be implemented in microcode on the 64 ALU matrix. As the accelerators communicate through the ring network, they can be daisy-chained to form larger matrices. In FPGA implementations, the ratio of SiliconSqueak cores to accelerators can change at runtime under control of the adaptive compilation. Bytecodes are initially interpreted, then recompiled into microcoded polymorphic inline caches. A second recompilation can implement the code on the 64 ALU matrix. A third-level analysis can identify hotspots and reconfigure the FPGA hardware accordingly while the code is still running, replacing some SiliconSqueak cores with 64 ALU matrices or other configurable accelerators, and vice versa.
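The three recompilation levels can be pictured as a tier ladder driven by a hotness counter. The following Python sketch is only an illustration of that policy; the tier names and the threshold values are invented for the example and are not SiliconSqueak's actual tuning parameters.

```python
# Sketch of tier promotion in the adaptive compiler. Thresholds and
# tier names are illustrative assumptions, not measured values.

INTERPRET, PIC, ALU_MATRIX, FPGA_RECONFIG = range(4)

# Invocation counts at which a method is promoted to the next tier:
# microcoded polymorphic inline caches, then the 64 ALU matrix, then
# runtime FPGA reconfiguration for the hottest code.
THRESHOLDS = {PIC: 10, ALU_MATRIX: 1_000, FPGA_RECONFIG: 100_000}

def tier_for(invocation_count):
    """Pick the compilation tier for a method by how hot it is."""
    tier = INTERPRET
    for next_tier, threshold in THRESHOLDS.items():
        if invocation_count >= threshold:
            tier = next_tier
    return tier

# A method climbs the ladder as it gets hotter:
for count in (5, 50, 5_000, 500_000):
    print(count, tier_for(count))
```

The last tier is what distinguishes the FPGA version: instead of only recompiling software, the system can rebalance the hardware itself, trading cores for accelerators while the image keeps running.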

In the much faster ASIC implementations, you are stuck with the choices Jecel and I make, based on the results of running many Squeak images and letting the adaptive compilation help find the optimal balance between the number of cores and configurable accelerators.

We ourselves will produce some low-cost FPGA and ASIC systems this quarter. First, a scalable cluster of 8-core SiliconSqueak FPGAs that you can interconnect with backplanes into supercomputers. Next, a very low-cost 16-core ASIC version with 10 Gbps Thunderbolt optical links. But you can just as easily use the many other medium- and high-cost FPGA development boards out there, ranging from $80 (1 core) to $25,000 (400 cores, 88 × 28 Gbps for a total of 2.7 Tbps per chip). Most brands of FPGA can accommodate SiliconSqueak easily. The price of an FPGA implementation is unlikely to fall below $10 per core any time soon, but ASIC cores can reach fractions of a dollar cent. In future WSI (Wafer Scale Integration) implementations, the number of cores and the size of cache or memory will be fixed, although a varying percentage of the cores will not function because of unavoidable damage to some parts of the wafer. Depending on the amount of memory per core, the size of the wafer (8-30 inch), and the process used, thousands to millions of cores are possible. Eventually, a computer the size and price of an iPad with a million cores is entirely feasible.

Casey, why not just build it?

Our small, self-funded research group designed these SiliconSqueak hardware systems for the next stages of our research into massively parallel, message-passing, late-bound dynamic software systems, moving further and further away from standard Squeak while retaining backwards compatibility without being restricted by it.

We welcome anyone who wants to implement their own bytecode language systems, like Frank, Lisp, etc., and we invite people to collaborate with us on our own Squeak-based research.

Merik Voswinkel





_______________________________________________
fonc mailing list
[email protected]
http://vpri.org/mailman/listinfo/fonc
