On Mon, Mar 7, 2016 at 5:58 AM, Justin Y. Shi <[email protected]> wrote:
> Peter:
>
> Thanks for the questions.
>
> It has been theoretically proved that reliable communication is impossible in the face of [either sender or receiver] crashes. Therefore, any parallel or distributed computing API that forces the runtime system to generate fixed program-processor assignments is theoretically incorrect. This answer is also related to your second question: the impossibility means 100% reliable communication is impossible.
>
> Ironically, 100% reliable packet transmission is theoretically and practically possible, as proved by John and Nancy for John's dissertation. These two seemingly conflicting results are in fact complementary. They basically say that distributed and parallel application programming cannot rely on reliable packet transmission, as all of our current distributed and parallel programming APIs assume.
>
> Thus, MPI cannot be cost-effective in proportion to reliability, because of the impossibility. The same applies to all other APIs that allow direct program-program communication. We have found that <key, value> APIs are the only exceptions, for they allow the runtime system to generate dynamic program-device bindings, as in Hadoop and Spark. To solve the problem completely, the application programming logic must include the correct retransmission discipline. I call this Statistic Multiplexed Computing, or SMC. The Hadoop and Spark implementations did not go this far. If we do complete the paradigm shift, then there will be no single point of failure regardless of how the application scales. This claim covers all computing and communication devices. This is the ultimate extreme-scale computing paradigm.
>
> These answers are rooted in statistical multiplexing protocol research (packet switching). It has been proved in theory and practice that 100% reliable and scalable communication is indeed possible. Since all HPC applications must deploy a large number of computing units via some sort of interconnect (HP's The Machine may be the only exception), the only correct APIs for extreme-scale HPC are the ones that allow complete program-processor decoupling at runtime. Even the HP machine would benefit from this research. Please note that the 100% reliability is conditioned on the availability of the "minimal viable set of resources." In computing and communication, the minimal set size is 1 for every critical path.
>
> My critics argued that there is no way a statistically multiplexed computing runtime can compete against bare-metal programs, such as MPI. We have evidence to prove the opposite. In fact, the SMC runtime allows dynamic adjustment of processing granularity without reprogramming. We can demonstrate faster performance not only with heterogeneous processors but also with homogeneous ones. We see this capability as critical for extracting efficiency out of HPC clouds.
I always get lost in the fancy words of research papers. Is the source to any of this open? How could a normal guy like me reproduce or independently verify your results?

I'm not sure at what level you're talking sometimes. At the network level we have things like TCP (instead of UDP) for ensuring packet-level reliability; at the data level we have ACID-compliant databases for storage. There is a lot of technology on the "web" side that is commonly used and required, since the "internet" is inherently unstable/unreliable.

In my mind, "exascale" machines will need to be programmed with a more open view of what is or isn't reliable. Should we continue to model everything around the communication, or switch focus to resolving data dependencies and locality?
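To check my own reading of the <key, value> argument, here is a toy sketch of what I think the "retransmission discipline" amounts to (Python; every name in it is my own guess for illustration, not from any SMC, Hadoop, or Spark code): a master posts <key, value> work units into a pool that any worker may claim, and it simply re-posts any unit whose answer hasn't come back by a deadline, so a crashed or slow worker costs a retry rather than the whole job.

# Toy sketch: timeout-based re-posting over a <key, value> work pool.
# All names here are my own guesses at the idea (not SMC/Hadoop/Spark code).
# Worker threads stand in for machines; a random "crash" drops an answer,
# and the master re-posts any key still unanswered at the deadline.

import queue
import random
import threading
import time

TIMEOUT = 0.5             # seconds before an unanswered key is re-posted
tasks = queue.Queue()     # <key, value> units, claimable by any free worker
results = {}              # key -> answer, written by whichever worker finishes
lock = threading.Lock()

def worker():
    while True:
        key, value = tasks.get()
        if random.random() < 0.3:   # simulated crash: the answer is lost
            continue
        answer = value * value      # stand-in for the real computation
        with lock:
            results[key] = answer   # idempotent: duplicates write the same value

def run(work):
    for _ in range(4):              # four workers; the exact count doesn't matter
        threading.Thread(target=worker, daemon=True).start()
    pending = dict(work)
    while pending:
        for key, value in pending.items():
            tasks.put((key, value)) # (re-)post every still-unanswered unit
        time.sleep(TIMEOUT)
        with lock:
            for key in list(pending):
                if key in results:  # answered somewhere: retire the key
                    del pending[key]
    return results

print(run({i: i for i in range(10)}))

If that's roughly the shape of it, I'd still like to see the real code.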
