Hi Ed, Just so it doesn't seem to everyone that I'm ignoring your concerns, here's a summary that I think covers the main points in your document:
ForkJoinPool is an ExecutorService (as are others in java.util.concurrent, such as ThreadPoolExecutor). It differs from others mainly by virtue of using work-stealing, which is a decentralized scheduling technique that best applies to directed acyclic parallel computations. Work-stealing has evolved since the 1990s, and is now used in many lightweight task frameworks; including for example the custom scheduler used in hotspot's parallel garbage collectors. The main twist in FJ compared to most others is linking work-stealing to the use of Futures (ForkJoinTasks), which provides a more accessible and pleasant basis for creating applications and layered libraries. (Although FJ can also be used in other ways, for example with completion callbacks, quiescence monitoring, etc rather the Future-based join operation). The framework is more complicated internally than most Executors because it must efficiently manage decentralized state to support these diverse usages. For an academic/hacker, this was the fun part. The ForkJoin framework has comparable support for errors and unexpected behavior as other ExecutorServices. The main differences are that the specs are clearer about the conditions under which programs can stall (i.e., when they are stuck on IO or are not DAG-structured), that we include some (evadable) nuisances (like not allowing IOExecptions) to encourage appropriate use, and we provide a means by which advanced users can further intervene (via ManagedBlocker) The performance of the framework is pretty good (and improves over time). One nice property of (some, not all) work-stealing based Fork/Join computations is that they are have good locality and so their performance is not very sensitive to the many ways that caches can be arranged on multicores and multiprocessors. Also, while I hope to avoid benchmark arguments, I note that you cite only a comparison of two completely different algorithms. The FJ demo uses a variant the Blelloch parallel prefix algorithm, that I find to outperform statically partitioned ones (as I believe you used) only with more than 8 cores or so. If you think I've missed any of your concerns, or you (or anyone else) have other concerns, please feel free to ask. -Doug