On Tue, Nov 12, 2013 at 11:07 AM, Stephen Haberman <stephen.haber...@gmail.com> wrote:
> Huge disclaimer that this is probably a big pita to implement, and
> could likely not be as worthwhile as I naively think it would be.

My perspective on this is that it's already a big pita for Spark users today. In the absence of explicit directions/hints, Spark should be able to make ballpark estimates and conservatively pick the number of partitions, storage strategies (e.g., memory vs. disk), and other runtime parameters that fit the deployment architecture and capacities. If this requires extra code and runtime resources for sampling/measuring data, guesstimating job size, and so on, so be it. Users want working jobs first; optimal performance and resource utilization follow from that.
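To make the idea concrete, here is a minimal sketch of what that kind of sample-then-guess logic might look like in user code today. The object and method names (PartitionGuess, guessNumPartitions, targetPartitionBytes) are purely illustrative, and SizeEstimator's visibility has varied across Spark versions, so treat this as an assumption-laden sketch rather than a proposed API:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

object PartitionGuess {
  // Roughly estimate a sensible partition count by sampling a handful of
  // records, measuring their approximate in-memory size, and dividing the
  // projected dataset size by a target partition size.
  def guessNumPartitions[T](rdd: RDD[T],
                            sampleSize: Int = 1000,
                            targetPartitionBytes: Long = 128L * 1024 * 1024): Int = {
    val sample = rdd.take(sampleSize)
    if (sample.isEmpty) return rdd.partitions.length

    // SizeEstimator walks the object graph to approximate JVM memory usage
    // (it may be private[spark] in older releases).
    val bytesPerRecord = SizeEstimator.estimate(sample.toSeq) / sample.length
    // Extra pass over the data, which the argument above accepts as a cost.
    val totalRecords = rdd.count()
    val projectedBytes = bytesPerRecord * totalRecords

    math.max(1, (projectedBytes / targetPartitionBytes).toInt)
  }
}

// Hypothetical usage: conservatively repartition before an expensive stage.
// val tuned = rdd.repartition(PartitionGuess.guessNumPartitions(rdd))
```

The point is not this particular heuristic, but that Spark itself could run something like it automatically when the user has given no hints.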