(Sending this again since the previous attempt seems to have bounced.)

Hi folks,
Like all of you, we are excited to use Mesos to manage thousands of different applications on our large-scale clusters. As the number of applications and hosts keeps increasing, we are growing more curious about the potential scalability limits and bottlenecks of Mesos' centralized architecture, and about how robust it is in the face of various failures. If we can identify these in advance, we can probably manage and optimize around them before we suffer any performance degradation.

To explore Mesos' capabilities and close this knowledge gap, we propose to evaluate Mesos scalability and robustness through stress testing. The draft proposal can be found at: <https://docs.google.com/document/d/10kRtX4II74jfUuHJnX2F5teqpXzHYFQAZGWjCdS3cZA/edit?usp=sharing>. Please feel free to leave your suggestions and feedback as comments on the draft.

Many of you probably have similar questions. We will be happy to share our findings from these experiments with the Mesos community. Please stay tuned.

--
Cheers,
Ao Ma & Zhitao Li