Hi Greg,

There is no official guide for running Flink on large clusters. As far as I
know, the cluster we used for the matrix factorization was the largest
cluster we've run a serious job on, so it would be highly interesting to
understand what made the JobManager slow down. At some point this is bound
to happen, though, since the JobManager is always a single instance. Do you
by any chance have access to the JobManager log file? It might be helpful.
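In the meantime, a few flink-conf.yaml settings usually have to grow with
the cluster. The numbers below are only a rough, untested sketch (guesses
for c3.8xlarge-sized machines, not recommendations), but these are the
knobs I would look at first:

  # flink-conf.yaml -- rough starting points, tune for your setup
  jobmanager.heap.mb: 4096                    # single JVM, grows with cluster size
  taskmanager.network.numberOfBuffers: 16384  # shuffles need many more buffers at high parallelism
  akka.ask.timeout: 100 s                     # RPC timeouts are easier to hit with many TaskManagers
  taskmanager.numberOfTaskSlots: 32           # one slot per core
  parallelism.default: 2016                   # 63 workers * 32 slots

If I remember correctly, the rule of thumb for the network buffers is about
slots-per-TaskManager^2 * number-of-TaskManagers * 4, so the 64 node setup
needs considerably more of them than the 16 node one.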

Thanks for your help,
Till

On Tue, Oct 20, 2015 at 11:06 PM, Greg Hogan <c...@greghogan.com> wrote:

> Is there guidance for configuring Flink on large clusters? I have recently
> been working to benchmark and test some algorithms on AWS. I had no issues
> running on a 16 node cluster, but when moving to 64 nodes the JobManager
> struggled mightily. It did not appear to be parallelizing its workload. I
> was in the process of modifying my code to reduce the parallelism of
> earlier, smaller operations when I lost the cluster due to a spot price
> increase.
>
> The instances were c3.8xlarge and in the larger cluster one instance hosted
> the JobManager so the parallelism was 63 * 32 = 2016. The small cluster had
> parallelism of 512.
>
> I have seen the blog posts describing the performance of 640 core clusters
> on GCE. Is this a known limitation or can Flink scale much further?
>
>
> http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/
>
>
> http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/
>
> Thanks,
> Greg
>
