@Greg: Can you describe at which points the JobManager struggled heavily? My guess is that it happens at some point during deployment, i.e. that deployment takes longer than you expected.
On Wed, Oct 21, 2015 at 10:14 AM, Maximilian Michels <m...@apache.org> wrote:

> Hi Greg,
>
> It would be very interesting to do a profiling of the job master to
> see what it mostly spends time on. Did you run your experiments with
> 0.9.X or the 0.10-SNAPSHOT? Would be interesting to know if there is a
> regression.
>
> Best,
> Max
>
> On Wed, Oct 21, 2015 at 10:08 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>
> > Hi Greg,
> >
> > there is no official guide for running Flink on large clusters. As far as I
> > know, the cluster we used for the matrix factorization was the largest
> > cluster we've run a serious job on. Thus, it would be highly interesting to
> > understand what made the JobManager to slow down. At some point, though,
> > this should happen since the JobManager always stays a single instance. Do
> > you have by chance access to the JobManager log file? This might be helpful.
> >
> > Thanks for your help,
> > Till
> >
> > On Tue, Oct 20, 2015 at 11:06 PM, Greg Hogan <c...@greghogan.com> wrote:
> >
> >> Is there guidance for configuring Flink on large clusters? I have recently
> >> been working to benchmark some algorithms on and test AWS. I had no issues
> >> running on a 16 node cluster but when moving to 64 nodes the JobManager
> >> struggled mightily. It did not look to be parallelizing its workload. I was
> >> in the process of modifying my code to reduce the parallelism of earlier,
> >> smaller operations when I lost the cluster due to a spot price increase.
> >>
> >> The instances were c3.8xlarge and in the larger cluster one instance hosted
> >> the JobManager so the parallelism was 63 * 32 = 2016. The small cluster had
> >> parallelism of 512.
> >>
> >> I have seen the blog posts describing the performance of 640 core clusters
> >> on GCE. Is this a known limitation or can Flink scale much further?
> >>
> >> http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/
> >>
> >> http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/
> >>
> >> Thanks,
> >> Greg