Just to make sure that there is no confusion: with Aljoscha's refactoring, the Scala API will be a thin layer on top of the Java API and should have comparable, if not identical, performance (one difference is that Scala tuples are immutable, whereas Java tuples are mutable, so instances can be reused).
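To illustrate the tuple difference, here is a minimal Scala sketch (the JTuple2 alias is just a local rename of Flink's Java Tuple2 for this example; the reuse pattern is a common optimization, not code from either API):

    // Scala tuples are immutable: every "update" allocates a new instance.
    val s1 = ("word", 1)
    val s2 = s1.copy(_2 = s1._2 + 1)  // s2 is a new object; s1 is unchanged

    // Flink's Java tuples are mutable, so a single instance can be reused
    // across records to avoid per-record allocations:
    import org.apache.flink.api.java.tuple.{Tuple2 => JTuple2}
    val j = new JTuple2[String, Integer]("word", 1)
    j.f1 = j.f1 + 1  // mutates in place; no new object is created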
I don't want someone in the future reading this thread to think that the Scala API incurs a large performance hit. ;)

On Mon, Sep 8, 2014 at 5:13 PM, Aljoscha Krettek <[email protected]> wrote:
> Ok.
>
> My work is available here:
> https://github.com/aljoscha/incubator-flink/tree/scala-rework
>
> Please look at the WordCount and KMeans examples to see how the API has
> changed; basically, only the way you create Data Sources is different.
>
> I'm looking forward to your feedback. :D
>
> On Mon, Sep 8, 2014 at 4:22 PM, Norman Spangenberg
> <[email protected]> wrote:
> > I tried different values for numberOfTaskSlots (1, 2, 4, 8) and the DOP
> > to optimize Flink.
> > @Aljoscha: it would be great to try out the new Scala API for Flink. I
> > have already written some other apps in Scala, so I wouldn't have to
> > rewrite them.
> >
> > On 08.09.2014 16:13, Robert Metzger wrote:
> >
> >> There is probably a little typo in Aljoscha's answer:
> >> taskmanager.numberOfTaskSlots should be 8 (there are 8 cores per
> >> machine). The parallelization.degree.default is correct.
> >>
> >> On Mon, Sep 8, 2014 at 4:09 PM, Aljoscha Krettek <[email protected]>
> >> wrote:
> >>
> >>> Hi Norman,
> >>> I saw you were running our Scala examples. Unfortunately, those do not
> >>> run as well as our Java examples right now. The Scala API was a bit of
> >>> a prototype that has some issues with efficiency. For now, you could
> >>> maybe try running our Java examples.
> >>>
> >>> For your cluster, good configuration values would be numberOfTaskSlots
> >>> = 4 (number of CPU cores) and parallelization.degree.default = 32
> >>> (number of nodes x number of CPU cores).
> >>>
> >>> The Scala API is being rewritten for our next release, so if you
> >>> really want to check out Scala examples I could point you to my
> >>> personal branch on GitHub where development of the new Scala API is
> >>> taking place.
> >>>
> >>> Cheers,
> >>> Aljoscha
> >>>
> >>> On Mon, Sep 8, 2014 at 2:48 PM, Norman Spangenberg
> >>> <[email protected]> wrote:
> >>>> Hello,
> >>>> I'm a bit confused about the performance of Flink.
> >>>> My cluster consists of 4 nodes, each with 8 cores and 16 GB of memory
> >>>> (1.5 GB reserved for the OS), running flink-0.6 in standalone-cluster
> >>>> mode. I played a little bit with the config settings, but without
> >>>> much impact on execution time.
> >>>> flink-conf.yaml:
> >>>> jobmanager.rpc.port: 6123
> >>>> jobmanager.heap.mb: 1024
> >>>> taskmanager.heap.mb: 14336
> >>>> taskmanager.memory.size: -1
> >>>> taskmanager.numberOfTaskSlots: 4
> >>>> parallelization.degree.default: 16
> >>>> taskmanager.network.numberOfBuffers: 4096
> >>>> fs.hdfs.hadoopconf: /opt/yarn/hadoop-2.4.0/etc/hadoop/
> >>>>
> >>>> I tried two applications: the WordCount and k-means Scala example
> >>>> code. WordCount needs 5 minutes for 25 GB and 13 minutes for 50 GB.
> >>>> k-means (10 iterations) needs 86 seconds for 56 MB of input, but with
> >>>> 1.1 GB of input it needs 33 minutes, and with 2.2 GB nearly 90
> >>>> minutes!
> >>>>
> >>>> The monitoring tool Ganglia shows low CPU utilization and a lot of
> >>>> waiting time; in WordCount, CPU utilization is nearly 100 percent.
> >>>> Is this an ordinary order of magnitude for execution time in Flink?
> >>>> Or are optimizations in my config necessary? Or maybe there is a
> >>>> bottleneck in the cluster?
> >>>>
> >>>> I hope somebody can help me :)
> >>>> Greets, Norman
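For future readers: applying Robert's correction (8 slots per 8-core machine) and Aljoscha's parallelism formula (4 nodes x 8 cores = 32) to Norman's posted config would give the following flink-conf.yaml. This is just the thread's advice spelled out, not a tested configuration:

    jobmanager.rpc.port: 6123
    jobmanager.heap.mb: 1024
    taskmanager.heap.mb: 14336
    taskmanager.memory.size: -1
    taskmanager.numberOfTaskSlots: 8       # was 4; one slot per core (Robert)
    parallelization.degree.default: 32     # 4 nodes x 8 cores (Aljoscha)
    taskmanager.network.numberOfBuffers: 4096
    fs.hdfs.hadoopconf: /opt/yarn/hadoop-2.4.0/etc/hadoop/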
