I would set the DOP to the number of cores.
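For reference, a cluster-wide default DOP could be set in flink-conf.yaml. The key name below is my recollection of the 2014-era incubator releases (it was later renamed), so please verify it against your version's documentation; the value shown is just an illustration:

```yaml
# flink-conf.yaml -- assumed key name for the 0.5/0.6-era incubator releases
parallelization.degree.default: 20   # e.g. 10 machines x 2 cores
```

Alternatively, the DOP could be set per job on the ExecutionEnvironment, which overrides the config default.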
On Tue, Jul 8, 2014 at 9:42 AM, Kruse, Sebastian <sebastian.kr...@hpi.de> wrote:

> Hi,
>
> I admit, it really is already quite a lot :)
>
> However, my task at hand is inclusion dependency detection on CSV files,
> and the number of such files in real-world datasets is sometimes even
> higher. Since each file can have a different number of columns, and since I
> need to distinguish the columns from all files, I am starting a source per
> file.
>
> How would you recommend setting the DOP for a cluster? Number of machines?
> Number of cores? Number of cores * 2?
>
> Cheers,
> Sebastian
>
> -----Original Message-----
> From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On Behalf Of Stephan Ewen
> Sent: Monday, 7 July 2014 19:44
> To: dev@flink.incubator.apache.org
> Subject: Re: Hardware Requirements
>
> Hi!
>
> Okay, 100 concurrent data sources is quite a lot ;-)
>
> Do you start a source per file? You can start a source per directory,
> which will take all files in the directory...
>
> Stephan
>
>
> On Mon, Jul 7, 2014 at 7:41 PM, Kruse, Sebastian <sebastian.kr...@hpi.de> wrote:
>
> > Thanks for your answers. Based on what you say, I guess the scaling
> > problem in my program is the number of data sources. This number is
> > variable and can go beyond 100 (I am analyzing data dumps). Maybe the
> > number of shuffles or something similar grows with the number of
> > sources, or the sources simply inflate the plan. That would explain
> > why the execution fails for the larger datasets.
> >
> > I am running 10 TaskManagers. Since these have dual-core CPUs, I chose
> > 20 as the DOP, and was even thinking about 40 for latency hiding. What
> > DOP would you suggest for this setting (disregarding the buffer
> > limitation)?
> >
> > Pertaining to the number of concurrent shuffles, I would also like to
> > know what causes a shuffle. Reduces, cogroups, and joins? And what
> > about unions?
> >
> > If you are interested, I can play around a little bit more with the
> > settings by the end of this week and report to you under which
> > circumstances the execution fails or passes.
> > (Update: the program just passed with 16000 buffers and a DOP of 10)
> >
> > Cheers,
> > Sebastian
> >
> >
> > -----Original Message-----
> > From: Ufuk Celebi [mailto:u.cel...@fu-berlin.de]
> > Sent: Sunday, 6 July 2014 14:30
> > To: dev@flink.incubator.apache.org
> > Subject: Re: Hardware Requirements
> >
> > Hey Sebastian,
> >
> > did you already try to increase the number of buffers in accordance with
> > Stephan's suggestion? The current defaults for the number and size of
> > network buffers are 2048 and 32768 bytes, resulting in 64 MB of memory
> > for the network buffers.
> >
> > Out of curiosity: on how many machines are you running your job, and
> > what parallelism did you set for your program?
> >
> > Best,
> >
> > Ufuk
> >
> > On 04 Jul 2014, at 15:46, Kruse, Sebastian <sebastian.kr...@hpi.de> wrote:
> >
> > > Hi everyone,
> > >
> > > I apologize in advance if this is not the right mailing list for my
> > > question. If there is a better place for it, please let me know.
> > >
> > > Basically, I wanted to ask if you have some statement about the
> > > hardware requirements of Flink to process larger amounts of data,
> > > beginning from, say, 20 GB. Currently, I am facing issues in my jobs,
> > > e.g., there are not enough buffers for safe execution of some
> > > operations. Since the machines that run my TaskManagers unfortunately
> > > have very limited main memory, I cannot increase the number of buffers
> > > (and heap space in general) too much. Currently, I assigned them 1.5 GB.
> > >
> > > So, the exact questions are:
> > >
> > > * Do you have experience with a suitable HW setup for crunching
> > >   larger amounts of data, maybe from the TU cluster?
> > >
> > > * Are there any configuration tips you can provide, e.g.,
> > >   pertaining to the buffer configuration?
> > >
> > > * Are there any general statements on the growth of Flink's memory
> > >   requirements w.r.t. the size of the input data?
> > >
> > > Thanks for your help!
> > > Sebastian
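To raise the buffer count that Ufuk mentions, the relevant flink-conf.yaml entries in that era looked roughly like the sketch below. The key names are my recollection of the incubator releases (they changed in later versions), so check the documentation for your version before relying on them:

```yaml
# Assumed key names from the 2014 incubator releases -- verify before use.
taskmanager.network.numberOfBuffers: 16384     # default was 2048
taskmanager.network.bufferSizeInBytes: 32768   # default was 32768 (32 KB)
```

Note that 16384 buffers of 32 KB each come to 512 MB per TaskManager, which is a significant share of a 1.5 GB heap; the buffer count and the heap size need to be tuned together.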
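The numbers in the thread can be checked with some back-of-the-envelope arithmetic. The memory figures follow directly from the thread; the per-shuffle scaling rule at the end is an assumption on my part (each sending task holding buffers for every receiving task, so demand grows roughly with DOP squared per shuffle), not something stated in this thread:

```python
# Back-of-the-envelope sizing for Flink's network buffer pool.

def buffer_memory_mb(num_buffers, buffer_size=32768):
    """Total memory consumed by the network buffer pool, in MB."""
    return num_buffers * buffer_size // (1024 * 1024)

# The defaults Ufuk quotes: 2048 buffers of 32 KB each.
print(buffer_memory_mb(2048))    # -> 64 (MB), matching the thread

# Sebastian's passing run: 16000 buffers.
print(buffer_memory_mb(16000))   # -> 500 (MB)

# Assumed rule of thumb (not from this thread): a shuffle needs on the
# order of DOP * DOP buffers, times a small constant safety factor.
def buffers_per_shuffle(dop, factor=4):
    return dop * dop * factor

print(buffers_per_shuffle(10))   # -> 400
print(buffers_per_shuffle(40))   # -> 6400
```

Under that assumption, a plan with 100 sources and several shuffles at DOP 40 would plausibly exhaust the default pool of 2048 buffers, while DOP 10 with 16000 buffers leaves ample headroom, which is consistent with the reported results.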