Is it possible that your tasks are not falling evenly over the machines of your cluster, but piling up on a small number of machines?
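You can eyeball this in the JobTracker web UI by drilling into the job's map task list and looking at which machine each attempt ran on. If you want something scriptable instead, the rough sketch below tallies completed task attempts per tasktracker. It is written against the 0.18 mapred API and untested; the class name is made up, and you pass in the job id shown on the JobTracker page:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskCompletionEvent;

public class TaskSpread {
  public static void main(String[] args) throws Exception {
    // args[0] is the job id from the JobTracker page, e.g. job_200907281530_0042
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(args[0]);
    if (job == null) {
      System.err.println("JobTracker no longer knows about " + args[0]);
      return;
    }

    Map<String, Integer> perTracker = new HashMap<String, Integer>();
    int from = 0;
    while (true) {
      // completion events record which tasktracker ran each finished attempt
      TaskCompletionEvent[] events = job.getTaskCompletionEvents(from);
      if (events.length == 0) break;
      for (TaskCompletionEvent e : events) {
        String tracker = e.getTaskTrackerHttp();
        Integer n = perTracker.get(tracker);
        perTracker.put(tracker, n == null ? 1 : n + 1);
      }
      from += events.length;
    }
    for (Map.Entry<String, Integer> entry : perTracker.entrySet()) {
      System.out.println(entry.getKey() + "\t" + entry.getValue());
    }
  }
}

If one or two trackers account for most of the attempts, that points at scheduling or data placement rather than your map code.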
On Tue, Jul 28, 2009 at 3:35 PM, Scott Carey <[email protected]> wrote:
> See below:
>
>
> On 7/28/09 12:15 PM, "william kinney" <[email protected]> wrote:
>
> > Sorry, forgot to include that detail.
> >
> > Some data from ganglia:
> >
> > CPU:
> >  - on all 10 nodes, I am seeing for the life of the job 85-95% CPU
> >    usage, with about 10% of that being "System" CPU, vs "User".
> >  - Single node graph: http://imagebin.org/57520
> >  - Cluster graph: http://imagebin.org/57523
>
> Ok, CPU is definitely loaded.  Identify which processes are primarily
> responsible (Tasks?  Datanode?  Tasktracker?).  You'll want to make the
> processes eating CPU during a run spit out some stack traces to 'profile'
> the activity.  Use either the 'jstack' utility that ships with the JDK, or
> do a 'kill -3 <pid>' on a java process to spit out a stack trace to stdout.
> You'll want to do this a handful of times during a single job if possible
> to identify any trends.
>
> > Memory:
> >  - Memory used before the job is about 0.4GB. During the job it
> >    fluctuates up to 0.6GB and 0.7GB, then back down to 0.4GB. Most of
> >    the node memory (8GB) is showing as "Cached".
> >  - Single node graph: http://imagebin.org/57522
>
> So the OS is mostly just caching disk files in RAM.
>
> > Network:
> >  - IN and OUT: Each node 6-12MB/s, cumulative about 30-44MB/s.
> >  - Single node graph: http://imagebin.org/57521
> >  - Cluster graph: http://imagebin.org/57525
>
> That is not insignificant, but cumulative across the cluster it's not
> much.
>
> > iostat (disk) (sampled most of the nodes, below values are ranges I saw):
> >  tps: 0.41-1.27
> >  Blk_read/s: 46-58
> >  Blk_wrtn/s: 20-23
> >  (have two disks per node, both SAS, 10k RPM)
>
> Did you run iostat with a parameter to have it spit out more than one row?
> By default it prints data averaged since boot time, like vmstat.
> My favorite iostat params for monitoring are:
>  iostat -mx 5
>  iostat -dmx 5
> (or 10 or 15 or 60 second intervals depending on what I'm doing).  Ganglia
> might have some I/O info -- you want both iops and some sort of bytes/sec
> measurement.
>
> > ---
> > Are those Blk_read/wrtn/s as in block size (4096?) = bytes/second?
>
> I think it's the 512-byte block notion, but I always use -m to put it in
> useful units.
>
> > Also, from the job page (different job, same Map method, just more
> > data... ~40GB, 781 files):
> >  Map input records  629,738,080
> >  Map input bytes  41,538,992,880
> >
> > Anything else I can look into?
>
> Based on your other email:
>
> There are almost 800 map tasks, and these seem to mostly be data-local.
> The current implementation of the JobTracker schedules rather slowly, and
> can at best place one new task per node per 2 seconds or so on a small
> cluster.  So, with 10 servers, it will take at least 80 seconds just to
> schedule all the tasks.
> If each server can run 8 tasks concurrently, then unless the average task
> takes somewhat longer than 16 seconds, the system will not reach full
> utilization.
>
> What does the web interface tell you about the number of concurrent map
> tasks during the run?  Does it approach the max task slots?
>
> You can look at the logs for an individual task, and see how much data it
> read and how long it took.  It might be hitting your 50MB/sec or close in
> a burst, or perhaps not.
>
> Given the sort of bottlenecks I often see, I suspect the scheduling.  But
> you have almost maxed CPU use, so it's probably not that.
> Getting stack dumps to see what the processor is doing during your test
> will help narrow it down.
>
> > Do my original numbers (only 2x performance) jump out at you as being
> > way off? Or is it common to see that with a setup similar to mine?
> >
> > I should also note that given it's a custom binary format, I do not
> > support splitting (isSplittable() is false). I don't think that would
> > account for such a large discrepancy in expected performance, would it?
>
> If the files are all larger than the block size, it would cause a lot more
> network activity -- but unless your switch or network is broken or not
> gigabit, there is a lot of capacity left in the network.
>
> > Thanks for the help,
> >   Will
> >
> >
> > On Tue, Jul 28, 2009 at 12:58 PM, Scott Carey <[email protected]> wrote:
> >> Well, the first thing to do in any performance bottleneck investigation
> >> is to look at the machine hardware resource usage.
> >>
> >> During your test, what is the CPU use and disk usage?  What about
> >> network utilization?
> >> Top, vmstat, iostat, and some network usage monitoring would be useful.
> >> It could be many things causing your lack of scalability, but without
> >> actually monitoring your machines to see if there is an obvious
> >> bottleneck it's just random guessing and hunches.
> >>
> >>
> >>
> >> On 7/28/09 8:18 AM, "william kinney" <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks in advance for the help!
> >>>
> >>> I have a performance question relating to how fast I can expect Hadoop
> >>> to scale. Running Cloudera's 0.18.3-10.
> >>>
> >>> I have a custom binary format, which is just Google Protocol Buffer
> >>> (protobuf) serialized data:
> >>>
> >>>  669 files, ~30GB total size (ranging 10MB to 100MB each).
> >>>  128MB block size.
> >>>  10 Hadoop nodes.
> >>>
> >>> I tested my InputFormat and RecordReader for my input format, and it
> >>> showed about 56MB/s performance (single thread, no hadoop, passed in
> >>> the test file via FileInputFormat instead of FSDataInputStream) on
> >>> hardware similar to what I have in my cluster.
> >>> I also then tested some simple Map logic along w/ the above, and got
> >>> around 54MB/s. I believe that difference can be accounted for by
> >>> parsing the protobuf data into java objects.
> >>>
> >>> Anyway, when I put this logic into a job that has
> >>>  - no reduce (.setNumReduceTasks(0);)
> >>>  - no emit
> >>>  - just protobuf parsing calls (like above)
> >>>
> >>> I get a finish time of 10min, 25sec, which is about 106.24 MB/s.
> >>>
> >>> So my question: why is the rate only 2x what I see on a single-thread,
> >>> non-hadoop test? Would it not be:
> >>>  54MB/s x 10 (Num Nodes) - small hadoop overhead ?
> >>>
> >>> Is there any area of my configuration I should look into for tuning?
> >>>
> >>> Any way I could get more accurate performance monitoring of my job?
> >>>
> >>> On a side note, I tried the same job after combining the files into
> >>> about 11 files (still 30GB in size), and actually saw a decrease in
> >>> performance (~90MB/s).
> >>>
> >>> Any help is appreciated. Thanks!
> >>>
> >>>   Will
> >>>
> >>> some hadoop-site.xml values:
> >>>  dfs.replication 3
> >>>  io.file.buffer.size 65536
> >>>  dfs.datanode.handler.count 3
> >>>  mapred.tasktracker.map.tasks.maximum 6
> >>>  dfs.namenode.handler.count 5
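Will, for reference, a map-only job with a non-splittable custom input format like the one you describe comes out roughly like the outline below on the 0.18 mapred API. The class names are invented and the protobuf RecordReader is stubbed out, so treat it as a sketch of the job shape rather than working code:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class ProtobufScanJob {

  // Non-splittable input format: each file becomes exactly one map task.
  public static class ProtobufInputFormat
      extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;  // note the actual API spelling: isSplitable
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> getRecordReader(
        InputSplit split, JobConf conf, Reporter reporter) throws IOException {
      // the custom protobuf RecordReader would be returned here
      throw new UnsupportedOperationException("RecordReader elided");
    }
  }

  // Mapper that only parses records and emits nothing.
  public static class ParseOnlyMapper extends MapReduceBase
      implements Mapper<LongWritable, BytesWritable, NullWritable, NullWritable> {
    public void map(LongWritable key, BytesWritable value,
                    OutputCollector<NullWritable, NullWritable> out,
                    Reporter reporter) throws IOException {
      // parse the protobuf bytes into a java object; nothing is collected
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ProtobufScanJob.class);
    conf.setJobName("protobuf-parse-only");
    conf.setInputFormat(ProtobufInputFormat.class);
    conf.setMapperClass(ParseOnlyMapper.class);
    conf.setNumReduceTasks(0);                     // map-only job
    conf.setOutputFormat(NullOutputFormat.class);  // no output files
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(NullWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    JobClient.runJob(conf);
  }
}

One side effect of turning splitting off is that each file becomes exactly one map task, so task granularity is set entirely by your file sizes. That may be part of why combining everything into ~11 files got slower: 11 non-splittable files means only 11 map tasks spread across 60 map slots (10 nodes x mapred.tasktracker.map.tasks.maximum = 6).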
--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com, a community for Hadoop Professionals