On Thu, Jul 30, 2009 at 11:39 AM, Scott Carey <[email protected]> wrote:
> Use the deadline scheduler:
> # echo 'deadline' > /sys/block/sda/queue/scheduler   (for each device)

Have you found the deadline scheduler to be significantly better than the
default cfq? I've used deadline for RDBMS workloads, but I wouldn't imagine
the difference would be as distinct for Hadoop.

-Todd

> I have found that XFS is also superior to ext3 with concurrent reads or
> writes, if tuned right.
> Mount options: noatime,allocsize=32m,logbufs=8
> But since you seem CPU bound and not I/O bound, this probably isn't an
> issue.
>
> If even a single task on a single large file is slower in MB/sec than your
> test program, then I suspect read/write buffer issues or misuse somewhere.
> (Just a SWAG, though.)
>
> -Scott
>
> > Thanks!
> > Will
> >
> > On Thu, Jul 30, 2009 at 10:37 AM, william kinney
> > <[email protected]> wrote:
> >> Local is executed on a Hadoop node (when no job is running), so same
> >> JRE/hardware.
> >>
> >> JRE:
> >> java version "1.6.0_13"
> >> Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
> >> Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)
> >>
> >> JVM arguments for child task:
> >> /usr/java/jdk1.6.0_13/jre/bin/java
> >> -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work
> >> -Xmx486m
> >> -Djava.io.tmpdir=/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work/tmp
> >> -classpath /etc/hadoop/conf:/usr/java/default/lib/tools.jar:/usr/lib/hadoop:/usr/lib/hadoop/hadoop-0.18.3-10.cloudera.CH0_3-core.jar:/usr/lib/hadoop/lib/commons-cli-2.0-SNAPSHOT.jar:/usr/lib/hadoop/lib/commons-codec-1.3.jar:/usr/lib/hadoop/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop/lib/commons-net-1.4.1.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-fairscheduler.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-scribe-log4j.jar:/usr/lib/hadoop/lib/hsqldb.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/hadoop/lib/jetty-5.1.4.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/hadoop/lib/kfs-0.1.3.jar:/usr/lib/hadoop/lib/log4j-1.2.15.jar:/usr/lib/hadoop/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/hadoop/lib/servlet-api.jar:/usr/lib/hadoop/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib/hadoop/lib/jetty-ext/commons-el.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-compiler.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-runtime.jar:/usr/lib/hadoop/lib/jetty-ext/jsp-api.jar::/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/jars/classes:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/jars:/disk2/hadoop/mapred/local/taskTracker/archive/master.ph-cloud.detica.us.com/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/ph-gpb-1.0.5.jar/ph-gpb-1.0.5.jar:/disk2/hadoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/protobuf-java-2.0.3.jar/protobuf-java-2.0.3.jar:/disk1/hadoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/commons-lang-2.4.jar/commons-lang-2.4.jar:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work
> >> -Dhadoop.log.dir=/var/log/hadoop -Dhadoop.root.logger=INFO,TLA
> >> -Dhadoop.tasklog.taskid=attempt_200907242015_0048_m_000008_0
> >> -Dhadoop.tasklog.totalLogFileSize=0
> >> org.apache.hadoop.mapred.TaskTracker$Child 127.0.0.1 52698
> >> attempt_200907242015_0048_m_000008_0
> >>
> >> child task heap: -Xmx486m
> >>
> >> (mapred.child.ulimit is 995328)
> >>
> >> Local call has no JVM arguments, just:
> >> java -cp <myjar>.jar com......RecordReaderTest <fileToTest>
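> >>
> >> (The harness is nothing fancy -- roughly the shape below. This is an
> >> illustrative sketch, not the real class: the real loop feeds records
> >> through our protobuf RecordReader instead of just counting bytes.)
> >>
> >> import java.io.BufferedInputStream;
> >> import java.io.FileInputStream;
> >>
> >> public class RecordReaderTest {
> >>   public static void main(String[] args) throws Exception {
> >>     BufferedInputStream in = new BufferedInputStream(
> >>         new FileInputStream(args[0]), 65536);
> >>     long bytes = 0;
> >>     long start = System.currentTimeMillis();
> >>     byte[] buf = new byte[65536];
> >>     for (int n = in.read(buf); n != -1; n = in.read(buf)) {
> >>       bytes += n;  // record parsing happens here in the real test
> >>     }
> >>     in.close();
> >>     long ms = System.currentTimeMillis() - start;
> >>     System.out.println(bytes + " bytes in : " + ms + " ms. ("
> >>         + (bytes / 1048576.0) / (ms / 1000.0) + " MB/s)");
> >>   }
> >> }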
> >>
> >> Hardware platform:
> >> Linux 2.6.27.24-170.2.68.fc10.x86_64
> >> 8GB RAM
> >> Dual Core Xeon E5205
> >> 2 x 300GB SAS 10k, no RAID
> >> (HP ProLiant blade)
> >>
> >> Data is not compressed.
> >>
> >> JobTracker:
> >> Running: started around 20, but as the job progressed it slowly
> >> increased, reaching 432 at the end (when Pending was 0). Running dropped
> >> to 0 and Status was marked Succeeded about 10 seconds after that. Is
> >> this normal? The total # of Tasks was 1449.
> >>
> >> Stack traces:
> >> Looked at about 20 stack traces from 2 different nodes. Consistently saw:
> >> 2 x org.apache.hadoop.dfs.DFSClient$LeaseChecker @ Thread.sleep()
> >> "Comm thread for attempt_200907242015_0050_m_001409_0" @ Thread.sleep()
> >> "IPC Client (47) connection to <master-hostname>/192.168.1.100:8020
> >> from wkinney" @ Object.wait()
> >> "IPC Client (47) connection to /127.0.0.1:49202 from an unknown user"
> >> @ Object.wait()
> >> VM, GC, Signal Dispatcher, Low Memory Detector, CompilerThread,
> >> Finalizer, Reference Handler...
> >>
> >> Then I would sometimes see a FastDateFormat thread, parseFrom(), or
> >> somewhere near there (e.g. MapRunner.run()).
> >>
> >> Finally, I consistently saw this:
> >> "Thread-5" daemon prio=10 tid=0x0000000040bbfc00 nid=0x2f87 in
> >> Object.wait() [0x00007fb7498ce000..0x00007fb7498cebf0]
> >>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >>     at java.lang.Object.wait(Native Method)
> >>     - waiting on <0x00007fb769fdec00> (a java.util.LinkedList)
> >>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1905)
> >>     - locked <0x00007fb769fdec00> (a java.util.LinkedList)
> >> I'm guessing this is normal DataNode activity...
> >>
> >> Will
> >>
> >> On Thu, Jul 30, 2009 at 1:31 AM, Scott Carey <[email protected]> wrote:
> >>> What is the JRE for the Hadoop nodes versus local? What are the JVM
> >>> arguments for the child tasks and the local version (and heap size)?
> >>> What are the hardware and platform details for the nodes versus the
> >>> local test? Is the data compressed in Hadoop (check the config)?
> >>>
> >>> You mention the TaskTracker web interface during a job, but what about
> >>> the JobTracker interface? This should show the global view of currently
> >>> scheduled maps versus total slots.
> >>>
> >>> Lastly, check out some more stack traces on the tasks. Are they all
> >>> still in the DateFormat stuff? Surely some of them should be in your
> >>> parseFrom() method too?
> >>>
> >>> On 7/29/09 9:07 PM, "william kinney" <[email protected]> wrote:
> >>>
> >>> OK: implemented some iotop/iostat monitoring in ganglia. Looks pretty
> >>> standard (job was 23:00 to 23:06):
> >>> - Single Node Disk Read: http://imagebin.org/57716
> >>> - Single Node Disk Write: http://imagebin.org/57717
> >>>
> >>> On each node, noticed that the two TaskTracker$Child processes were
> >>> consuming close to 90% of each core. The TaskTracker and DataNode were
> >>> close to 0%. For the TT children, I did jstack dumps, but didn't
> >>> really see much that popped out other than a lot of time spent in a
> >>> SimpleDateFormat section and the protobuf parse. I switched the SDF
> >>> out for commons.lang FastDateFormat, which reduced the total time for
> >>> both the Hadoop job and the local/non-hadoop test, so there is still a
> >>> discrepancy between local and hadoop runs.
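> >>>
> >>> (The swap was essentially the following -- sketch with an illustrative
> >>> pattern string; FastDateFormat comes from commons-lang, which is
> >>> already on the job classpath:)
> >>>
> >>> long millis = System.currentTimeMillis();
> >>>
> >>> // Before: SimpleDateFormat is not thread-safe and relatively slow:
> >>> java.text.SimpleDateFormat sdf =
> >>>     new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
> >>> String before = sdf.format(new java.util.Date(millis));
> >>>
> >>> // After: FastDateFormat is immutable and thread-safe, so one shared
> >>> // instance can format from every map thread:
> >>> org.apache.commons.lang.time.FastDateFormat fdf =
> >>>     org.apache.commons.lang.time.FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss");
> >>> String after = fdf.format(millis);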
> >>>
> >>> "You can look at the logs for an individual task, and see how much data
> >>> it read, and how long it took. It might be hitting your 50MB/sec or
> >>> close in a burst, or perhaps not."
> >>> - I decided to log the performance of each RecordReader use within
> >>> hadoop, which is essentially 1:1 with the TaskTracker$Child processes
> >>> since I have 1 InputSplit per file (i.e., no splitting), right? Saw:
> >>> Example 1) 527639090 bytes in : 18050 ms. (27.8778 MB/s)
> >>> Example 2) 533770314 bytes in : 23494 ms. (21.6669 MB/s)
> >>> Example 3) 529711886 bytes in : 20092 ms. (25.1429 MB/s)
> >>> ...etc
> >>> For reference, the non-hadoop/local test:
> >>> 530710906 bytes in : 9133 ms. (55.41721 MB/s)
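> >>>
> >>> (The logging is just a thin wrapper around the real reader -- rough
> >>> sketch against the 0.18 mapred API; TimingRecordReader is an
> >>> illustrative name, and getPos() counts bytes here only because each
> >>> split is a whole file starting at offset 0:)
> >>>
> >>> import java.io.IOException;
> >>> import org.apache.hadoop.mapred.RecordReader;
> >>>
> >>> public class TimingRecordReader<K, V> implements RecordReader<K, V> {
> >>>   private final RecordReader<K, V> inner;
> >>>   private final long start = System.currentTimeMillis();
> >>>
> >>>   public TimingRecordReader(RecordReader<K, V> inner) { this.inner = inner; }
> >>>
> >>>   public boolean next(K key, V value) throws IOException { return inner.next(key, value); }
> >>>   public K createKey() { return inner.createKey(); }
> >>>   public V createValue() { return inner.createValue(); }
> >>>   public long getPos() throws IOException { return inner.getPos(); }
> >>>   public float getProgress() throws IOException { return inner.getProgress(); }
> >>>
> >>>   public void close() throws IOException {
> >>>     long bytes = getPos();  // whole-file split, so pos == bytes read
> >>>     long ms = System.currentTimeMillis() - start;
> >>>     System.out.println(bytes + " bytes in : " + ms + " ms. ("
> >>>         + (bytes / 1048576.0) / (ms / 1000.0) + " MB/s)");
> >>>     inner.close();
> >>>   }
> >>> }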
> >>>
> >>> Regarding the JobTracker only doing 1 task / node / 2 seconds: that
> >>> will definitely hurt, although the above discrepancy takes priority
> >>> for me for now.
> >>>
> >>> "What does the web interface tell you about the number of concurrent
> >>> map tasks during the run? Does it approach the max task slots?"
> >>> - Yeah, it definitely does. From the TaskTracker page on each node I'm
> >>> seeing almost always 2 "RUNNING" tasks (and an accumulating list of
> >>> "COMMIT_PENDING" tasks under Non-Running, which slowly grows as the
> >>> job progresses). Normal?
> >>>
> >>> Also, I used a profiler to profile a local/non-hadoop test of the
> >>> RecordReader/Map():
> >>> class: %Time
> >>> org.apache.commons.lang.time.FastDateFormat.format(long): 46%
> >>> com......parseFrom(byte[]): 42%
> >>> java.io.FileInputStream.read(byte[], int, int): 5%
> >>> ...rest are 1%-ish
> >>> I guess this doesn't show anything helpful. I'll try to attach it to
> >>> hadoop remotely... anyone have any experience doing this w/ YourKit
> >>> Java Profiler?
> >>>
> >>> Anyways, decided to test the "large files" vs "small files" theory again:
> >>> Small files (1449 files, ranging 10-100MB, average 32MB):
> >>> - HDFS bytes read: 49,057,491,374
> >>> - Map input records: 737,850,142
> >>> - Finished in: 7mins, 26sec
> >>> ... 104.898 MB/s
> >>> Large files (22 files, around 500MB, average 514MB):
> >>> - HDFS bytes read: 11,852,421,152
> >>> - Map input records: 179,657,432
> >>> - Finished in: 1min, 8sec
> >>> ... 166.225 MB/s
> >>>
> >>> Not sure why the large files were taking longer before; perhaps the
> >>> SimpleDateFormat -> FastDateFormat change? Anyways, good to see where
> >>> I need to take the file sizes to... but still 166 MB/s is not the rate
> >>> I was hoping for (given the # of nodes and local performance).
> >>>
> >>> So I guess in summary, the hadoop TaskTracker$Child processes that are
> >>> doing the Map() and RecordReader are about 50% slower than the normal,
> >>> local non-hadoop version. In addition, their rate (~25MB/s) * Num
> >>> Nodes (10) suggests ~250MB/s total job performance, but I'm only
> >>> seeing ~166MB/s.
> >>>
> >>> Will
> >>>
> >>> On Tue, Jul 28, 2009 at 6:35 PM, Scott Carey <[email protected]> wrote:
> >>>> See below:
> >>>>
> >>>> On 7/28/09 12:15 PM, "william kinney" <[email protected]> wrote:
> >>>>
> >>>>> Sorry, forgot to include that detail.
> >>>>>
> >>>>> Some data from ganglia:
> >>>>>
> >>>>> CPU:
> >>>>> - on all 10 nodes, I am seeing 85-95% CPU usage for the life of the
> >>>>> job, with about 10% of that being "System" CPU vs "User".
> >>>>> - Single node graph: http://imagebin.org/57520
> >>>>> - Cluster graph: http://imagebin.org/57523
> >>>>
> >>>> Ok, CPU is definitely loaded. Identify which processes are primarily
> >>>> responsible (Tasks? Datanode? Tasktracker?). You'll want to make the
> >>>> processes eating CPU during a run spit out some stack traces to
> >>>> 'profile' the activity. Use either the 'jstack' utility from the JDK,
> >>>> or do a 'kill -3 <pid>' on a java process to spit the stack trace out
> >>>> to stdout. You'll want to do this a handful of times on a single job
> >>>> if possible, to identify any trends.
> >>>>
> >>>>> Memory:
> >>>>> - Memory used before the job is about 0.4GB. During the job it
> >>>>> fluctuates up to 0.6GB and 0.7GB, then back down to 0.4GB. Most of
> >>>>> the node memory (8GB) is showing as "Cached".
> >>>>> - Single node graph: http://imagebin.org/57522
> >>>>
> >>>> So the OS is mostly just caching disk files in RAM.
> >>>>
> >>>>> Network:
> >>>>> - IN and OUT: each node 6-12MB/s, cumulative about 30-44MB/s.
> >>>>> - Single node graph: http://imagebin.org/57521
> >>>>> - Cluster graph: http://imagebin.org/57525
> >>>>
> >>>> That is not insignificant, but cumulative across the cluster it's not
> >>>> much.
> >>>>
> >>>>> iostat (disk) (sampled most of the nodes; below values are ranges I saw):
> >>>>> tps: 0.41-1.27
> >>>>> Blk_read/s: 46-58
> >>>>> Blk_wrtn/s: 20-23
> >>>>> (have two disks per node, both SAS, 10k RPM)
> >>>>
> >>>> Did you do iostat with a parameter to have it spit out more than one
> >>>> row? By default, it spits out data averaged since boot time, like
> >>>> vmstat. My favorite iostat params for monitoring are:
> >>>> iostat -mx 5
> >>>> iostat -dmx 5
> >>>> (or 10 or 15 or 60 second intervals, depending on what I'm doing).
> >>>> Ganglia might have some I/O info -- you want both iops and some sort
> >>>> of bytes/sec measurement.
> >>>>
> >>>>> ---
> >>>>> Are those Blk_read/wrtn/s as in block size (4096?) = bytes/second?
> >>>>
> >>>> I think it's the 512-byte block notion, but I always use -m to put it
> >>>> in useful units.
> >>>>
> >>>>> Also, from the job page (different job, same Map method, just more
> >>>>> data... ~40GB, 781 files):
> >>>>> Map input records: 629,738,080
> >>>>> Map input bytes: 41,538,992,880
> >>>>>
> >>>>> Anything else I can look into?
> >>>>
> >>>> Based on your other email:
> >>>>
> >>>> There are almost 800 map tasks, and these seem to mostly be data-local.
> >>>> The current implementation of the JobTracker schedules rather slowly,
> >>>> and can at best place one new task per node per 2 seconds or so on a
> >>>> small cluster. So, with 10 servers, it will take at least 80 seconds
> >>>> just to schedule all the tasks. If each server can run 8 tasks
> >>>> concurrently, then unless the average task takes somewhat longer than
> >>>> 16 seconds, the system will not reach full utilization.
> >>>>
> >>>> What does the web interface tell you about the number of concurrent
> >>>> map tasks during the run? Does it approach the max task slots?
> >>>>
> >>>> You can look at the logs for an individual task, and see how much data
> >>>> it read and how long it took. It might be hitting your 50MB/sec or
> >>>> close in a burst, or perhaps not.
> >>>>
> >>>> Given the sort of bottlenecks I often see, I suspect the scheduling.
> >>>> But you have almost maxed CPU use, so it's probably not that. Getting
> >>>> stack dumps to see what the processor is doing during your test will
> >>>> help narrow it down.
> >>>>
> >>>>> Do my original numbers (only 2x performance) jump out at you as being
> >>>>> way off? Or is it common to see that with a setup similar to mine?
> >>>>>
> >>>>> I should also note that, given it's a custom binary format, I do not
> >>>>> support splitting (isSplittable() is false). I don't think that would
> >>>>> account for such a large discrepancy in expected performance, would it?
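> >>>>>
> >>>>> (For concreteness, the no-split piece is just an override -- sketch
> >>>>> against the 0.18 mapred API; class names and key/value types are
> >>>>> illustrative stand-ins for our format. Note Hadoop spells the method
> >>>>> isSplitable:)
> >>>>>
> >>>>> import java.io.IOException;
> >>>>> import org.apache.hadoop.fs.FileSystem;
> >>>>> import org.apache.hadoop.fs.Path;
> >>>>> import org.apache.hadoop.io.BytesWritable;
> >>>>> import org.apache.hadoop.io.LongWritable;
> >>>>> import org.apache.hadoop.mapred.FileInputFormat;
> >>>>> import org.apache.hadoop.mapred.InputSplit;
> >>>>> import org.apache.hadoop.mapred.JobConf;
> >>>>> import org.apache.hadoop.mapred.RecordReader;
> >>>>> import org.apache.hadoop.mapred.Reporter;
> >>>>>
> >>>>> public class ProtobufInputFormat
> >>>>>     extends FileInputFormat<LongWritable, BytesWritable> {
> >>>>>
> >>>>>   // No sync markers in the file framing, so an arbitrary byte offset
> >>>>>   // is not a record boundary: force one map task per file.
> >>>>>   protected boolean isSplitable(FileSystem fs, Path file) {
> >>>>>     return false;
> >>>>>   }
> >>>>>
> >>>>>   public RecordReader<LongWritable, BytesWritable> getRecordReader(
> >>>>>       InputSplit split, JobConf job, Reporter reporter) throws IOException {
> >>>>>     return new ProtobufRecordReader(split, job);  // hypothetical reader class
> >>>>>   }
> >>>>> }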
> >>>>
> >>>> If the files are all larger than the block size, it would cause a lot
> >>>> more network activity -- but unless your switch or network is broken
> >>>> or not gigabit, there is a lot of capacity left in the network.
> >>>>
> >>>>> Thanks for the help,
> >>>>> Will
> >>>>>
> >>>>> On Tue, Jul 28, 2009 at 12:58 PM, Scott Carey <[email protected]> wrote:
> >>>>>> Well, the first thing to do in any performance bottleneck
> >>>>>> investigation is to look at the machine hardware resource usage.
> >>>>>>
> >>>>>> During your test, what is the CPU use and disk usage? What about
> >>>>>> network utilization? Top, vmstat, iostat, and some network usage
> >>>>>> monitoring would be useful. It could be many things causing your
> >>>>>> lack of scalability, but without actually monitoring your machines
> >>>>>> to see if there is an obvious bottleneck, it's just random guessing
> >>>>>> and hunches.
> >>>>>>
> >>>>>> On 7/28/09 8:18 AM, "william kinney" <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Thanks in advance for the help!
> >>>>>>>
> >>>>>>> I have a performance question relating to how fast I can expect
> >>>>>>> Hadoop to scale. Running Cloudera's 0.18.3-10.
> >>>>>>>
> >>>>>>> I have a custom binary format, which is just Google Protocol Buffer
> >>>>>>> (protobuf) serialized data:
> >>>>>>>
> >>>>>>> 669 files, ~30GB total size (ranging 10MB to 100MB each).
> >>>>>>> 128MB block size.
> >>>>>>> 10 Hadoop nodes.
> >>>>>>>
> >>>>>>> I tested the InputFormat and RecordReader for my input format, and
> >>>>>>> it showed about 56MB/s performance (single thread, no hadoop, test
> >>>>>>> file passed in via FileInputStream instead of FSDataInputStream) on
> >>>>>>> hardware similar to what I have in my cluster. I also then tested
> >>>>>>> some simple Map logic along w/ the above, and got around 54MB/s. I
> >>>>>>> believe that difference can be accounted for by parsing the
> >>>>>>> protobuf data into java objects.
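> >>>>>>>
> >>>>>>> (To picture the parse cost: the reader is a loop over framed
> >>>>>>> records. The sketch below assumes a simple 4-byte length prefix --
> >>>>>>> our actual framing may differ -- and MyRecord stands in for the
> >>>>>>> generated protobuf class:)
> >>>>>>>
> >>>>>>> import java.io.DataInputStream;
> >>>>>>> import java.io.EOFException;
> >>>>>>> import java.io.IOException;
> >>>>>>>
> >>>>>>> // Returns the next record, or null at a clean end-of-stream.
> >>>>>>> static MyRecord readRecord(DataInputStream in) throws IOException {
> >>>>>>>   int len;
> >>>>>>>   try {
> >>>>>>>     len = in.readInt();        // assumed 4-byte length prefix
> >>>>>>>   } catch (EOFException e) {
> >>>>>>>     return null;
> >>>>>>>   }
> >>>>>>>   byte[] buf = new byte[len];
> >>>>>>>   in.readFully(buf);
> >>>>>>>   return MyRecord.parseFrom(buf);  // generated parseFrom(byte[]),
> >>>>>>>                                    // protobuf-java 2.0.3
> >>>>>>> }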
> >>>>>>>
> >>>>>>> Anyways, when I put this logic into a job that has
> >>>>>>> - no reduce (.setNumReduceTasks(0);)
> >>>>>>> - no emit
> >>>>>>> - just protobuf parsing calls (like above)
> >>>>>>> I get a finish time of 10mins, 25sec, which is about 106.24 MB/s.
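> >>>>>>>
> >>>>>>> (Driver setup, for completeness -- rough sketch with the old
> >>>>>>> mapred API; class and path names are illustrative:)
> >>>>>>>
> >>>>>>> import org.apache.hadoop.fs.Path;
> >>>>>>> import org.apache.hadoop.mapred.FileInputFormat;
> >>>>>>> import org.apache.hadoop.mapred.JobClient;
> >>>>>>> import org.apache.hadoop.mapred.JobConf;
> >>>>>>> import org.apache.hadoop.mapred.lib.NullOutputFormat;
> >>>>>>>
> >>>>>>> JobConf conf = new JobConf(ParseJob.class);       // hypothetical driver class
> >>>>>>> conf.setJobName("protobuf-parse");
> >>>>>>> conf.setInputFormat(ProtobufInputFormat.class);   // the non-splittable format
> >>>>>>> conf.setMapperClass(ParseMapper.class);           // parses records, emits nothing
> >>>>>>> conf.setNumReduceTasks(0);                        // map-only: no shuffle or sort
> >>>>>>> conf.setOutputFormat(NullOutputFormat.class);     // nothing is emitted
> >>>>>>> FileInputFormat.setInputPaths(conf, new Path("/data/protobuf"));  // illustrative
> >>>>>>> JobClient.runJob(conf);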
> >>>>>>>
> >>>>>>> So my question: why is the rate only 2x what I see in a single
> >>>>>>> thread, non-hadoop test? Would it not be:
> >>>>>>> 54MB/s x 10 (Num Nodes) - small hadoop overhead?
> >>>>>>>
> >>>>>>> Is there any area of my configuration I should look into for tuning?
> >>>>>>>
> >>>>>>> Any way I could get more accurate performance monitoring of my job?
> >>>>>>>
> >>>>>>> On a side note, I tried the same job after combining the files into
> >>>>>>> about 11 files (still 30GB in size), and actually saw a decrease in
> >>>>>>> performance (~90MB/s).
> >>>>>>>
> >>>>>>> Any help is appreciated. Thanks!
> >>>>>>>
> >>>>>>> Will
> >>>>>>>
> >>>>>>> some hadoop-site.xml values:
> >>>>>>> dfs.replication: 3
> >>>>>>> io.file.buffer.size: 65536
> >>>>>>> dfs.datanode.handler.count: 3
> >>>>>>> mapred.tasktracker.map.tasks.maximum: 6
> >>>>>>> dfs.namenode.handler.count: 5