To simplify things, I removed the protobuf parse and DateFormat calls. Local execution was now over 200MB/s, which is much greater than my physical disk performance (hdparm -t shows ~80MB/s). I ran "sync" and flushed the OS page cache (echo 3 > /proc/sys/vm/drop_caches), and now get around 110MB/s. That's still higher than the disk's physical limit, so I'm quite puzzled.
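For reference, the stripped-down local test is essentially just a timed read loop like the sketch below (an approximation of my RecordReaderTest with the parsing removed; the class name and 64KB buffer are placeholders, not my exact code):

import java.io.FileInputStream;

public class ReadThroughputTest {
    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[65536];          // placeholder; roughly matches io.file.buffer.size
        long bytes = 0;
        long start = System.currentTimeMillis();
        FileInputStream in = new FileInputStream(args[0]);
        try {
            int n;
            // Sequential read only -- no protobuf parse, no DateFormat work.
            while ((n = in.read(buf)) != -1) {
                bytes += n;
            }
        } finally {
            in.close();
        }
        long ms = Math.max(1, System.currentTimeMillis() - start);  // avoid divide-by-zero on tiny files
        System.out.println(bytes + " bytes in " + ms + " ms ("
                + (bytes / 1048576.0) / (ms / 1000.0) + " MB/s)");
    }
}

Invoked the same way as the earlier local test, e.g. java -cp . ReadThroughputTest <fileToTest>.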
When running this pared-down Map job with 22 ~500MB files (~11GB), I see a total job throughput of 364.624 MB/s (finished in 31 seconds). Looking at the TaskTracker$Child logs to see how long each process was taking per file, I saw 30-37 MB/s. So I guess two things:
- The total job performance/time is very good, and looks to scale with my 10 nodes (~34 MB/s * 10?). Should it have scaled even better, since I have 2 TT Children running per node?
- The per-child performance is still much less than what I see in a local process (~30 MB/s vs 110 MB/s). I'll need to profile some of the processes to find out where the time is going, correct? Any other ideas?

Thanks!
Will

On Thu, Jul 30, 2009 at 10:37 AM, william kinney<[email protected]> wrote:
> Local is executed on a Hadoop node (when no job is running), So same > JRE/hardware. > > JRE: > java version "1.6.0_13" > Java(TM) SE Runtime Environment (build 1.6.0_13-b03) > Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) > > JVM arguments for child task: > /usr/java/jdk1.6.0_13/jre/bin/java > -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work > -Xmx486m > -Djava.io.tmpdir=/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work/tmp > -classpath > /etc/hadoop/conf:/usr/java/default/lib/tools.jar:/usr/lib/hadoop:/usr/lib/hadoop/hadoop-0.18.3-10.cloudera.CH0_3-core.jar:/usr/lib/hadoop/lib/commons-cli-2.0-SNAPSHOT.jar:/usr/lib/hadoop/lib/commons-codec-1.3.jar:/usr/lib/hadoop/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop/lib/commons-net-1.4.1.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-fairscheduler.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-scribe-log4j.jar:/usr/lib/hadoop/lib/hsqldb.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/hadoop/lib/jetty-5.1.4.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/hadoop/lib/kfs-0.1.3.jar:/usr/lib/hadoop/lib/log4j-1.2.15.jar:/usr/lib/hadoop/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/hadoop/lib/servlet-api.jar:/usr/lib/hadoop/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib/hadoop/lib/jetty-ext/commons-el.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-compiler.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-runtime.jar:/usr/lib/hadoop/lib/jetty-ext/jsp-api.jar::/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/jars/classes:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/jars:/disk2/hadoop/mapred/local/taskTracker/archive/master.ph-cloud.detica.us.com/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/ph-gpb-1.0.5.jar/ph-gpb-1.0.5.jar:/disk2/hadoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/protobuf-java-2.0.3.jar/protobuf-java-2.0.3.jar:/disk1/hadoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/commons-lang-2.4.jar/commons-lang-2.4.jar:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work > -Dhadoop.log.dir=/var/log/hadoop -Dhadoop.root.logger=INFO,TLA > -Dhadoop.tasklog.taskid=attempt_200907242015_0048_m_000008_0 > -Dhadoop.tasklog.totalLogFileSize=0 >
org.apache.hadoop.mapred.TaskTracker$Child 127.0.0.1 52698 > attempt_200907242015_0048_m_000008_0 > > child task heap: -Xmx486m > > (mapred.child.ulimit is 995328) > > Local call has no JVM arguments, just: > java -cp <myjar>.jar com......RecordReaderTest <fileToTest> > > Hardware platform: > Linux 2.6.27.24-170.2.68.fc10.x86_64 > 8GB > Dual Core Xeon E5205 > 2 x 300GB SAS 10k, no RAID. > (HP ProLiant blade) > > Data is not compressed. > > JobTracker: > Running: Started around 20, but as the job progressed it slowly > increased to at the end: 432 (when Pending was 0). Running dropped to > 0/Status was marked Succeeded about 10 seconds after that. Is this > normal? The total # of Tasks was 1449. > > Stack Traces. > Looked at about 20 stack traces from 2 different nodes. Consistently saw: > 2 x org.apache.hadoop.dfs.DFSClient$LeaseChecker @ Thread.sleep() > "Comm thread for attempt_200907242015_0050_m_001409_0" @ Thread.sleep() > "IPC Client (47) connection to <master-hostname>/192.168.1.100:8020 > from wkinney" @ Object.wait() > "IPC Client (47) connection to /127.0.0.1:49202 from an unknown user" > @ Object.wait() > VM, GC, Signal Dispatcher, Low Memory Detector, CompilerThread, > Finalizer, Reference Handler... > > Then would sometimes see FastDateFormat thread, parseFrom(), or > somewhere near there (e.g. MapRunner.run()) > > Finally, I consistently saw this: > "Thread-5" daemon prio=10 tid=0x0000000040bbfc00 nid=0x2f87 in > Object.wait() [0x00007fb7498ce000..0x00007fb7498cebf0] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00007fb769fdec00> (a java.util.LinkedList) > at > org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1905) > - locked <0x00007fb769fdec00> (a java.util.LinkedList) > I'm guessing this is normal DataNode activity... > > Will > > > On Thu, Jul 30, 2009 at 1:31 AM, Scott Carey<[email protected]> wrote: >> What is the JRE for the Hadoop nodes versus local? What are the JVM >> arguments for the child tasks and the local version (and heap size)? What >> is the hardware and platform details for the nodes versus the local test? >> Is the data compressed in Hadoop (check the config)? >> >> You mention the TaskTracker web inerface during a job, but what about the >> JobTracker interface? This should show the global view of currently >> scheduled maps versus total slots. >> >> Lastly, check out some more stack traces on the tasks. If they are all >> still in the DateFormat stuff? Surely some of them should be in your >> parseFrom() method too? >> >> >> On 7/29/09 9:07 PM, "william kinney" <[email protected]> wrote: >> >> OK: >> implemented some iotop/iostat monitoring in ganglia. Looks pretty >> standard (job was 23:00 to 23:06): >> - Single Node Disk Read: http://imagebin.org/57716 >> - Single Node Disk Write: http://imagebin.org/57717 >> >> On each node, noticed that the two TaskTracker$Child processes were >> consuming close to 90% of each core. The TaskTracker and DataNode were >> close to 0%. For the TT children, I did jstack dumps, but didn't >> really see much that popped out other than a lot of time spent in a >> SimpleDateFormat section and the protobuf parse. I switched the SDF >> out with commons.lang FastDateFormat, which reduced the total time for >> both the Hadoop job and the local/non-hadoop test, so still a >> discrepancy between local and hadoop runs. 
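(Quick note on the FastDateFormat swap mentioned just above: FastDateFormat is immutable and thread-safe, so a single instance can be shared instead of going through SimpleDateFormat per record. The sketch below shows roughly what the change looks like; the pattern string and class name are made up, not my actual code.)

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.commons.lang.time.FastDateFormat;

public class DateFormatSwap {
    // FastDateFormat is immutable and thread-safe, so one instance can be
    // shared across all records processed in the same JVM.
    private static final FastDateFormat FDF =
            FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) {
        long ts = System.currentTimeMillis();

        // Old path: SimpleDateFormat is not thread-safe, so it usually ends up
        // being created (or synchronized on) per call; this was the hot spot
        // in the stack dumps.
        String slow = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts));

        // New path: reuse the shared FastDateFormat; format(long) is the call
        // that showed up at ~46% in the local profile.
        String fast = FDF.format(ts);

        System.out.println(slow + " / " + fast);
    }
}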
>> >> "You can look at the logs for an individual task, and see how much data it >> read, and how long it took. It might be hitting your 50MB/sec or close in a >> burst, or perhaps not." >> - I decided to log the performance of each RecordReader use within >> hadoop, which is essentially 1:1 for TaskTracker$Child process since I >> have 1 InputSplit per file (ie, no splitting), right?. Saw: >> Example 1) 527639090 bytes in : 18050 ms. (27.8778 MB/s) >> Example 2) 533770314 bytes in : 23494 ms. (21.6669 MB/s) >> Example 3) 529711886 bytes in : 20092 ms. (25.1429 MB/s) >> ...etc >> For reference, the non-hadoop/local test: >> 530710906 bytes in : 9133 ms. (55.41721 MB/s) >> >> Regarding the JobTracker only doing 1 task / node / 2 seconds, that >> will definitely hurt. Although the above discrepancy takes priority >> for me, for now. >> >> "What does the web interface tell you about the number of concurrent map >> tasks during the run? Does it approach the max task slots?" >> - Yeah it definitely does, from the TaskTracker page on each node, >> I'm seeing almost always 2 "RUNNING" tasks (and an accumulating list >> of "COMMIT_PENDING" tasks under Non-Running, which slowly grows as the >> job progresses). Normal? >> >> Also, I used a profiler to profile a local/non-hadoop test of the >> RecordReader/Map(): >> class: %Time >> org.apache.commons.lang.time.FastDateFormat.format(long): 46% >> com......parseFrom(byte[]): 42% >> java.io.FileInputStream.read(byte[], int, int): 5% >> ...rest are 1%'ish >> I guess this doesn't show anything helpful. I'll try to attach it to >> hadoop remotely...anyone have any experience doing this w/ YourKit >> Java Profiler? >> >> Anyways, decided to test the "large files" vs "small files" theory again: >> Small files (1449 files, ranging 10-100MB. average: 32 MB) >> - HDFS bytes read 49,057,491,374 >> - Map input records 737,850,142 >> - Finished in: 7mins, 26sec >> ... 104.898 MB/s >> Large files (22 files, around 500MB. average 514MB) >> - HDFS bytes read 11,852,421,152 >> - Map input records 179,657,432 >> - Finished in: 1mins, 8sec >> ... 166.225 MB/s >> >> Not sure why before the large files were taking longer, perhaps the >> SimpleDateFormat>FastDateFormat change? Anyways, good to see where I >> need to take the file sizes too...but still 166 MB is not the rate I >> was hoping for (given the # of nodes and local performance). >> >> So I guess in summary, hadoop TaskTracker$Child processes that are >> doing the Map() and RecordReader are about 50% slower than the normal, >> local non-hadoop version. In addition, their rate (~25MB/s) * Num >> Nodes (10) suggests ~ 250MB/s total job performance, but I'm only >> seeing ~166MB/s. >> >> Will >> >> On Tue, Jul 28, 2009 at 6:35 PM, Scott Carey<[email protected]> wrote: >>> See below: >>> >>> >>> On 7/28/09 12:15 PM, "william kinney" <[email protected]> wrote: >>> >>>> Sorry, forgot to include that detail. >>>> >>>> Some data from ganglia: >>>> >>>> CPU: >>>> - on all 10 nodes, I am seeing for the life of the job 85-95% CPU >>>> usage, with about 10% of that being "System" CPU, vs "User". >>>> - Single node graph: http://imagebin.org/57520 >>>> - Cluster graph: http://imagebin.org/57523 >>> >>> Ok, CPU is definitely loaded. Identify which processes are primarily >>> responsible (Tasks? Datanode? Tasktracker?) You'll want to make the >>> processes eating CPU during a run spit out some stack traces to 'profile' >>> the activity. 
Use either the 'jstack' utility with the JDK, or do a 'kill >>> -3 <pid>' on a java process to spit out the stack trace to stdout. You'll >>> want to do this a handful of times on a single job if possible to identify >>> any trends. >>> >>>> >>>> Memory: >>>> - Memory used before job is about 0.4GB, During job it fluctuates >>>> up to 0.6GB and 0.7GB, then back down to 0.4GB. Most of the node >>>> memory (8GB) is showing as "Cached". >>>> - Single node graph: http://imagebin.org/57522 >>> >>> So the OS is mostly just caching disk files in RAM. >>> >>>> >>>> Network: >>>> - IN and OUT: Each node 6-12MB/s, cumulative about 30-44MB/s. >>>> - Single node graph: http://imagebin.org/57521 >>>> - Cluster graph: http://imagebin.org/57525 >>>> >>> >>> That is a not insignificant, but cumulative across the cluster its not much. >>> >>>> iostat (disk) (sampled most of the nodes, below values are ranges I saw): >>>> tps: 0.41-1.27 >>>> Blk_read/s: 46-58 >>>> Blk_wrtn/s: 20-23 >>>> (have two disks per node, both SAS, 10k RPM) >>>> >>> >>> Did you do iostat with a parameter to have it spit out more than one row? >>> By default, it spits out data averaged since boot time, like vmstat. >>> My favorite iostat params for monitoring are: >>> iostat -mx 5 >>> iostat -dmx 5 >>> (or 10 or 15 or 60 second intervals depending on what I'm doing) Ganglia >>> might have some I/O info -- you want both iops and some sort of bytes/sec >>> measurement. >>> >>>> --- >>>> Are those Blk_read/wrtn/s as in block size (4096?) = bytes/second? >>>> >>> >>> I think its the 512 byte block notion, but I always use -m to put it in >>> useful units. >>> >>>> Also, from the job page (different job, same Map method, just more >>>> data...~40GB. 781 files): >>>> Map input records 629,738,080 >>>> Map input bytes 41,538,992,880 >>>> >>>> Anything else I can look into? >>> >>> Based on your other email: >>> >>> There are almost 800 map tasks, these seem to mostly be data local. The >>> current implementation of the JobTracker schedules rather slowly, and can at >>> best place one new task per node per 2 seconds or so on a small cluster. >>> So, with 10 servers, it will take at least 80 seconds just to schedule all >>> the tasks. >>> If each server can run 8 tasks concurrently, then if the average task >>> doesn't take somewhat longer than 16 seconds, the system will not reach full >>> utilization. >>> >>> What does the web interface tell you about the number of concurrent map >>> tasks during the run? Does it approach the max task slots? >>> >>> You can look at the logs for an individual task, and see how much data it >>> read, and how long it took. It might be hitting your 50MB/sec or close in a >>> burst, or perhaps not. >>> >>> Given the sort of bottlenecks I often see, I suspect the scheduling. But, >>> you have almost maxed CPU use, so its probably not that. Getting stack >>> dumps to see what the processor is doing during your test will help narrow >>> it down. >>> >>> >>>> >>>> Do my original numbers (only 2x performance) jump out at you as being >>>> way off? Or it is common to see that a setup similar to mine? >>>> >>>> I should also note that given its a custom binary format, I do not >>>> support Splitting (isSplittable() is false). I don't think that would >>>> count for such a large discrepancy in expected performance, would it? 
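(For context on the splitting comment above: the no-split part of my input format is roughly the sketch below, against the 0.18 mapred API. The class name and key/value types are only illustrative, and the actual record reader for the protobuf files is omitted.)

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ProtobufInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    // Returning false means one InputSplit (and therefore one map task) per
    // file, regardless of how many HDFS blocks the file spans.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // The real reader walks the serialized protobuf records; omitted in this sketch.
        throw new UnsupportedOperationException("record reader omitted in this sketch");
    }
}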
>>>> >>> >>> If the files are all larger than the block size, it would cause a lot more >>> network activity -- but unless your switch or network is broken or not >>> gigabit -- there is a lot of capacity left in the network. >>> >>>> Thanks for the help, >>>> Will >>>> >>>> >>>> On Tue, Jul 28, 2009 at 12:58 PM, Scott Carey<[email protected]> >>>> wrote: >>>>> Well, the first thing to do in any performance bottleneck investigation is >>>>> to look at the machine hardware resource usage. >>>>> >>>>> During your test, what is the CPU use and disk usage? What about network >>>>> utilization? >>>>> Top, vmstat, iostat, and some network usage monitoring would be useful. >>>>> It >>>>> could be many things causing your lack of scalability, but without >>>>> actually >>>>> monitoring your machines to see if there is an obvious bottleneck its just >>>>> random guessing and hunches. >>>>> >>>>> >>>>> >>>>> On 7/28/09 8:18 AM, "william kinney" <[email protected]> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Thanks in advance for the help! >>>>>> >>>>>> I have a performance question relating to how fast I can expect Hadoop >>>>>> to scale. Running Cloudera's 0.18.3-10. >>>>>> >>>>>> I have custom binary format, which is just Google Protocol Buffer >>>>>> (protobuf) serialized data: >>>>>> >>>>>> 669 files, ~30GB total size (ranging 10MB to 100MB each). >>>>>> 128MB block size. >>>>>> 10 Hadoop Nodes. >>>>>> >>>>>> I tested my InputFormat and RecordReader for my input format, and it >>>>>> showed about 56MB/s performance (single thread, no hadoop, passed in >>>>>> test file via FileInputFormat instead of FSDataInputStream) on >>>>>> hardware similar to what I have in my cluster. >>>>>> I also then tested some simple Map logic along w/ the above, and got >>>>>> around 54MB/s. I believe that difference can be accounted for parsing >>>>>> the protobuf data into java objects. >>>>>> >>>>>> Anyways, when I put this logic into a job that has >>>>>> - no reduce (.setNumReduceTasks(0);) >>>>>> - no emit >>>>>> - just protobuf parsing calls (like above) >>>>>> >>>>>> I get a finish time of 10mins, 25sec, which is about 106.24 MB/s. >>>>>> >>>>>> So my question, why is the rate only 2x what I see on a single thread, >>>>>> non-hadoop test? Would it not be: >>>>>> 54MB/s x 10 (Num Nodes) - small hadoop overhead ? >>>>>> >>>>>> Is there any area of my configuration I should look into for tuning? >>>>>> >>>>>> Anyway I could get more accurate performance monitoring of my job? >>>>>> >>>>>> On a side note, I tried the same job after combining the files into >>>>>> about 11 files (still 30GB in size), and actually saw a decrease in >>>>>> performance (~90MB/s). >>>>>> >>>>>> Any help is appreciated. Thanks! >>>>>> >>>>>> Will >>>>>> >>>>>> some hadoop-site.xml values: >>>>>> dfs.replication 3 >>>>>> io.file.buffer.size 65536 >>>>>> dfs.datanode.handler.count 3 >>>>>> mapred.tasktracker.map.tasks.maximum 6 >>>>>> dfs.namenode.handler.count 5 >>>>>> >>>>> >>>>> >>>> >>> >>> >> >> >
