Re: Map performance with custom binary format

william kinney Tue, 28 Jul 2009 14:40:33 -0700

                      Counter                          Map  Reduce           Total
File Systems          HDFS bytes read       41,538,992,880       0  41,538,992,880

Job Counters          Rack-local map tasks               0       0              49
                      Launched map tasks                 0       0             794
                      Data-local map tasks               0       0             732

Map-Reduce Framework  Map input records        629,738,080       0     629,738,080
                      Map input bytes       41,538,992,880       0  41,538,992,880
                      Map output records                 0       0               0
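
To put the counters in terms of the question asked (data-local maps versus total maps), here is a minimal sketch of the arithmetic. The class and method names are made up for illustration; the only inputs are the counter values pasted above.

```java
// Hypothetical helper: derives a few ratios from the job counters above.
class CounterSummary {

    // Fraction of launched map tasks that ran data-local.
    static double dataLocalFraction(long dataLocal, long launched) {
        return (double) dataLocal / launched;
    }

    // Average size of one input record in bytes.
    static double avgRecordBytes(long inputBytes, long records) {
        return (double) inputBytes / records;
    }

    // Average input handled by one map task, in bytes.
    static double avgBytesPerMap(long inputBytes, long launched) {
        return (double) inputBytes / launched;
    }

    public static void main(String[] args) {
        long inputBytes = 41_538_992_880L; // Map input bytes
        long records = 629_738_080L;       // Map input records
        long launched = 794;               // Launched map tasks
        long dataLocal = 732;              // Data-local map tasks

        System.out.printf("data-local maps: %.1f%%%n",
                100.0 * dataLocalFraction(dataLocal, launched));
        System.out.printf("avg record size: %.1f bytes%n",
                avgRecordBytes(inputBytes, records));
        System.out.printf("avg input per map: %.1f MB%n",
                avgBytesPerMap(inputBytes, launched) / 1e6);
    }
}
```

With the numbers above this works out to roughly 92% data-local maps, about 66 bytes per record, and about 52 MB of input per map task.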

The 50MB/s figure was not measured inside a Hadoop task, but with a
local, single-threaded Java command-line program that called the
RecordReader with a FileInputStream over a test file (~100MB, taken
from one of the files on the HDFS used in the job) and looped through
it (i.e., while((bytesRead = gbr.readGPB(BytesWritable, LongWritable)) > 0) ).
I then did the protobuf parsing as it appears in my Hadoop job's map
method. Throughput was ~50MB/s. I ran it locally on one of the boxes
that Hadoop runs on (to ensure the same hardware and JVM).
So the file wasn't already in memory; it was read from disk via
FileInputStream (I didn't use BufferedInputStream).
The hardware is pretty beefy: dual-core Xeon 2.6GHz, 2 x 10K SAS, 8GB RAM.

Sun JVM "1.6.0_13", 64-bit HotSpot.
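
For reference, a minimal sketch of the kind of local throughput test described above. This is not my actual RecordReader: it assumes a hypothetical record layout (4-byte big-endian length followed by the payload), skips the protobuf parsing, and all class and method names are made up for illustration. It times the same loop over a plain FileInputStream and over a BufferedInputStream, since the test above did not use buffering.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

class RecordReadBenchmark {

    // Reads records until end of file and returns the bytes consumed.
    // Assumed (hypothetical) layout: 4-byte length, then that many payload bytes.
    static long drain(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        while (true) {
            int len;
            try {
                len = in.readInt();
            } catch (EOFException eof) {
                return total; // end of file reached
            }
            if (len < 0) throw new IOException("negative record length: " + len);
            if (len > buf.length) buf = new byte[len];
            in.readFully(buf, 0, len);
            total += 4 + len;
        }
    }

    // Writes a synthetic file of fixed-size records; returns the bytes written.
    static long writeSample(File file, int records, int payloadSize) throws IOException {
        byte[] payload = new byte[payloadSize];
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (int i = 0; i < records; i++) {
                out.writeInt(payloadSize);
                out.write(payload);
            }
        }
        return (long) records * (4 + payloadSize);
    }

    // Throughput in MB/s (1 MB = 10^6 bytes).
    static double mbPerSec(long bytes, long nanos) {
        return (bytes / 1e6) / (nanos / 1e9);
    }

    public static void main(String[] args) throws IOException {
        File file;
        if (args.length > 0) {
            file = new File(args[0]); // must use the assumed record layout
        } else {
            file = File.createTempFile("records", ".bin");
            file.deleteOnExit();
            writeSample(file, 200_000, 66); // small synthetic records
        }
        for (boolean buffered : new boolean[] {false, true}) {
            long start = System.nanoTime();
            long bytes;
            try (InputStream in = buffered
                    ? new BufferedInputStream(new FileInputStream(file), 1 << 16)
                    : new FileInputStream(file)) {
                bytes = drain(in);
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("%-10s %d bytes, %.1f MB/s%n",
                    buffered ? "buffered" : "unbuffered", bytes, mbPerSec(bytes, elapsed));
        }
    }
}
```

One caveat with a harness like this: a file that was just written, or just read by the first pass, is most likely served from the OS page cache, so the numbers only reflect disk speed if the cache is cold.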

On Tue, Jul 28, 2009 at 4:25 PM, Ted Dunning<[email protected]> wrote:
> On Tue, Jul 28, 2009 at 12:15 PM, william kinney
> <[email protected]>wrote:
>
>>
>> Also, from the job page (different job, same Map method, just more
>> data...~40GB. 781 files):
>> Map input records       629,738,080
>> Map input bytes         41,538,992,880
>>
>> Anything else I can look into?
>
>
> Yes.  The number of data local maps and how many maps total.
>
>
>> Do my original numbers (only 2x performance) jump out at you as being
>> way off? Or is it common to see that with a setup similar to mine?
>
>
> It is way off.  My experience is that from 5 EC2 nodes, I can sustain
> 100-200MB / s to the *network*.  These are lesser machines than you have and
> you have twice as many.  Moreover, your test program is nicely designed to
> avoid all of the overhead attendant on running a full program.  It is
> reasonable to expect significant slow down due to startup and due to going
> through HDFS, but for local blocks I would expect good performance.
>
> Is it possible that the 50MB/s on a single node was not a real number?  It
> seems somewhat high but probably reasonable with modern hardware.  Was the
> file already in memory?
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
