Re: sort speeds under java, c++, and streaming

Doug Judd Thu, 08 Nov 2007 20:39:48 -0800

Thanks, Owen.  Did it look like the system was CPU bound?  It would be
interesting to see some top output for the various runs.  It would also be
interesting to profile the Java stuff in both Pipes mode and non-Pipes mode.


- Doug

On Nov 8, 2007 7:00 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

>
> On Nov 8, 2007, at 5:39 PM, Doug Judd wrote:
>
> > Can you provide more details of your test?
>
> Sure, I guess I should have been more specific to start with. *grin*
>
> The data was generated with:
> bin/hadoop jar hadoop-0.15.0-dev-examples.jar randomtextwriter \
>    -conf gridmix-text.xml \
>    -outFormat org.apache.hadoop.mapred.TextOutputFormat \
>    /gridmix/data/sort/text
> contents of gridmix-text.xml:
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>
> <configuration>
>
> <property>
>   <name>test.randomtextwrite.total_bytes</name>
>   <value>429496729600</value>
> </property>
>
> <property>
>   <name>test.randomtextwrite.min_words_key</name>
>   <value>1</value>
> </property>
>
> <property>
>   <name>test.randomtextwrite.max_words_key</name>
>   <value>10</value>
> </property>
>
> <property>
>   <name>test.randomtextwrite.min_words_value</name>
>   <value>0</value>
> </property>
>
> <property>
>   <name>test.randomtextwrite.max_words_value</name>
>   <value>200</value>
> </property>
>
> </configuration>
>
> And then ran the sort as:
>
> Java:
> bin/hadoop jar hadoop-0.15.0-dev-examples.jar sort \
>    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
>    -outFormat org.apache.hadoop.mapred.TextOutputFormat \
>    -outKey org.apache.hadoop.io.Text \
>    -outValue org.apache.hadoop.io.Text \
>    /gridmix/data/sort/text/part-*0 java-out
>
> Pipes:
> bin/hadoop pipes -input /gridmix/data/sort/text/part-*0 -output pipe-out \
>   -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
>   -program /gridmix/programs/pipes-sort -reduces 78 \
>   -jobconf mapred.output.key.class=org.apache.hadoop.io.Text,mapred.output.value.class=org.apache.hadoop.io.Text \
>   -writer org.apache.hadoop.mapred.TextOutputFormat
>
> Streaming:
> bin/hadoop jar contrib/hadoop-0.15.0-dev-streaming.jar \
>   -input /gridmix/data/sort/text/part-*0 -output stream-out \
>   -mapper cat -reducer cat \
>   -numReduceTasks 78
>
> Note that these are the commands I used, although they generate 400 GB
> of data and then sort only 10% of it. Clearly, it would be a bit faster
> to just generate 40 GB and sort all of it. I'm going to run the bigger
> sort in the next couple of days.
>
> > In particular, what was the Java
> > Map-reduce program that you ran?  Was it
> > src/examples/org/apache/hadoop/examples/Sort.java ?
>
> Yes
>
> > Also, I can't find anything called "RandomTextWriter" in the source
> > tarball, can you point me to it?
>
> It is in the example directory of 0.15 too. The only remaining piece
> is the pipes sort program, and I'll upload that to HADOOP-2127.
>
> -- Owen
>
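
The "400 GB generated, 10% sorted" remark in the quoted message can be checked with a little arithmetic. The sketch below is only an illustration, not part of the original test: it assumes the part files are named part-00000, part-00001, and so on, that there are 1000 of them (a hypothetical count; the thread does not say), and that Python's fnmatch is a close enough stand-in for how the part-*0 glob is expanded.

```python
from fnmatch import fnmatch

# Value of test.randomtextwrite.total_bytes from gridmix-text.xml.
TOTAL_BYTES = 429496729600

# Assumption: 1000 sequentially numbered part files. The real count
# is not stated anywhere in the thread.
HYPOTHETICAL_NUM_PARTS = 1000


def gib(n_bytes):
    """Convert a byte count to binary gigabytes (2**30 bytes each)."""
    return n_bytes / 2**30


def selected_parts(num_parts, pattern="part-*0"):
    """Return the part file names that the input glob would match."""
    names = [f"part-{i:05d}" for i in range(num_parts)]
    return [name for name in names if fnmatch(name, pattern)]


total_gib = gib(TOTAL_BYTES)
chosen = selected_parts(HYPOTHETICAL_NUM_PARTS)
fraction = len(chosen) / HYPOTHETICAL_NUM_PARTS

print(f"generated: {total_gib:.0f} GiB")
print(f"selected:  {len(chosen)} of {HYPOTHETICAL_NUM_PARTS} files ({fraction:.0%})")
print(f"sorted:    about {total_gib * fraction:.0f} GiB, assuming equal-sized files")
```

The glob keeps only names ending in 0, which is one file in ten under sequential numbering, so the sorted input is roughly a tenth of the generated data as long as the part files are of similar size.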
