On Nov 8, 2007, at 5:39 PM, Doug Judd wrote:

Can you provide more details of your test?

Sure, I guess I should have been more specific to start with. *grin*

The data was generated with:
bin/hadoop jar hadoop-0.15.0-dev-examples.jar randomtextwriter -conf gridmix-text.xml \
   -outFormat org.apache.hadoop.mapred.TextOutputFormat /gridmix/data/sort/text

Contents of gridmix-text.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<configuration>

<property>
  <name>test.randomtextwrite.total_bytes</name>
  <value>429496729600</value>
</property>

<property>
  <name>test.randomtextwrite.min_words_key</name>
  <value>1</value>
</property>

<property>
  <name>test.randomtextwrite.max_words_key</name>
  <value>10</value>
</property>

<property>
  <name>test.randomtextwrite.min_words_value</name>
  <value>0</value>
</property>

<property>
  <name>test.randomtextwrite.max_words_value</name>
  <value>200</value>
</property>

</configuration>

And then ran the sort as:

Java:
bin/hadoop jar hadoop-0.15.0-dev-examples.jar sort \
   -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
   -outFormat org.apache.hadoop.mapred.TextOutputFormat \
   -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text \
   /gridmix/data/sort/text/part-*0 java-out

Pipes:
bin/hadoop pipes -input /gridmix/data/sort/text/part-*0 -output pipe-out \
  -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -program /gridmix/programs/pipes-sort -reduces 78 \
  -jobconf mapred.output.key.class=org.apache.hadoop.io.Text,mapred.output.value.class=org.apache.hadoop.io.Text \
  -writer org.apache.hadoop.mapred.TextOutputFormat

Streaming:
bin/hadoop jar contrib/hadoop-0.15.0-dev-streaming.jar \
  -input /gridmix/data/sort/text/part-*0 -output stream-out -mapper cat -reducer cat \
  -numReduceTasks 78

Note that these are the commands I used, although they generate 400 GB of data and then sort only 10% of it (the part-*0 glob picks up only the part files whose names end in 0). Clearly, it would be a bit faster to just generate 40 GB and sort all of it. I'm just going to run the bigger sort in the next couple of days.
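
As a rough check of the 400 GB and 10% figures, here is a minimal Python sketch. It assumes the generated files follow the usual part-NNNNN naming, and the count of 1000 part files is purely hypothetical (the real number depends on how many maps the generator ran). Only the total_bytes value and the part-*0 pattern come from the commands above.

```python
from fnmatch import fnmatchcase

# Value of test.randomtextwrite.total_bytes from gridmix-text.xml.
TOTAL_BYTES = 429496729600
GIB = 2 ** 30

# Total amount of generated text, in GiB.
total_gib = TOTAL_BYTES / GIB

# Hypothetical listing: 1000 output files named part-00000 .. part-00999.
# Both the count and the zero-padded naming are assumptions for illustration.
NUM_PARTS = 1000
part_names = [f"part-{i:05d}" for i in range(NUM_PARTS)]

# The sort commands read part-*0, i.e. only the files whose name ends in 0.
selected = [name for name in part_names if fnmatchcase(name, "part-*0")]

# Fraction of files selected, and the approximate input size of the sort,
# assuming the part files are all roughly the same size.
fraction = len(selected) / len(part_names)
approx_sorted_gib = total_gib * fraction

print(f"generated: {total_gib:.0f} GiB")
print(f"selected:  {len(selected)} of {len(part_names)} files ({fraction:.0%})")
print(f"sorted:    about {approx_sorted_gib:.0f} GiB")
```

Under those assumptions, one file in ten matches the glob, so the sort input is roughly a tenth of the generated data.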

In particular, what was the Java Map-Reduce program that you ran?  Was it
src/examples/org/apache/hadoop/examples/Sort.java?

Yes

Also, I can't find anything called "RandomTextWriter" in the source tarball, can you point me to it?

It is in the examples directory of 0.15 too. The only remaining piece is the pipes sort program, and I'll upload that to HADOOP-2127.

-- Owen
