[
https://issues.apache.org/jira/browse/HADOOP-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Owen O'Malley resolved HADOOP-3522.
-----------------------------------
Resolution: Won't Fix
This is working as designed.
1. Return a new object each time for next(). This might have significant
overhead.
2. Return a new object for unknown object types and reuse the same object
for known types (like Text). Remove "if (this==anObject) return true;" check
from all equals() methods for known objects.
3. Document clearly that all user-defined classes must implement an equals()
method, which doesn't do the "if (this==anObject) return true;" check (ie. push
the problem to the user).
Description
Problem
The ValuesIterator.next() doesn't return a new object representing the next
element. It is expected that this is the case. Example from java.lang.String
source code:
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
// custom code here
}
}
Changing the private fields of the object and returning it from next() will
make this object fail the equals() test, because this==anObject will always be
true.
What is affected
The reduce() method presents (a subclass of) ValuesIterator to the user. So
every user-defined class can be affected. Manifestations of this bug in 0.17.0:
* The Text class checks for equality similarly to String.equals(), which is
shown above.
* The contrib/data_join breaks because it stores tags in a Map. The
behavior of the next() method makes Object.equals() be true for all tags.
JUnit test
Patch against hadoop-0.17.0:
Index: src/test/org/apache/hadoop/mapred/TestReduceTask.java
===================================================================
--- src/test/org/apache/hadoop/mapred/TestReduceTask.java (revision
665935)
+++ src/test/org/apache/hadoop/mapred/TestReduceTask.java (working copy)
@@ -122,4 +122,52 @@
runValueIterator(tmpDir, testCase, conf);
}
}
+
+ public void runValueIterator2(Path tmpDir, Pair[] vals,
+ Configuration conf) throws IOException {
+ FileSystem fs = tmpDir.getFileSystem(conf);
+ Path path = new Path(tmpDir, "data.in");
+ SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
+ Text.class,
+ Text.class);
+ for(Pair p: vals) {
+ writer.append(new Text(p.key), new Text(p.value));
+ }
+ writer.close();
+ SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs, Text.class,
+ Text.class, conf);
+ SequenceFile.Sorter.RawKeyValueIterator rawItr =
+ sorter.merge(new Path[]{path}, false, tmpDir);
+ @SuppressWarnings("unchecked") // WritableComparators are not generic
+ ReduceTask.ValuesIterator<Text,Text> valItr =
+ new ReduceTask.ValuesIterator<Text,Text>(rawItr,
+ WritableComparator.get(Text.class), Text.class, Text.class,
+ conf, new NullProgress());
+ while (valItr.more()) {
+ Object key = valItr.getKey();
+ String keyString = key.toString();
+ // make sure it matches!
+ assertEquals(vals[0].key, keyString);
+ // must have at least 1 value!
+ assertTrue(valItr.hasNext());
+ Text value1 = valItr.next();
+ // must have at least 2 values!
+ assertTrue(valItr.hasNext());
+ Text value2 = valItr.next();
+ // do test
+ assertNotSame(value1, value2);
+ assertTrue(!value1.equals(value2));
+ assertTrue(!value2.equals(value1));
+ // make sure the key hasn't changed under the hood
+ assertEquals(keyString, valItr.getKey().toString());
+ valItr.nextKey();
+ }
+ }
+
+ public void testValueIterator2() throws Exception {
+ Path tmpDir = new Path("build/test/test.reduce.task");
+ Configuration conf = new Configuration();
+ Pair[] test = new Pair[]{ new Pair("k1", "v1"), new Pair("k1", "v2") };
+ runValueIterator2(tmpDir, test, conf);
+ }
}
A program with a bug
I would imagine that a programmer would be really confused when everything is
equal in the example below, for any text input:
package test;
import java.io.*;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
public class TestNewReducer extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length < 2) {
System.out.println("TestNewReducer <inDir> <outDir>");
return -1;
}
JobConf conf = new JobConf(getConf(), TestNewReducer.class);
conf.setJobName("test-newreducer");
conf.setInputFormat(TextInputFormat.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setOutputFormat(TextOutputFormat.class);
conf.setOutputKeyClass(LongWritable.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new TestNewReducer(),
args);
System.exit(res);
}
public static class Map extends MapReduceBase implements Mapper
<LongWritable, Text, LongWritable, Text> {
private final static LongWritable nill = new LongWritable(1);
public void map (LongWritable key, Text value,
OutputCollector<LongWritable, Text> output,
Reporter reporter) throws IOException {
output.collect(nill, value);
}
}
public static class Reduce extends MapReduceBase implements
Reducer<LongWritable, Text, LongWritable, Text> {
private final static LongWritable nill = new LongWritable(1);
public void reduce(LongWritable key, Iterator<Text> values,
OutputCollector<LongWritable, Text> output,
Reporter reporter) throws IOException {
Text t1 = null;
if (values.hasNext())
t1 = values.next(); // t1 is the first element in values
while(values.hasNext()) {
// is this element equal to t1?
Text t = values.next();
output.collect(nill, t.equals(t1) ? new Text("Equal") : new
Text("Not equal"));
}
}
}
}
{quote}
Possible fixes
1. Return a new object each time for next(). This might have significant
overhead.
{quote}
This is the problem that was fixed. Forcing the inner loops of map/reduce to
allocate objects is a huge cost.
{quote}
2. Return a new object for unknown object types and reuse the same object
for known types (like Text). Remove "if (this==anObject) return true;" check
from all equals() methods for known objects.
{quote}
No, this will make the semantics much more confusing.
{quote}
3. Document clearly that all user-defined classes must implement an equals()
method, which doesn't do the "if (this==anObject) return true;" check (ie. push
the problem to the user).
{quote}
I don't understand your point. *Any* equals function must return true if this
== anObject.
It *should* be documented better in the reduce interface that the iterator
returns the same (modified) value each time.
> ValuesIterator.next() doesn't return a new object, thus failing many equals()
> tests.
> ------------------------------------------------------------------------------------
>
> Key: HADOOP-3522
> URL: https://issues.apache.org/jira/browse/HADOOP-3522
> Project: Hadoop Core
> Issue Type: Bug
> Affects Versions: 0.17.0
> Reporter: Spyros Blanas
>
> h2.Problem
> The ValuesIterator.next() doesn't return a new object representing the next
> element.
> It is expected that this is the case. Example from java.lang.String source
> code:
> {noformat}
> public boolean equals(Object anObject) {
> if (this == anObject) {
> return true;
> }
> if (anObject instanceof String) {
> // custom code here
> }
> }
> {noformat}
> Changing the private fields of the object and returning it from next() will
> make this object fail the equals() test, because this==anObject will always
> be true.
> h2. What is affected
> The reduce() method presents (a subclass of) ValuesIterator to the user. So
> every user-defined class can be affected.
> Manifestations of this bug in 0.17.0:
> * The Text class checks for equality similarly to String.equals(), which is
> shown above.
> * The contrib/data_join breaks because it stores tags in a Map. The behavior
> of the next() method makes Object.equals() be true for all tags.
> h2. JUnit test
> Patch against hadoop-0.17.0:
> {noformat}
> Index: src/test/org/apache/hadoop/mapred/TestReduceTask.java
> ===================================================================
> --- src/test/org/apache/hadoop/mapred/TestReduceTask.java (revision
> 665935)
> +++ src/test/org/apache/hadoop/mapred/TestReduceTask.java (working copy)
> @@ -122,4 +122,52 @@
> runValueIterator(tmpDir, testCase, conf);
> }
> }
> +
> + public void runValueIterator2(Path tmpDir, Pair[] vals,
> + Configuration conf) throws IOException {
> + FileSystem fs = tmpDir.getFileSystem(conf);
> + Path path = new Path(tmpDir, "data.in");
> + SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
> + Text.class,
> + Text.class);
> + for(Pair p: vals) {
> + writer.append(new Text(p.key), new Text(p.value));
> + }
> + writer.close();
> + SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs, Text.class,
> + Text.class, conf);
> + SequenceFile.Sorter.RawKeyValueIterator rawItr =
> + sorter.merge(new Path[]{path}, false, tmpDir);
> + @SuppressWarnings("unchecked") // WritableComparators are not generic
> + ReduceTask.ValuesIterator<Text,Text> valItr =
> + new ReduceTask.ValuesIterator<Text,Text>(rawItr,
> + WritableComparator.get(Text.class), Text.class, Text.class,
> + conf, new NullProgress());
> + while (valItr.more()) {
> + Object key = valItr.getKey();
> + String keyString = key.toString();
> + // make sure it matches!
> + assertEquals(vals[0].key, keyString);
> + // must have at least 1 value!
> + assertTrue(valItr.hasNext());
> + Text value1 = valItr.next();
> + // must have at least 2 values!
> + assertTrue(valItr.hasNext());
> + Text value2 = valItr.next();
> + // do test
> + assertNotSame(value1, value2);
> + assertTrue(!value1.equals(value2));
> + assertTrue(!value2.equals(value1));
> + // make sure the key hasn't changed under the hood
> + assertEquals(keyString, valItr.getKey().toString());
> + valItr.nextKey();
> + }
> + }
> +
> + public void testValueIterator2() throws Exception {
> + Path tmpDir = new Path("build/test/test.reduce.task");
> + Configuration conf = new Configuration();
> + Pair[] test = new Pair[]{ new Pair("k1", "v1"), new Pair("k1", "v2") };
> + runValueIterator2(tmpDir, test, conf);
> + }
> }
> {noformat}
> h2. A program with a bug
> I would imagine that a programmer would be really confused when everything is
> equal in the example below, for any text input:
> {noformat}
> package test;
> import java.io.*;
> import java.util.*;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.conf.*;
> import org.apache.hadoop.io.*;
> import org.apache.hadoop.util.*;
> import org.apache.hadoop.mapred.*;
> import org.apache.hadoop.mapred.lib.*;
> public class TestNewReducer extends Configured implements Tool {
> public int run(String[] args) throws Exception {
> if (args.length < 2) {
> System.out.println("TestNewReducer <inDir> <outDir>");
> return -1;
> }
> JobConf conf = new JobConf(getConf(), TestNewReducer.class);
> conf.setJobName("test-newreducer");
> conf.setInputFormat(TextInputFormat.class);
> conf.setMapperClass(Map.class);
> conf.setReducerClass(Reduce.class);
> conf.setOutputFormat(TextOutputFormat.class);
> conf.setOutputKeyClass(LongWritable.class);
> conf.setOutputValueClass(Text.class);
> FileInputFormat.setInputPaths(conf, new Path(args[0]));
> FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> JobClient.runJob(conf);
> return 0;
> }
> public static void main(String[] args) throws Exception {
> int res = ToolRunner.run(new Configuration(), new TestNewReducer(),
> args);
> System.exit(res);
> }
> public static class Map extends MapReduceBase implements Mapper
> <LongWritable, Text, LongWritable, Text> {
> private final static LongWritable nill = new LongWritable(1);
> public void map (LongWritable key, Text value,
> OutputCollector<LongWritable, Text> output,
> Reporter reporter) throws IOException {
> output.collect(nill, value);
> }
> }
> public static class Reduce extends MapReduceBase implements
> Reducer<LongWritable, Text, LongWritable, Text> {
> private final static LongWritable nill = new LongWritable(1);
> public void reduce(LongWritable key, Iterator<Text> values,
> OutputCollector<LongWritable, Text> output,
> Reporter reporter) throws IOException {
> Text t1 = null;
> if (values.hasNext())
> t1 = values.next(); // t1 is the first element in values
> while(values.hasNext()) {
> // is this element equal to t1?
> Text t = values.next();
> output.collect(nill, t.equals(t1) ? new Text("Equal") : new
> Text("Not equal"));
> }
> }
> }
> }
> {noformat}
> h2. Possible fixes
> # Return a new object each time for next(). This might have significant
> overhead.
> # Return a new object for unknown object types and reuse the same object for
> known types (like Text). Remove "if (this==anObject) return true;" check from
> all equals() methods for known objects.
> # Document clearly that all user-defined classes must implement an equals()
> method, which doesn't do the "if (this==anObject) return true;" check (ie.
> push the problem to the user).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.