[jira] Resolved: (HADOOP-3522) ValuesIterator.next() doesn't return a new object, thus failing many equals() tests.

Owen O'Malley (JIRA) Tue, 10 Jun 2008 09:59:45 -0700

     [ 
https://issues.apache.org/jira/browse/HADOOP-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Owen O'Malley resolved HADOOP-3522.
-----------------------------------

    Resolution: Won't Fix

This is working as designed. 

   1.  Return a new object each time for next(). This might have significant 
overhead.
   2. Return a new object for unknown object types and reuse the same object 
for known types (like Text). Remove "if (this==anObject) return true;" check 
from all equals() methods for known objects.
   3. Document clearly that all user-defined classes must implement an equals() 
method, which doesn't do the "if (this==anObject) return true;" check (ie. push 
the problem to the user).


 Description    
   
Problem
The ValuesIterator.next() doesn't return a new object representing the next 
element. It is expected that this is the case. Example from java.lang.String 
source code:

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        // custom code here
    }
}

Changing the private fields of the object and returning it from next() will 
make this object fail the equals() test, because this==anObject will always be 
true.
What is affected
The reduce() method presents (a subclass of) ValuesIterator to the user. So 
every user-defined class can be affected. Manifestations of this bug in 0.17.0:

    * The Text class checks for equality similarly to String.equals(), which is 
shown above.
    * The contrib/data_join breaks because it stores tags in a Map. The 
behavior of the next() method makes Object.equals() be true for all tags.

JUnit test
Patch against hadoop-0.17.0:

Index: src/test/org/apache/hadoop/mapred/TestReduceTask.java
===================================================================
--- src/test/org/apache/hadoop/mapred/TestReduceTask.java       (revision 
665935)
+++ src/test/org/apache/hadoop/mapred/TestReduceTask.java       (working copy)
@@ -122,4 +122,52 @@
       runValueIterator(tmpDir, testCase, conf);
     }
   }
+
+  public void runValueIterator2(Path tmpDir, Pair[] vals, 
+                               Configuration conf) throws IOException {
+    FileSystem fs = tmpDir.getFileSystem(conf);
+    Path path = new Path(tmpDir, "data.in");
+    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
+                                                         Text.class, 
+                                                         Text.class);
+    for(Pair p: vals) {
+      writer.append(new Text(p.key), new Text(p.value));
+    }
+    writer.close();
+    SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs, Text.class, 
+                                                         Text.class, conf);
+    SequenceFile.Sorter.RawKeyValueIterator rawItr = 
+      sorter.merge(new Path[]{path}, false, tmpDir);
+    @SuppressWarnings("unchecked") // WritableComparators are not generic
+    ReduceTask.ValuesIterator<Text,Text> valItr = 
+      new ReduceTask.ValuesIterator<Text,Text>(rawItr,
+          WritableComparator.get(Text.class), Text.class, Text.class,
+          conf, new NullProgress());
+    while (valItr.more()) {
+      Object key = valItr.getKey();
+      String keyString = key.toString();
+      // make sure it matches!
+      assertEquals(vals[0].key, keyString);
+      // must have at least 1 value!
+      assertTrue(valItr.hasNext());
+      Text value1 = valItr.next();
+      // must have at least 2 values!
+      assertTrue(valItr.hasNext());
+      Text value2 = valItr.next();
+      // do test
+      assertNotSame(value1, value2);
+      assertTrue(!value1.equals(value2));
+      assertTrue(!value2.equals(value1));
+      // make sure the key hasn't changed under the hood
+      assertEquals(keyString, valItr.getKey().toString());
+      valItr.nextKey();
+    }
+  }
+
+  public void testValueIterator2() throws Exception {
+    Path tmpDir = new Path("build/test/test.reduce.task");
+    Configuration conf = new Configuration();
+    Pair[] test = new Pair[]{ new Pair("k1", "v1"), new Pair("k1", "v2") };
+    runValueIterator2(tmpDir, test, conf);
+  }
 }

A program with a bug
I would imagine that a programmer would be really confused when everything is 
equal in the example below, for any text input:

package test;

import java.io.*;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;

public class TestNewReducer extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("TestNewReducer <inDir> <outDir>");
            return -1;
        }

        JobConf conf = new JobConf(getConf(), TestNewReducer.class);
        conf.setJobName("test-newreducer");

        conf.setInputFormat(TextInputFormat.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TestNewReducer(), 
args);
        System.exit(res);
    }

    public static class Map extends MapReduceBase implements Mapper 
<LongWritable, Text, LongWritable, Text> {
        private final static LongWritable nill = new LongWritable(1);

        public void map (LongWritable key, Text value,
                OutputCollector<LongWritable, Text> output,
                Reporter reporter) throws IOException {
            output.collect(nill, value);
        }
    }

    public static class Reduce extends MapReduceBase implements 
Reducer<LongWritable, Text, LongWritable, Text> {
        private final static LongWritable nill = new LongWritable(1);

        public void reduce(LongWritable key, Iterator<Text> values,
                OutputCollector<LongWritable, Text> output,
                Reporter reporter) throws IOException {
            Text t1 = null;
            if (values.hasNext())
                t1 = values.next(); // t1 is the first element in values

            while(values.hasNext()) {
                // is this element equal to t1?
                Text t = values.next();
                output.collect(nill, t.equals(t1) ? new Text("Equal") : new 
Text("Not equal"));
            }
        }
    }
}

{quote}
Possible fixes
   1. Return a new object each time for next(). This might have significant 
overhead.
{quote}
This is the problem that was fixed. Forcing the inner loops of map/reduce to 
allocate objects is a huge cost.

{quote}
   2. Return a new object for unknown object types and reuse the same object 
for known types (like Text). Remove "if (this==anObject) return true;" check 
from all equals() methods for known objects.
{quote}
No, this will make the semantics much more confusing.

{quote}
   3. Document clearly that all user-defined classes must implement an equals() 
method, which doesn't do the "if (this==anObject) return true;" check (ie. push 
the problem to the user).
{quote}
I don't understand your point. *Any* equals function must return true if this 
== anObject.

It *should* be documented better in the reduce interface that the iterator 
returns the same (modified) value each time.

> ValuesIterator.next() doesn't return a new object, thus failing many equals() 
> tests.
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3522
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3522
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>            Reporter: Spyros Blanas
>
> h2.Problem
> The ValuesIterator.next() doesn't return a new object representing the next 
> element. 
> It is expected that this is the case. Example from java.lang.String source 
> code:
> {noformat}
> public boolean equals(Object anObject) {
>     if (this == anObject) {
>         return true;
>     }
>     if (anObject instanceof String) {
>         // custom code here
>     }
> }
> {noformat}
> Changing the private fields of the object and returning it from next() will 
> make this object fail the equals() test, because this==anObject will always 
> be true.
> h2. What is affected
> The reduce() method presents (a subclass of) ValuesIterator to the user. So 
> every user-defined class can be affected.
> Manifestations of this bug in 0.17.0:
> * The Text class checks for equality similarly to String.equals(), which is 
> shown above.
> * The contrib/data_join breaks because it stores tags in a Map. The behavior 
> of the next() method makes Object.equals() be true for all tags.
> h2. JUnit test
> Patch against hadoop-0.17.0:
> {noformat}
> Index: src/test/org/apache/hadoop/mapred/TestReduceTask.java
> ===================================================================
> --- src/test/org/apache/hadoop/mapred/TestReduceTask.java       (revision 
> 665935)
> +++ src/test/org/apache/hadoop/mapred/TestReduceTask.java       (working copy)
> @@ -122,4 +122,52 @@
>        runValueIterator(tmpDir, testCase, conf);
>      }
>    }
> +
> +  public void runValueIterator2(Path tmpDir, Pair[] vals, 
> +                               Configuration conf) throws IOException {
> +    FileSystem fs = tmpDir.getFileSystem(conf);
> +    Path path = new Path(tmpDir, "data.in");
> +    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path,
> +                                                         Text.class, 
> +                                                         Text.class);
> +    for(Pair p: vals) {
> +      writer.append(new Text(p.key), new Text(p.value));
> +    }
> +    writer.close();
> +    SequenceFile.Sorter sorter = new SequenceFile.Sorter(fs, Text.class, 
> +                                                         Text.class, conf);
> +    SequenceFile.Sorter.RawKeyValueIterator rawItr = 
> +      sorter.merge(new Path[]{path}, false, tmpDir);
> +    @SuppressWarnings("unchecked") // WritableComparators are not generic
> +    ReduceTask.ValuesIterator<Text,Text> valItr = 
> +      new ReduceTask.ValuesIterator<Text,Text>(rawItr,
> +          WritableComparator.get(Text.class), Text.class, Text.class,
> +          conf, new NullProgress());
> +    while (valItr.more()) {
> +      Object key = valItr.getKey();
> +      String keyString = key.toString();
> +      // make sure it matches!
> +      assertEquals(vals[0].key, keyString);
> +      // must have at least 1 value!
> +      assertTrue(valItr.hasNext());
> +      Text value1 = valItr.next();
> +      // must have at least 2 values!
> +      assertTrue(valItr.hasNext());
> +      Text value2 = valItr.next();
> +      // do test
> +      assertNotSame(value1, value2);
> +      assertTrue(!value1.equals(value2));
> +      assertTrue(!value2.equals(value1));
> +      // make sure the key hasn't changed under the hood
> +      assertEquals(keyString, valItr.getKey().toString());
> +      valItr.nextKey();
> +    }
> +  }
> +
> +  public void testValueIterator2() throws Exception {
> +    Path tmpDir = new Path("build/test/test.reduce.task");
> +    Configuration conf = new Configuration();
> +    Pair[] test = new Pair[]{ new Pair("k1", "v1"), new Pair("k1", "v2") };
> +    runValueIterator2(tmpDir, test, conf);
> +  }
>  }
> {noformat}
> h2. A program with a bug
> I would imagine that a programmer would be really confused when everything is 
> equal in the example below, for any text input:
> {noformat}
> package test;
> import java.io.*;
> import java.util.*;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.conf.*;
> import org.apache.hadoop.io.*;
> import org.apache.hadoop.util.*;
> import org.apache.hadoop.mapred.*;
> import org.apache.hadoop.mapred.lib.*;
> public class TestNewReducer extends Configured implements Tool {
>     public int run(String[] args) throws Exception {
>         if (args.length < 2) {
>             System.out.println("TestNewReducer <inDir> <outDir>");
>             return -1;
>         }
>         JobConf conf = new JobConf(getConf(), TestNewReducer.class);
>         conf.setJobName("test-newreducer");
>         conf.setInputFormat(TextInputFormat.class);
>         conf.setMapperClass(Map.class);
>         conf.setReducerClass(Reduce.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>         conf.setOutputKeyClass(LongWritable.class);
>         conf.setOutputValueClass(Text.class);
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>         JobClient.runJob(conf);
>         return 0;
>     }
>     public static void main(String[] args) throws Exception {
>         int res = ToolRunner.run(new Configuration(), new TestNewReducer(), 
> args);
>         System.exit(res);
>     }
>     public static class Map extends MapReduceBase implements Mapper 
> <LongWritable, Text, LongWritable, Text> {
>         private final static LongWritable nill = new LongWritable(1);
>         public void map (LongWritable key, Text value,
>                 OutputCollector<LongWritable, Text> output,
>                 Reporter reporter) throws IOException {
>             output.collect(nill, value);
>         }
>     }
>     public static class Reduce extends MapReduceBase implements 
> Reducer<LongWritable, Text, LongWritable, Text> {
>         private final static LongWritable nill = new LongWritable(1);
>         public void reduce(LongWritable key, Iterator<Text> values,
>                 OutputCollector<LongWritable, Text> output,
>                 Reporter reporter) throws IOException {
>             Text t1 = null;
>             if (values.hasNext())
>                 t1 = values.next(); // t1 is the first element in values
>             while(values.hasNext()) {
>                 // is this element equal to t1?
>                 Text t = values.next();
>                 output.collect(nill, t.equals(t1) ? new Text("Equal") : new 
> Text("Not equal"));
>             }
>         }
>     }
> }
> {noformat}
> h2. Possible fixes
> # Return a new object each time for next(). This might have significant 
> overhead.
> # Return a new object for unknown object types and reuse the same object for 
> known types (like Text). Remove "if (this==anObject) return true;" check from 
> all equals() methods for known objects.
> # Document clearly that all user-defined classes must implement an equals() 
> method, which doesn't do the "if (this==anObject) return true;" check (ie. 
> push the problem to the user).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (HADOOP-3522) ValuesIterator.next() doesn't return a new object, thus failing many equals() tests.

Reply via email to