[ https://issues.apache.org/jira/browse/HADOOP-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547105 ]
Hudson commented on HADOOP-2234:
--------------------------------

Integrated in Hadoop-Nightly #318 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/318/])

> [hbase] TableInputFormat erroneously aggregates map values
> ----------------------------------------------------------
>
>                 Key: HADOOP-2234
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2234
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Assignee: stack
>            Priority: Minor
>         Attachments: 2234.patch
>
>
> Edward Yoon reports the following phenomenon:
> Given a table:
> {code}
> row1 a: <aa> b: <bb> a:ca <aa2>
> row2 a: <aa3> b: <bb3>
> row3 a: <aa4> b: <bb4>
> {code}
> This map code:
> {code}
> public void map(WritableComparable key, Writable value,
>     OutputCollector output, Reporter reporter) throws IOException {
>   if (m_collector.collector == null) {
>     m_collector.collector = output;
>   }
>   HStoreKey hKey = (HStoreKey) key;
>   MapWritable newValue = (MapWritable) value;
>   newValue.put(new Text("row:" + hKey.getRow().toString()),
>       new ImmutableBytesWritable(hKey.getRow().toString().getBytes()));
>
>   Map<Text, String> log = new HashMap<Text, String>();
>   for (Map.Entry<Writable, Writable> e : newValue.entrySet()) {
>     log.put(e.getKey(), e.getValue()); // abbreviated code
>   }
>
>   LOG.info(log);
>   output.collect(hKey, newValue);
> }
> {code}
> ... produces the following:
> {code}
> 07/11/20 14:07:53 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 07/11/20 14:07:53 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 07/11/20 14:07:53 INFO mapred.MapTask: numReduceTasks: 1
> 07/11/20 14:07:53 INFO algebra.SortMap: {a:=aa, b:=bb, a:da=aa44, a:ca=aa2}
> 07/11/20 14:07:53 INFO algebra.SortMap: {a:=aa3, b:=bb3, a:da=aa44, a:ca=aa2}
> 07/11/20 14:07:53 INFO algebra.SortMap: {a:=aa4, b:=bb4, a:da=aa44, a:ca=aa2}
> 07/11/20 14:07:53 INFO mapred.LocalJobRunner:
> 07/11/20 14:07:53 INFO mapred.TaskRunner: Task 'map_0000' done.
> 07/11/20 14:07:53 INFO algebra.SortReduce: {a:=aa, b:=bb, a:da=aa44, a:ca=aa2}
> 07/11/20 14:07:53 INFO algebra.SortReduce: {a:=aa3, b:=bb3, a:da=aa44, a:ca=aa2}
> 07/11/20 14:07:53 INFO algebra.SortReduce: {a:=aa4, b:=bb4, a:da=aa44, a:ca=aa2}
> 07/11/20 14:07:53 INFO mapred.LocalJobRunner: reduce > reduce
> 07/11/20 14:07:53 INFO mapred.TaskRunner: Task 'reduce_9ji2mr' done.
> {code}
> Notice how content from the first row is still present when the second
> and third rows are output.
> The problem is in TIF (TableInputFormat): after calling scanner.next, it
> copies the scanner.next result into the passed-in MapWritable value
> (converting from TreeMap to MapWritable). It resets the TreeMap passed to
> scanner.next on each call, but it never resets the passed-in MapWritable,
> so entries from earlier rows accumulate.
> There is a similar problem in the reduce, where the outputter is collecting
> values together (see the log above). Need to figure out what's going on
> there. Below is the reduce code:
> {code}
> while (values.hasNext()) {
>   MapWritable data = (MapWritable) values.next();
>   Map<String, String> log = new HashMap<String, String>();
>   for (Map.Entry<Writable, Writable> e : data.entrySet()) {
>     log.put(e.getKey().toString(),
>         new String(((ImmutableBytesWritable) e.getValue()).get()));
>   }
>   LOG.info(log);
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
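
The aliasing described in the issue (one output value reused for every row, filled after each scanner.next but never cleared) can be sketched without Hadoop or HBase. This is only an illustration under stated assumptions, not the code in 2234.patch: plain java.util maps stand in for the TreeMap and MapWritable, and the names ReusedValueDemo, sampleTable and readAll are hypothetical. The sample rows are the three rows of the table quoted in the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the reader loop; not HBase code.
class ReusedValueDemo {

    // The three rows from the table in the issue, as column -> value maps.
    static List<Map<String, String>> sampleTable() {
        List<Map<String, String>> table = new ArrayList<>();
        Map<String, String> row1 = new TreeMap<>();
        row1.put("a:", "aa");
        row1.put("b:", "bb");
        row1.put("a:ca", "aa2");
        Map<String, String> row2 = new TreeMap<>();
        row2.put("a:", "aa3");
        row2.put("b:", "bb3");
        Map<String, String> row3 = new TreeMap<>();
        row3.put("a:", "aa4");
        row3.put("b:", "bb4");
        table.add(row1);
        table.add(row2);
        table.add(row3);
        return table;
    }

    // Copies every row into ONE reused map (the role of the passed-in value)
    // and records what a consumer would see after each copy.
    // With clearFirst == false, columns from earlier rows stay behind.
    static List<String> readAll(List<Map<String, String>> table, boolean clearFirst) {
        Map<String, String> value = new TreeMap<>(); // reused across rows
        List<String> seen = new ArrayList<>();
        for (Map<String, String> row : table) {
            if (clearFirst) {
                value.clear(); // reset the reused value before refilling it
            }
            value.putAll(row);
            seen.add(value.toString());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("without clear: " + readAll(sampleTable(), false));
        System.out.println("with clear:    " + readAll(sampleTable(), true));
    }
}
```

Without the clear, the a:ca column of row1 is still visible when row2 and row3 are read, which matches the shape of the SortMap log lines above. Clearing the reused map before each copy is one way to avoid that; allocating a fresh map per row would be another, at the cost of extra allocation.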