Re: Problem with key aggregation when number of reduce tasks is more than 1

2008-04-11 Thread Zhang, jian
Hi,

Please read the excerpt below; you need to implement a Partitioner.
It controls which key is sent to which reducer. If you want a unique
result per key, you need to implement a Partitioner, and your compareTo()
function should work properly.
[WIKI]
Partitioner

Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate 
map-outputs. The key (or a subset of the key) is used to derive the partition, 
typically by a hash function. The total number of partitions is the same as the 
number of reduce tasks for the job. Hence this controls which of the m reduce 
tasks the intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.
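For illustration, a minimal custom Partitioner sketch against the old
org.apache.hadoop.mapred API (the class name and the Text/IntWritable
key/value types here are hypothetical; substitute your own key class). It
mirrors what the default HashPartitioner does with the key's hashCode():

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Hypothetical partitioner: derives the partition from the whole key's
  // hash, which is essentially what HashPartitioner already does.
  public class MyPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf job) {
      // Nothing to configure in this sketch.
    }

    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // Mask the sign bit so the partition number is never negative.
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

It is registered on the job with conf.setPartitionerClass(MyPartitioner.class).
Whatever fields getPartition() hashes must be identical for keys that
compareTo() considers equal, otherwise equal keys are sent to different
reduce tasks.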



Best Regards

Jian Zhang


-----Original Message-----
From: Harish Mallipeddi [mailto:[EMAIL PROTECTED] 
Sent: April 11, 2008 19:06
To: core-user@hadoop.apache.org
Subject: Problem with key aggregation when number of reduce tasks is more than 1

Hi all,

I wrote a custom key class (implements WritableComparable) and implemented
the compareTo() method inside this class. Everything works fine when I run
the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
correctly in the output files.

But when I increase the number of reduce tasks, keys don't get aggregated
properly; the same keys seem to end up in separate output files
(output/part-0, output/part-1, etc.). Shouldn't all (k,v) pairs with the
same 'k' from all the map outputs be aggregated right before reduce() gets
called, so that the reduce function just iterates over the values
(v1, v2, etc.)?

Do I need to implement anything else inside my custom key class other than
compareTo? I also tried implementing equals() but that didn't help either.
Then I came across setOutputKeyComparator(). So I added a custom Comparator
class inside the key class and tried setting this on the JobConf object. But
that didn't work either. What could be wrong?
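For context: the default HashPartitioner picks the reduce task from the
key's hashCode(), so a custom WritableComparable key also needs a hashCode()
that is consistent with compareTo()/equals(). A minimal sketch of such a key,
with a single hypothetical Text field, might look like this:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;

  // Hypothetical key with one Text field; substitute your own fields.
  public class MyKey implements WritableComparable {
    private Text name = new Text();

    public void write(DataOutput out) throws IOException {
      name.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      name.readFields(in);
    }

    public int compareTo(Object o) {
      return name.compareTo(((MyKey) o).name);
    }

    // HashPartitioner uses hashCode() to choose the reduce task, so keys
    // that compare equal must also hash equal, or they end up in
    // different partitions (and hence different part-* output files).
    public int hashCode() {
      return name.hashCode();
    }

    public boolean equals(Object o) {
      return (o instanceof MyKey) && name.equals(((MyKey) o).name);
    }
  }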

Cheers,

-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/


What's the proper way to use hadoop task side-effect files?

2008-04-10 Thread Zhang, jian
Hi,

I am new to Hadoop, so sorry for my novice question.
I ran into a problem while trying to use task side-effect files.
Since there is no code example in the wiki, I tried it this way:

I overrode the configure() method in my reducer to create a side file:

 // 'out' is a field on the reducer class (an FSDataOutputStream),
 // used later inside reduce().
 public void configure(JobConf conf) {
   logger.info("Trying to create side file inside reducer.");

   // Inside a running task, conf.getOutputPath() points at the task's
   // temporary output directory (.../_temporary/_task_.../).
   Path workPath = conf.getOutputPath();
   Path sideFile = new Path(workPath, "SideFile.txt");
   try {
     FileSystem fs = FileSystem.get(conf);
     out = fs.create(sideFile);
   } catch (IOException e) {
     logger.error("Failed to create side file!", e);
   }
 }
And then I try to use it in the reducer.

But I got some strange problems: even though the method is in the reducer
class, the map tasks are also creating the side files, and the map tasks
hang because they are trying to recreate the file.

org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /data/input/MID06/_temporary/_task_200804112315_0001_m_08_0/SideFile.txt for DFSClient_task_200804112315_0001_m_08_0 on client 192.168.0.203 because current leaseholder is trying to recreate file.
 at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:974)
 at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:931)
 at org.apache.hadoop.dfs.NameNode.create(NameNode.java:281)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:899)


Can anybody help me with this? What is the proper way to use side-effect files?
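One possible workaround, as a sketch only (not a confirmed fix): guard the
side-file creation so it only runs in reduce task attempts. This reuses the
out and logger fields from the snippet above and assumes the localized job
conf exposes the task attempt id under "mapred.task.id" (the same ids that
appear in the stack trace, where "_m_" marks a map attempt and "_r_" a
reduce attempt):

  public void configure(JobConf conf) {
    // Skip side-file creation unless this is a reduce task attempt.
    String attemptId = conf.get("mapred.task.id", "");
    if (attemptId.indexOf("_r_") < 0) {
      // Running inside a map task (for example if this class is also
      // used as the combiner), so do not create the side file here.
      return;
    }
    Path sideFile = new Path(conf.getOutputPath(), "SideFile.txt");
    try {
      FileSystem fs = FileSystem.get(conf);
      out = fs.create(sideFile);
    } catch (IOException e) {
      logger.error("Failed to create side file", e);
    }
  }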



Best Regards

Jian Zhang



Questions about namenode and JobTracker configuration.

2008-02-21 Thread Zhang, jian
Hi, All

 

I have a small question about configuration.

 

The Hadoop documentation page says: 

" Typically you choose one machine in the cluster to act as the NameNode
and one machine as to act as the JobTracker, exclusively. The rest of
the machines act as both a DataNode and TaskTracker and are referred to
as slaves."

 

Does that mean the JobTracker, like the NameNode, is not a slave?

 

The NameNode and the DataNodes form HDFS. Since the JobTracker needs to
interact with the TaskTrackers, which run on the HDFS (DataNode) machines,
I think it should be at least part of the HDFS cluster to make the
communication easier.
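For illustration (hypothetical hostnames): HDFS and MapReduce have separate
master daemons configured by separate properties, which is why the NameNode
and the JobTracker can live on different machines even though every slave
runs both a DataNode and a TaskTracker. A minimal sketch:

  import org.apache.hadoop.mapred.JobConf;

  public class ClusterAddresses {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      // The HDFS master (NameNode) and the MapReduce master (JobTracker)
      // are addressed by independent properties.
      conf.set("fs.default.name", "hdfs://namenode-host:9000");
      conf.set("mapred.job.tracker", "jobtracker-host:9001");
      System.out.println("HDFS master      = " + conf.get("fs.default.name"));
      System.out.println("MapReduce master = " + conf.get("mapred.job.tracker"));
    }
  }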

 

Best Regards

 

Jian Zhang