namenode replication

2008-06-30 Thread Vibhooti Verma
I have set my property as follows:

<property>
  <name>dfs.name.dir</name>
  <value>/apollo/env/TVHadoopCluster/var/tmp/hadoop/dfs/name,/local/namenode</value>
  <description>Determines where on the local filesystem the DFS name node
  should store the name table.  If this is a comma-delimited list
  of directories then the name table is replicated in all of the
  directories, for redundancy.</description>
</property>



When I start my DFS after that, it does not find all of the directory
structure and hence can't start the namenode. Has anyone tried this before?
Please let me know if I have to create the entire structure manually.

Regards,
Vibhooti

-- 
cheers,
Vibhooti


Data-local tasks

2008-06-30 Thread Saptarshi Guha

Hello,
	I recall asking this question, but this is in addition to what I've
asked.

Firstly, to recap my question and Arun's specific response:

--  On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote:  Hello,
--	Does the Data-local map tasks counter mean the number of tasks
that had the input data already present on the machine they
are running on?

--  i.e. there wasn't a need to ship the data to them.

Response from Arun
--	Yes. Your understanding is correct. More specifically it means that
the map-task got scheduled on a machine on which one of the
--	replicas of its input-split-block was present and was served by
the datanode running on that machine. *smile* Arun



	Now, is Hadoop designed to schedule a map task on a machine which has
one of the replicas of its input split block?
	Failing that, does it then assign the map task to a machine close to one
that contains a replica of its input split block?

Are there any performance metrics for this?

Many thanks
Saptarshi


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha





reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
Hi!

I'm running streaming tasks on hadoop 0.17.0, and wondered if anyone has an
approach to debugging the following situation:

-) the maps have all finished (100% in the http display),
-) some reducers are hanging, with the messages below.

Notice that the job had 100 map tasks in all, so 58 seems like an
extraordinarily high number of missing parts, long after the map phase has
officially finished. Plus it seems to be deterministic: it always stops with 3
reduce parts not finishing, although I haven't yet checked whether they are
always the same errors or not.

 2008-06-30 15:25:41,953 INFO org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0 Need 58 map output(s)
 2008-06-30 15:25:41,953 INFO org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-06-30 15:25:41,954 INFO org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0 Got 0 known map output location(s); scheduling...
 2008-06-30 15:25:41,954 INFO org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
 2008-06-30 15:25:46,770 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread done
 2008-06-30 15:25:46,963 INFO org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0 Need 58 map output(s)
 2008-06-30 15:25:46,963 INFO org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
 2008-06-30 15:25:46,964 INFO org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0 Got 0 known map output location(s); scheduling...
 2008-06-30 15:25:46,964 INFO org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)

TIA,

Andreas




Re: RecordReader Functionality

2008-06-30 Thread Jorgen Johnson
Hi Sean,

Perhaps I'm missing something, but it doesn't appear to me that you're
actually seeking to the filesplit start position in your constructor...

This would explain why all the mappers are getting the same records.

-jorgenj
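
For concreteness, a minimal sketch of the change being suggested, with the
field names taken from Sean's constructor quoted below (the compression-codec
lines are omitted):

    public TrainingRecordReader(Configuration job, FileSplit split) throws IOException
    {
        start = split.getStart();
        end = start + split.getLength();
        final Path file = split.getPath();

        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);
        fileIn.seek(start);              // without this, every split reads from byte 0
        in = new TrainingReader(fileIn, job);
        this.pos = start;
    }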

On Mon, Jun 30, 2008 at 9:22 AM, Sean Arietta [EMAIL PROTECTED] wrote:


 Hello all,
 I am having a problem writing my own RecordReader. The basic setup I have is
 a large byte array that needs to be diced up into key value pairs such that
 the key is the index into the array and the value is a byte array itself
 (albeit much smaller). Here is the code that I currently have written to do
 just this:


 // This method is just the constructor for my new RecordReader
 public TrainingRecordReader(Configuration job, FileSplit split) throws IOException
 {
     start = split.getStart();
     end = start + split.getLength();
     final Path file = split.getPath();
     compressionCodecs = new CompressionCodecFactory(job);
     final CompressionCodec codec = compressionCodecs.getCodec(file);

     // open the file and seek to the start of the split
     FileSystem fs = file.getFileSystem(job);
     FSDataInputStream fileIn = fs.open(split.getPath());
     in = new TrainingReader(fileIn, job);
     this.pos = start;
 }

 // This returns the key, value pair I was talking about
 public synchronized boolean next(LongWritable key, BytesWritable value) throws IOException
 {
     if (pos >= end)
         return false;

     key.set(pos);   // key is position
     int newSize = in.readVector(value);
     if (newSize > 0)
     {
         pos += newSize;
         return true;
     }
     return false;
 }

 // This extracts that smaller byte array from the large input file
 public int readVector(BytesWritable value) throws IOException
 {
     int numBytes = in.read(buffer);
     value.set(buffer, 0, numBytes);
     return numBytes;
 }

 So all of this worked just fine when I set conf.set("mapred.job.tracker",
 "local"), but now that I am attempting to test in a fake distributed setting
 (aka still one node, but I haven't set the above config param), I do not get
 what I want. Instead of getting unique key value pairs, I get repeated key
 value pairs based on the number of map tasks I have set. So, say my large
 file contained 49 entries, I would want a unique key value pair for each of
 those, but if I set my numMapTasks to 7, I get 7 unique ones that repeat
 every 7 key value pairs.

 So it seems that each MapTask, which ultimately calls my
 TrainingReader.next() method from above, is somehow pointing to the same
 FileSplit. I know that the LineRecordReader in the source has a small routine
 that skips the first line of the data if you aren't at the beginning... Is
 that related? Why isn't split.getStart() returning the absolute offset of the
 start of the split? So many questions I don't know the answer to, haha.

 I would appreciate anyone's help in resolving this issue. Thanks very much!

 Cheers,
 Sean M. Arietta
 --
 View this message in context:
 http://www.nabble.com/RecordReader-Functionality-tp18199187p18199187.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Liberties are not given, they are taken.
- Aldous Huxley


Re: reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
Another observation, the TaskTracker$Child was alive, and the reduce script 
has hung on read(0, ) :(

Andreas




Re: joins in map reduce

2008-06-30 Thread Jason Venner

I have just started to try using the Join operators.

The join I am trying is this:

join is outer(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat,Input1),tbl(org.apache.hadoop.mapred.SequenceFileInputFormat,IndexedTry1))

but I get an error:

08/06/30 08:55:13 INFO mapred.FileInputFormat: Total input paths to process : 10
Exception in thread "main" java.io.IOException: No input paths specified in input
   at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:115)
   at org.apache.hadoop.mapred.join.Parser$WNode.getSplits(Parser.java:304)
   at org.apache.hadoop.mapred.join.Parser$CNode.getSplits(Parser.java:375)
   at org.apache.hadoop.mapred.join.CompositeInputFormat.getSplits(CompositeInputFormat.java:131)
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:544)

I am clearly missing something basic...

   conf.setInputFormat(CompositeInputFormat.class);
   conf.setOutputPath( outputDirectory );
   conf.setOutputKeyClass(Text.class);
   conf.setOutputValueClass(Text.class);
   conf.setOutputFormat(MapFileOutputFormat.class);
   conf.setMapperClass( LeftHandJoinMapper.class );
   conf.setReducerClass( IdentityReducer.class );
   conf.setNumReduceTasks(0);

   System.err.println( "join is " +
       CompositeInputFormat.compose("outer", SequenceFileInputFormat.class, allTables) );
   conf.set("mapred.join.expr",
       CompositeInputFormat.compose("outer", SequenceFileInputFormat.class, allTables));

   JobClient client = new JobClient();

   client.setConf( conf );

   RunningJob job = JobClient.runJob( conf );



Shirley Cohen wrote:

Hi,

How does one do a join operation in map reduce? Is there more than one 
way to do a join? Which way works better and why?


Thanks,

Shirley

--
Jason Venner
Attributor - Program the Web http://www.attributor.com/
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested
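
Since Shirley's question is quoted above: there are broadly two approaches. The
map-side join Jason is attempting (CompositeInputFormat plus mapred.join.expr)
needs all inputs sorted and partitioned identically. The more general fallback
is a reduce-side join; a rough sketch with the old API follows, where every
class name is hypothetical and only table "A"'s mapper is shown (table "B" gets
a mapper that tags with "B" the same way; imports from java.util,
org.apache.hadoop.io and org.apache.hadoop.mapred are omitted):

    public static class TagAMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            // assume "joinKey <tab> rest-of-record" lines; tag the value with its source table
            String[] parts = line.toString().split("\t", 2);
            String rest = parts.length > 1 ? parts[1] : "";
            out.collect(new Text(parts[0]), new Text("A\t" + rest));
        }
    }

    public static class JoinReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            List<String> a = new ArrayList<String>();
            List<String> b = new ArrayList<String>();
            while (values.hasNext()) {
                String v = values.next().toString();
                if (v.startsWith("A\t")) a.add(v.substring(2)); else b.add(v.substring(2));
            }
            for (String left : a)                     // inner join: cross product per key
                for (String right : b)
                    out.collect(key, new Text(left + "\t" + right));
        }
    }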


RE: RecordReader Functionality

2008-06-30 Thread Runping Qi

Your record reader must be able to find the beginning of the next record
beyond the start position of a given split; in other words, your file format
must make that record boundary detectable. It seems to me that is not possible
based on the info I have seen so far.
Why not just use SequenceFile instead?

Runping
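
A rough sketch of that suggestion: write the vectors out once as a SequenceFile
of (position, vector) pairs, then let the stock SequenceFileInputFormat handle
the splits, since SequenceFile records carry their own boundaries. VECTOR_SIZE,
the output path and readNextVector() below are placeholders:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/user/sean/training.seq");          // placeholder path
    SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, BytesWritable.class);
    try {
        long pos = 0;
        byte[] vector = new byte[VECTOR_SIZE];               // placeholder record size
        while (readNextVector(vector)) {                     // placeholder: fill the buffer
            writer.append(new LongWritable(pos), new BytesWritable(vector));
            pos += vector.length;
        }
    } finally {
        writer.close();
    }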


 -Original Message-
 From: Sean Arietta [mailto:[EMAIL PROTECTED]
 Sent: Monday, June 30, 2008 10:29 AM
 To: core-user@hadoop.apache.org
 Subject: Re: RecordReader Functionality


 I thought that the InStream buffer (called 'in' in this case) would maintain
 the stream position based on how many bytes I had 'read' via in.read().
 Maybe this is not the case...

 Would it then be proper to call:

 in.seek(pos);

 I believe I tried this at one point and I got an error. I will try again to
 be sure though. Thanks for your reply!

 Cheers,
 Sean
 



Parameterized InputFormats

2008-06-30 Thread Nathan Marz

Hello,

Are there any plans to change the JobConf API so that it takes an
instance of an InputFormat rather than the InputFormat class? I am
finding the inability to properly parameterize my InputFormats to be
very restrictive. What is the reasoning behind having the class as a
parameter rather than an instance?


-Nathan Marz
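
One workaround that is often enough (a sketch, not a statement about any plans):
the framework instantiates the InputFormat class through reflection and calls
configure() on anything implementing JobConfigurable, so parameters can travel
through the JobConf instead of a constructor. The class name and property name
below are made up:

    // MyInputFormat reads its parameter from the JobConf at configure() time.
    public class MyInputFormat extends TextInputFormat {
        private int chunkSize;

        public void configure(JobConf conf) {
            super.configure(conf);                                   // keep TextInputFormat's setup
            chunkSize = conf.getInt("my.inputformat.chunk.size", 64 * 1024);
            // ... use chunkSize when building splits / record readers ...
        }
    }

    // In the driver:
    //   conf.setInputFormat(MyInputFormat.class);
    //   conf.setInt("my.inputformat.chunk.size", 1 << 20);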


problem when many map tasks are used (since 0.17.1 was installed)

2008-06-30 Thread Ashish Venugopal
The crash below occurs when I run many mappers (-jobconf mapred.map.tasks=200).
It does not occur if I set mapred.map.tasks=1, even when I allocate many
machines (causing there to be many mappers). But when I set the number of map
tasks to 200, the error below happens. This just started happening after the
recent upgrade to 0.17.1 (previously I was using 0.16.4). This is a streaming
job. Any help is appreciated.


Ashish

Exception closing file /user/ashishv/iwslt/syn_baseline/translation_dev/_temporary/_task_200806272233_0001_m_000174_0/part-00174
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete write to file /user/ashishv/iwslt/syn_baseline/translation_dev/_temporary/_task_200806272233_0001_m_000174_0/part-00174 by DFSClient_task_200806272233_0001_m_000174_0
        at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:332)
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

        at org.apache.hadoop.ipc.Client.call(Client.java:557)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2655)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2576)
        at org.apache.hadoop.dfs.DFSClient.close(DFSClient.java:221)


RE: Hadoop - is it good for me and performance question

2008-06-30 Thread Haijun Cao
http://www.mail-archive.com/core-user@hadoop.apache.org/msg02906.html


-Original Message-
From: yair gotdanker [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 29, 2008 4:46 AM
To: core-user@hadoop.apache.org
Subject: Hadoop - is it good for me and performance question

Hello all,



I am a newbie to Hadoop. The technology seems very interesting, but I am not
sure it suits my needs. I would really appreciate your feedback.

The problem:

I have multiple log servers, each receiving 10-100 MB/minute. The received
data is processed to produce aggregated data.
The data processing should take a few minutes at most (10 min).

In addition, I did a performance benchmark with the wordcount example
provided by the quickstart tutorial on my PC (pseudo-distributed, using the
quickstart configuration file) and it took about 40 seconds!
I must be doing something wrong here, since 40 seconds is way too long!
The map/reduce functions should be very fast since there is almost no
processing done, so I guess most of the time is spent in the Hadoop framework.

I would appreciate any help in understanding this and how I can increase the
performance.
btw:
Does anyone know of a good behind-the-scenes tutorial that explains how the
jobtracker and tasktrackers communicate, and so on?


Re: Too many fetch failures AND Shuffle error

2008-06-30 Thread Tarandeep Singh
I am getting this error as well.
As Sayali mentioned in his mail, I updated the /etc/hosts file with the
slave machines' IP addresses, but I am still getting this error.

Amar, which is the URL that you were talking about in your mail -
"There will be a URL associated with a map that the reducer try to fetch
(check the reducer logs for this url)"?

Please tell me where I should look for it... I will try to access it
manually to see if this error is due to a firewall.

Thanks,
Taran

On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat [EMAIL PROTECTED] wrote:

 Yeah. With 2 nodes the reducers will go up to 16% because the reducer are
 able to fetch maps from the same machine (locally) but fails to copy it from
 the remote machine. A common reason in such cases is the *restricted machine
 access* (firewall etc). The web-server on a machine/node hosts map outputs
 which the reducers on the other machine are not able to access. There will
 be a URL associated with a map that the reducer try to fetch (check the
 reducer logs for this url). Just try accessing it manually from the
 reducer's machine/node. Most likely this experiment should also fail. Let us
 know if this is not the case.
 Amar

 Sayali Kulkarni wrote:

 Can you post the reducer logs. How many nodes are there in the cluster?


 There are 6 nodes in the cluster - 1 master and 5 slaves
  I tried to reduce the number of nodes, and found that the problem is
 solved only if there is a single node in the cluster. So I can deduce that
 the problem is there in some configuration.

 Configuration file:
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>

 <property>
   <name>hadoop.tmp.dir</name>
   <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
   <description>A base for other temporary directories.</description>
 </property>

 <property>
   <name>fs.default.name</name>
   <value>hdfs://10.105.41.25:54310</value>
   <description>The name of the default file system.  A URI whose
   scheme and authority determine the FileSystem implementation.  The
   uri's scheme determines the config property (fs.SCHEME.impl) naming
   the FileSystem implementation class.  The uri's authority is used to
   determine the host, port, etc. for a filesystem.</description>
 </property>

 <property>
   <name>mapred.job.tracker</name>
   <value>10.105.41.25:54311</value>
   <description>The host and port that the MapReduce job tracker runs
   at.  If "local", then jobs are run in-process as a single map
   and reduce task.
   </description>
 </property>

 <property>
   <name>dfs.replication</name>
   <value>2</value>
   <description>Default block replication.
   The actual number of replications can be specified when the file is created.
   The default is used if replication is not specified in create time.
   </description>
 </property>

 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx1048M</value>
 </property>

 <property>
   <name>mapred.local.dir</name>
   <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
 </property>

 <property>
   <name>mapred.map.tasks</name>
   <value>53</value>
   <description>The default number of map tasks per job.  Typically set
   to a prime several times greater than number of available hosts.
   Ignored when mapred.job.tracker is "local".
   </description>
 </property>

 <property>
   <name>mapred.reduce.tasks</name>
   <value>7</value>
   <description>The default number of reduce tasks per job.  Typically set
   to a prime close to the number of available hosts.  Ignored when
   mapred.job.tracker is "local".
   </description>
 </property>

 </configuration>


 
 This is the output that I get when running the tasks with 2 nodes in the
 cluster:

 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to
 process : 1
 08/06/20 11:07:45 INFO mapred.JobClient: Running job:
 job_200806201106_0001
 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
 08/06/20 

Summit / Camp Hadoop at ApacheCon

2008-06-30 Thread Ajay Anand
We are planning to host a mini-summit (aka Camp Hadoop) in conjunction
with ApacheCon this year - Nov 6th and 7th - in New Orleans.

 

We are working on putting together the agenda for this now, and would
love to hear from you if you have suggestions for talks or panel
discussions that we could include. Please send your suggestions to
[EMAIL PROTECTED]

 

Thanks!

Ajay



Re: Summit / Camp Hadoop at ApacheCon

2008-06-30 Thread Ted Dunning
I would love to help, especially on the Mahout side of things.

What would you like to have?

On Mon, Jun 30, 2008 at 2:53 PM, Ajay Anand [EMAIL PROTECTED] wrote:

 We are planning to host a mini-summit (aka Camp Hadoop) in conjunction
 with ApacheCon this year - Nov 6th and 7th - in New Orleans.



 We are working on putting together the agenda for this now, and would
 love to hear from you if you have suggestions for talks or panel
 discussions that we could include. Please send your suggestions to
 [EMAIL PROTECTED]



 Thanks!

 Ajay




-- 
ted


RE: Summit / Camp Hadoop at ApacheCon

2008-06-30 Thread Ajay Anand
At this point I am looking for proposals for talks or topics for panel
discussions - similar to the Summit we did a few months ago. The idea
would be to share with the community progress that's being made with
Hadoop related projects or discuss interesting applications /
deployments using Hadoop. 

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 30, 2008 4:35 PM
To: core-user@hadoop.apache.org
Subject: Re: Summit / Camp Hadoop at ApacheCon

I would love to help, especially on the Mahout side of things.

What would you like to have?

On Mon, Jun 30, 2008 at 2:53 PM, Ajay Anand [EMAIL PROTECTED]
wrote:

 We are planning to host a mini-summit (aka Camp Hadoop) in
conjunction
 with ApacheCon this year - Nov 6th and 7th - in New Orleans.



 We are working on putting together the agenda for this now, and would
 love to hear from you if you have suggestions for talks or panel
 discussions that we could include. Please send your suggestions to
 [EMAIL PROTECTED]



 Thanks!

 Ajay




-- 
ted


Re: reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
On Monday 30 June 2008 18:38:28 Runping Qi wrote:
 Looks like the reducer stuck at shuffling phase.
 What is the progression percentage do you see for the reducer from web
 GUI?

 It is known that 0.17 does not handle shuffling well.

I think it has been 87% (meaning that 19 of 22 reducer tasks were finished).
On a smaller job size it hangs at 93%.

That makes me curious: when will 0.18 be out? Or 0.17.1? Till now I have always
managed to run into problems far enough behind the curve that there was almost
always a cure in the form of an upgrade. I am not sure whether running trunk is
a good idea.

Andreas




Using S3 Block FileSystem as HDFS replacement

2008-06-30 Thread slitz
Hello,
I've been trying to set up Hadoop to use S3 as the filesystem. I read in the
wiki that it's possible to choose either the S3 native FileSystem or the S3
Block FileSystem. I would like to use the S3 Block FileSystem to avoid the task
of manually transferring data from S3 to HDFS every time I want to run a job.

I'm still experimenting with the EC2 contrib scripts and those seem to be
excellent.
What I can't understand is how it may be possible to use S3 with a public
Hadoop AMI, since from my understanding hadoop-site.xml gets written at each
instance startup with the options in hadoop-init, and it seems that the
public AMI (at least the 0.17.0 one) is not configured to use S3 at
all (which makes sense, because the bucket would need individual configuration
anyway).

So... to use the S3 Block FileSystem with EC2 I need to create a custom AMI
with a modified hadoop-init script, right? Or am I completely confused?


slitz
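
For reference, a sketch of the configuration keys the S3 block FileSystem
expects, whichever way they end up in hadoop-site.xml on the instances (bucket
name and credentials below are placeholders):

    Configuration conf = new Configuration();
    conf.set("fs.default.name", "s3://my-hadoop-bucket");        // placeholder bucket
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");      // placeholder credentials
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");
    FileSystem fs = FileSystem.get(conf);                        // resolves to the S3 block FileSystem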


Re: Test Hadoop performance on EC2

2008-06-30 Thread 王志祥
Sorry for the previous post; I hadn't finished it. Please skip it.

Hi all,
I've made some experiments on Hadoop on Amazon EC2.
I would like to share the result and any feedback would be appreciated.

Environment:
-Xen VM (Amazon EC2 instance ami-ee53b687)
-1.7Ghz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network
bandwidth (small instance)
-Hadoop 0.17.0
-storage: HDFS
-Test example: wordcount

Experiment 1: (fixed # of instances (8), variant data size (2MB~512MB), # of
maps: 8, # of reduces: 8)
Data Size (MB) | Time (s)
512 | 124
256 |  70
128 |  41
...
  8 |  22
  4 |  17
  2 |  21

The purpose is to observe the lowest framework overhead for wordcount.
As the result shows, when the data size is between 2MB and 16MB, the time is
around 20 seconds.
May I conclude that the lowest framework overhead for wordcount is about 20s?

Experiment 2: (variant # of instances (2~32), variant data size (128MB~2GB),
# of maps: (2-32), # of reduces: (2-32))
Data Size (MB) | Map | Reduce | Time (s)
2048 | 32 | 32 | 140
1024 | 16 | 16 | 120
 512 |  8 |  8 | 124
 256 |  4 |  4 | 127
 128 |  2 |  2 | 119

The purpose is to observe that if each instance is allocated the same number of
blocks of data, the time will be similar.
As the result shows, when the data size is between 128MB and 1024MB, the time
is around 120 seconds.
The time is 140s when the data size is 2048MB. I think the reason is that more
data to process causes more overhead.

Experiment 3: (variant # of instances (2~16), fixed data size (128MB), # of
maps: (2-16), # of reduces: (2-16))
Data Size (MB) | Map | Reduce | Time (s)
128 | 16 | 16 |  31
128 |  8 |  8 |  41
128 |  4 |  4 |  69
128 |  2 |  2 | 119

The purpose is to observe, for fixed data, how the result changes as more and
more instances are added.
As the result shows, each time the number of instances doubles, the time gets
smaller, but it is not halved.
There is always some framework overhead, even given infinite instances.

In fact, I did more experiments, but I am only posting some of the results.
Interestingly, I discovered a formula for wordcount from my experiment results.
That is: Time(s) ~= 20 + ((DataSize - 8MB) * 1.6 / (# of instances))
I've checked the formula against all my experiment results and almost all of
them match.
Maybe it's coincidental or I have something wrong.
Anyway, I just want to share my experience, and any feedback would be
appreciated.
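
A quick sanity check of that empirical fit against two of the data points above
(just arithmetic; the constants are Shawn's, the method name is made up):

    // Shawn's empirical fit: time ~= 20 + (dataSizeMB - 8) * 1.6 / instances
    static double predictedSeconds(double dataSizeMB, int instances) {
        return 20.0 + (dataSizeMB - 8.0) * 1.6 / instances;
    }

    // predictedSeconds(512, 8)  -> ~120.8s  (measured: 124s)
    // predictedSeconds(128, 16) -> ~32.0s   (measured:  31s)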

-- 
Best Regards,
Shawn


Re: Too many fetch failures AND Shuffle error

2008-06-30 Thread Amar Kamat

Tarandeep Singh wrote:

I am getting this error as well.
As Sayali mentioned in his mail, I updated the /etc/hosts file with the
slave machines IP addresses, but I am still getting this error.

Amar, which is the url that you were talking about in your mail -
There will be a URL associated with a map that the reducer try to fetch
(check the reducer logs for this url)

Please tell me where should I look for it... I will try to access it
manually to see if this error is due to firewall.
  
One thing you can do is to see if all the maps that have failed while
fetching are from a remote host. Look at the web UI to find out where the
map tasks finished, and look at the reduce task logs to find out which
map fetches failed.


I am not sure if the reduce task logs have it. Try this:
port = tasktracker.http.port (this is set through the conf)
tthost = tasktracker hostname (the destination tasktracker from where the
map output needs to be fetched)

jobid = complete job id job_
mapid = the task attempt id attempt_... that has successfully completed
the map
reduce-partition-id = the partition number for the reduce task;
task_..._r_$i_$j will have reduce-partition-id as int-value($i).


url =
http://'$tthost':'$port'/mapOutput?job='$jobid'&map='$mapid'&reduce='$reduce-partition-id'

'$var' is what you have to substitute.
Amar
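
If wget or a browser isn't handy on the reducer's node, a short Java check of
the same URL works too. Everything below is a placeholder to be replaced with
the real values from the logs/web UI; 50060 is the usual tasktracker HTTP port:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FetchMapOutput {
        public static void main(String[] args) throws Exception {
            String ttHost = "slave-node-1";                        // placeholder tasktracker host
            int port = 50060;                                      // tasktracker.http.port
            String jobId = "job_200806201106_0001";                // placeholder job id
            String mapId = "task_200806201106_0001_m_000003_0";    // placeholder map attempt id
            int reducePartition = 0;                               // $i from task_..._r_$i_$j

            String url = "http://" + ttHost + ":" + port + "/mapOutput?job=" + jobId
                    + "&map=" + mapId + "&reduce=" + reducePartition;
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10 * 1000);
            System.out.println(url + " -> HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }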

Thanks,
Taran


Should there be a way not maintaining the whole namespace structure in memory?

2008-06-30 Thread heyongqiang
In the current HDFS implementation, all INodeFile and INodeDirectory objects are
loaded into memory; this is done when the namespace structure is set up at
namenode startup. The namenode analyzes the fsimage file and the edit log file.
If there are millions of files or directories, how can this be handled?

I have done an experiment by creating directories. Before the experiment:
[EMAIL PROTECTED] bin]$ ps -p 9122 -o rss,size,vsize,%mem
   RSS      SZ     VSZ %MEM
153648 1193868 1275340  3.7

After creating the directories, it becomes:
[EMAIL PROTECTED] bin]$ ps -p 9122 -o rss,size,vsize,%mem
   RSS      SZ     VSZ %MEM
169084 1193868 1275340  4.0

I am trying to improve the fsimage file so that the namenode can locate and
load the needed information on demand and, just like the Linux VFS, keep only
an inode cache. This would avoid loading the whole namespace structure at
startup.
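
A toy illustration of the kind of bounded inode cache described above (INode is
just a placeholder type here, and loading from the on-disk image is left
abstract):

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class INode { /* placeholder for the real inode type */ }

    class INodeCache {
        private final LinkedHashMap<String, INode> cache;   // path -> inode, access-ordered

        INodeCache(final int capacity) {
            this.cache = new LinkedHashMap<String, INode>(capacity, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, INode> eldest) {
                    return size() > capacity;                // evict the least recently used inode
                }
            };
        }

        INode get(String path) throws IOException {
            INode node = cache.get(path);
            if (node == null) {
                node = loadFromImage(path);                  // hypothetical: fault the inode in from fsimage
                cache.put(path, node);
            }
            return node;
        }

        private INode loadFromImage(String path) throws IOException {
            throw new UnsupportedOperationException("sketch only");
        }
    }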




Best regards,
 
Yongqiang He
2008-07-01

Email: [EMAIL PROTECTED]
Tel:   86-10-62600966(O)
 
Research Center for Grid and Service Computing,
Institute of Computing Technology, 
Chinese Academy of Sciences
P.O.Box 2704, 100080, Beijing, China 


Re: How to configure RandomWriter to generate less amount of data

2008-06-30 Thread Amar Kamat

Heshan Lin wrote:

Hi,

I'm trying to configure RandomWriter to generate less data than does 
the default configuration. 
bin/hadoop jar hadoop-*-examples.jar randomwriter \
    -Dtest.randomwrite.bytes_per_map=<value> \
    -Dtest.randomwrite.total_bytes=<value> \
    -Dtest.randomwriter.maps_per_host=<value> input-filename
The number of maps that will be spawned in this case will be 
total_bytes/bytes_per_map.
Other parameters are test.randomwrite.min_key (size in bytes), 
test.randomwrite.max_key (size in bytes), test.randomwrite.min_value 
(size in bytes) and test.randomwrite.max_value (size in bytes).

Amar
I created a job configuration file job.xml and added the variables
given at http://wiki.apache.org/hadoop/RandomWriter. I tried a couple of
ways of running the program (below), but the configuration in job.xml was
not picked up by RandomWriter.


1) bin/hadoop jar hadoop-*-examples.jar randomwriter rand job.xml
2) bin/hadoop jar hadoop-*-examples.jar randomwriter rand --conf job.xml
3) bin/hadoop jar --conf job.xml hadoop-*-examples.jar randomwriter rand

Passing property values via the -D option didn't seem to work either. 
Can anybody advise on how to use the job configuration file properly?


Thanks,
Heshan