HOD is locked on Ringmaster at : xxx

2011-05-31 Thread relpak

Good afternoon everybody,

I've got a problem with HOD. When I try to allocate a new cluster with the
command below, HOD creates a job with Torque and then gets stuck at: INFO/20
hadoop:541 - Cluster Id 61777.co-admin, DEBUG/10 hadoop:545 - Ringmaster at :
xxx

hod allocate -d myDirectory -n 3

If I then run hod info -d myDirectory, HOD reports no cluster.

Where is the problem?

Thank You for your help.
-- 
View this message in context: 
http://old.nabble.com/HOD-is-locked-on-Ringmaster-at-%3A-xxx-tp31739953p31739953.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Benchmarks with different workloads

2011-05-31 Thread Matthew John
Hi,

I am looking for Hadoop benchmarks that can characterize the following
workloads:

1) IO-intensive workload

2) CPU-intensive workload

3) Mixed (IO + CPU) workloads

Could someone please share some pointers on these?

Thanks,
Matthew


Re: Benchmarks with different workloads

2011-05-31 Thread Cristina Abad
You could try SWIM [1].

-Cristina

[1] Yanpei Chen, Archana Ganapathi, Rean Griffith, Randy Katz. SWIM
- Statistical Workload Injector for MapReduce. Available at:
http://www.eecs.berkeley.edu/~ychen2/SWIM.html

 -- Forwarded message --
 From: Matthew John tmatthewjohn1...@gmail.com
 To: common-user common-user@hadoop.apache.org
 Date: Tue, 31 May 2011 20:01:25 +0530
 Subject: Benchmarks with different workloads
 Hi,

 I am looking for Hadoop benchmarks that can characterize the following
 workloads:

 1) IO-intensive workload

 2) CPU-intensive workload

 3) Mixed (IO + CPU) workloads

 Could someone please share some pointers on these?

 Thanks,
 Matthew




Hadoop project - help needed

2011-05-31 Thread parismav

Hello dear forum,
I am working on a project on Apache Hadoop. I am totally new to this
software and I need some help understanding the basic features!

To sum up, for my project I have configured Hadoop so that it runs 3
datanodes on one machine.
The project's main goal is to use both the Flickr API (flickr.com) libraries
and the Hadoop libraries in Java, so that each of the 3 datanodes chooses a
Flickr group and returns photo info from that group.

In order to do that, I have 3 Flickr accounts, each with a different API
key.

I don't need any help on the Flickr side of the code, of course. But what I
don't understand is how to use the Mapper and Reducer parts of the code.
What input do I have to give the map() function?
Do I have to contain this whole info-downloading process in the map()
function?

In a few words, how do I convert my code so that it runs in a distributed
way on Hadoop?
Thank you!
-- 
View this message in context: 
http://old.nabble.com/Hadoop-project---help-needed-tp31741968p31741968.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Starting a Hadoop job outside the cluster

2011-05-31 Thread Steve Lewis
I have tried what you suggest (well, sort of); a good example would help a lot.
My reducer is set to, among other things, emit the local OS and user.dir.
When I try running from my Windows box these appear on HDFS but show the
Windows OS and user.dir, leading me to believe that the reducer is still
running on my Windows machine. I will check the values, but a working example
would be very useful.


On Sun, May 29, 2011 at 6:19 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:

 Would it not also be possible for a Windows machine to submit the job
 directly from a Java process? This way you don't need Cygwin / a full local
 copy of the installation (correct me if I'm wrong). The steps would then
 just be:
 1) Create a basic Java project, add minimum required libraries
 (Hadoop/logging)
 2) Set the essential properties (at least this would be the jobtracker and
 the filesystem)
 3) Implement the Tool
 4) Run the process (from either the IDE or stand-alone jar)

 Steps 1-3 could technically be implemented on another machine, if you
 choose to compile a stand-alone jar.

 Ferdy.
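As a rough illustration of steps 1-4 above (the class name, host names, and
ports below are placeholders, not details from this thread), such a
submit-from-outside client might look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class RemoteJobLauncher extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Step 2: point the client at the remote cluster (placeholder host/ports).
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        conf.set("mapred.job.tracker", "jobtracker-host:9001");

        Job job = new Job(conf, "remote submit sketch");
        job.setJarByClass(RemoteJobLauncher.class);
        // ... set mapper/reducer classes and input/output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Step 4: run from the IDE or from a stand-alone jar.
        System.exit(ToolRunner.run(new Configuration(), new RemoteJobLauncher(), args));
    }
}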


 On 05/29/2011 04:50 AM, Harsh J wrote:

 Keep a local Hadoop installation with a mirror-copy config, and use
 hadoop jar <jar> to submit as usual (since the config points to the
 right areas, the jobs go there).

 For Windows you'd need Cygwin installed, however.

 On Sun, May 29, 2011 at 12:56 AM, Steve Lewis lordjoe2...@gmail.com
  wrote:

 When I want to launch a hadoop job I use SCP to execute a command on the
 Name node machine. I am wondering if there is
 a way to launch a Hadoop job from a machine that is not on the cluster.
 How
 to do this on a Windows box or a Mac would be
 of special interest.

 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com






-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: Hadoop project - help needed

2011-05-31 Thread Robert Evans
Parismav,

So you are more or less trying to scrape some data in a distributed way.  Well,
there are several things that you could do; just be careful: I am not sure of the
terms of service for the Flickr APIs, so make sure that you are not violating
them by downloading too much data.  You probably want to use the map input data
as command/control for what the mappers do.  I would probably put it in a
format like

ACCOUNT INFO\tGROUP INFO\n

Then you could use the N-line input format so that each mapper will process one
line out of the file.  Something like (this is just pseudo-code):

Mapper<Long, String, ?, ?> {
  map(Long offset, String line, ...) {
    String[] parts = line.split("\t");
    openConnection(parts[0]);
    GroupData gd = getDataAboutGroup(parts[1]);
    ...
  }
}

I would probably not bother with a reducer if all you are doing is pulling down 
data.  Also the output format you choose really depends on the type of data you 
are downloading, and how you want to use that data later.  For example if you 
want to download the actual picture then you probably want to use a sequence 
file format or some other binary format, because converting a picture to text 
can be very costly.
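
Concretely, a rough (untested) sketch of such a mapper with the newer API,
assuming an NLineInputFormat hands each task one control line; the
Flickr-related calls are placeholders rather than a real client API:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlickrGroupMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each control line carries: ACCOUNT INFO \t GROUP INFO
        String[] parts = line.toString().split("\t");
        String account = parts[0];
        String group = parts[1];

        // Placeholder calls: replace with real Flickr client code.
        // FlickrClient client = openConnection(account);
        // GroupData gd = client.getDataAboutGroup(group);

        // Emit whatever per-photo info is extracted, keyed by the group.
        context.write(new Text(group), new Text("photo info placeholder"));
    }
}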

--Bobby Evans

On 5/31/11 10:35 AM, parismav paok_gate...@hotmail.com wrote:



Hello dear forum,
I am working on a project on Apache Hadoop. I am totally new to this
software and I need some help understanding the basic features!

To sum up, for my project I have configured Hadoop so that it runs 3
datanodes on one machine.
The project's main goal is to use both the Flickr API (flickr.com) libraries
and the Hadoop libraries in Java, so that each of the 3 datanodes chooses a
Flickr group and returns photo info from that group.

In order to do that, I have 3 Flickr accounts, each with a different API
key.

I don't need any help on the Flickr side of the code, of course. But what I
don't understand is how to use the Mapper and Reducer parts of the code.
What input do I have to give the map() function?
Do I have to contain this whole info-downloading process in the map()
function?

In a few words, how do I convert my code so that it runs in a distributed
way on Hadoop?
Thank you!
--
View this message in context: 
http://old.nabble.com/Hadoop-project---help-needed-tp31741968p31741968.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Starting a Hadoop job outside the cluster

2011-05-31 Thread Harsh J
Steve,

What do you mean when you say it shows the Windows OS and user.dir?
There will be a few properties in the job.xml that may carry client
machine information, but these shouldn't be a hindrance.

Unless a TaskTracker was started on the Windows box (no daemons ought
to be started on the client machine), no task may run on it.

On Tue, May 31, 2011 at 9:15 PM, Steve Lewis lordjoe2...@gmail.com wrote:
 I have tried what you suggest (well, sort of); a good example would help a
 lot.
 My reducer is set to, among other things, emit the local OS and user.dir.
 When I try running from my Windows box these appear on HDFS but show the
 Windows OS and user.dir, leading me to believe that the reducer is still
 running on my Windows machine. I will check the values, but a working
 example would be very useful.

 On Sun, May 29, 2011 at 6:19 AM, Ferdy Galema ferdy.gal...@kalooga.com
 wrote:

 Would it not also be possible for a Windows machine to submit the job
 directly from a Java process? This way you don't need Cygwin / a full local
 copy of the installation (correct me if I'm wrong). The steps would then
 just be:
 1) Create a basic Java project, add minimum required libraries
 (Hadoop/logging)
 2) Set the essential properties (at least this would be the jobtracker and
 the filesystem)
 3) Implement the Tool
 4) Run the process (from either the IDE or stand-alone jar)

 Steps 1-3 could technically be implemented on another machine, if you
 choose to compile a stand-alone jar.

 Ferdy.

 On 05/29/2011 04:50 AM, Harsh J wrote:

 Keep a local Hadoop installation with a mirror-copy config, and use
 hadoop jar <jar> to submit as usual (since the config points to the
 right areas, the jobs go there).

 For Windows you'd need Cygwin installed, however.

 On Sun, May 29, 2011 at 12:56 AM, Steve Lewis lordjoe2...@gmail.com
  wrote:

 When I want to launch a hadoop job I use SCP to execute a command on the
 Name node machine. I am wondering if there is
 a way to launch a Hadoop job from a machine that is not on the cluster.
 How
 to do this on a Windows box or a Mac would be
 of special interest.

 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com






 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com






-- 
Harsh J


Is something wrong when I'm shuffling 2 bytes into RAM?

2011-05-31 Thread W.P. McNeill
For some of my jobs I'll see long stretches of log files that look like
this:

INFO mapred.ReduceTask: Shuffling 2 bytes (14 raw bytes) into RAM from
attempt_201105041713_5850_m_002764_0
INFO mapred.ReduceTask: Read 2 bytes from map-output for
attempt_201105041713_5850_m_002764_0
INFO mapred.ReduceTask: attempt_201105041713_5850_r_17_2 Scheduled 1
outputs (0 slow hosts and 74 dup hosts)
INFO mapred.ReduceTask: Rec #1 from attempt_201105041713_5850_m_002764_0 ->
(-1, -1) from hnode52.tuk2.intelius.com
INFO mapred.ReduceTask: header: attempt_201105041713_5850_m_002729_0,
compressed len: 14, decompressed len: 2
INFO mapred.ReduceTask: Shuffling 2 bytes (14 raw bytes) into RAM from
attempt_201105041713_5850_m_002729_0
INFO mapred.ReduceTask: Read 2 bytes from map-output for
attempt_201105041713_5850_m_002729_0
INFO mapred.ReduceTask: Rec #1 from attempt_201105041713_5850_m_002729_0 ->
(-1, -1) from hnode42.tuk2.intelius.com
INFO mapred.ReduceTask: attempt_201105041713_5850_r_17_2 Scheduled 1
outputs (0 slow hosts and 70 dup hosts)
INFO mapred.ReduceTask: attempt_201105041713_5850_r_17_2 Scheduled 1
outputs (0 slow hosts and 64 dup hosts)
INFO mapred.ReduceTask: header: attempt_201105041713_5850_m_003036_0,
compressed len: 14, decompressed len: 2

This looks really wrong to me. Am I correct in thinking that when I'm
shuffling 2 bytes into memory at a time I've got a real performance problem?
Does anyone have ideas as to what might be going on here?

The jobs that hit this sometimes work and sometimes fail for reasons that
may or may not be related to the logs excerpted above.


Re: Starting a Hadoop job outside the cluster

2011-05-31 Thread Steve Lewis
My Reducer code says this:

public static class Reduce extends Reducer<Text, Text, Text, Text> {
    private boolean m_DateSent;

    /**
     * This method is called once for each key. Most applications will define
     * their reduce class by overriding this method. The default implementation
     * is an identity function.
     */
    @Override
    protected void reduce(Text key, Iterable<Text> values,
                          Context context)
            throws IOException, InterruptedException {
        if (!m_DateSent) {
            Text dkey = new Text("CreationDate");
            Text dValue = new Text();
            writeKeyValue(context, dkey, dValue, "CreationDate", new Date().toString());
            writeKeyValue(context, dkey, dValue, "user.dir", System.getProperty("user.dir"));
            writeKeyValue(context, dkey, dValue, "os.arch", System.getProperty("os.arch"));
            writeKeyValue(context, dkey, dValue, "os.name", System.getProperty("os.name"));

            // dkey.set("ip");
            // java.net.InetAddress addr = java.net.InetAddress.getLocalHost();
            // dValue.set(System.getProperty(addr.toString()));
            // context.write(dkey, dValue);

            m_DateSent = true;
        }
        Iterator<Text> itr = values.iterator();
        // Add interesting code here
        while (itr.hasNext()) {
            Text vCheck = itr.next();
            context.write(key, vCheck);
        }
    }
}

If os.arch is Linux, I am running on the cluster;
if Windows, I am running locally.

I run this main() hoping to run on the cluster with the NameNode and
JobTracker at glados:

    public static void main(String[] args) throws Exception {
        String outFile = "./out";
        Configuration conf = new Configuration();

        // cause output to go to the cluster
        conf.set("fs.default.name", "hdfs://glados:9000/");
        conf.set("mapreduce.jobtracker.address", "glados:9000/");
        conf.set("mapred.jar", "NShot.jar");

        conf.set("fs.defaultFS", "hdfs://glados:9000/");

        Job job = new Job(conf, "Generated data");
        conf = job.getConfiguration();
        job.setJarByClass(NShotInputFormat.class);

        ... Other setup code ...

        boolean ans = job.waitForCompletion(true);
        int ret = ans ? 0 : 1;
    }



On Tue, May 31, 2011 at 9:35 AM, Harsh J ha...@cloudera.com wrote:

 Steve,

 What do you mean when you say it shows the Windows OS and user.dir?
 There will be a few properties in the job.xml that may carry client
 machine information, but these shouldn't be a hindrance.

 Unless a TaskTracker was started on the Windows box (no daemons ought
 to be started on the client machine), no task may run on it.

 On Tue, May 31, 2011 at 9:15 PM, Steve Lewis lordjoe2...@gmail.com
 wrote:
  I have tried what you suggest (well, sort of); a good example would help
 a lot.
  My reducer is set to, among other things, emit the local OS and user.dir.
  When I try running from
  my Windows box these appear on HDFS but show the Windows OS and user.dir,
  leading me to believe that the reducer is still running on my Windows
  machine. I will
  check the values, but a working example would be very useful.
 
  On Sun, May 29, 2011 at 6:19 AM, Ferdy Galema ferdy.gal...@kalooga.com
  wrote:
 
  Would it not also be possible for a Windows machine to submit the job
  directly from a Java process? This way you don't need Cygwin / a full
 local
  copy of the installation (correct me if I'm wrong). The steps would then
  just be:
  1) Create a basic Java project, add minimum required libraries
  (Hadoop/logging)
  2) Set the essential properties (at least this would be the jobtracker
 and
  the filesystem)
  3) Implement the Tool
  4) Run the process (from either the IDE or stand-alone jar)
 
  Steps 1-3 could technically be implemented on another machine, if you
  choose to compile a stand-alone jar.
 
  Ferdy.
 
  On 05/29/2011 04:50 AM, Harsh J wrote:
 
  Keep a local Hadoop installation with a mirror-copy config, and use
  hadoop jar <jar> to submit as usual (since the config points to the
  right areas, the jobs go there).
 
  For Windows you'd need Cygwin installed, however.
 
  On Sun, May 29, 2011 at 12:56 AM, Steve Lewis lordjoe2...@gmail.com
   wrote:
 
  When I want to launch a hadoop job I use SCP to execute a command on
 the
  Name node machine. I am wondering if there is
  a way to launch a Hadoop job from a machine that is not on the
 cluster.
  How
  to do this on a Windows box or a Mac would be
  of special interest.
 
  --
  Steven M. Lewis PhD
  4221 105th Ave NE
  Kirkland, WA 98033
  206-384-1340 (cell)
  Skype lordjoe_com
 
 
 
 
 
 
  --
  Steven M. Lewis PhD
  4221 105th Ave NE
  Kirkland, WA 98033
  206-384-1340 (cell)
  Skype lordjoe_com
 
 
 



 --
 Harsh J




-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: Starting a Hadoop job outside the cluster

2011-05-31 Thread Harsh J
Simply remove that trailing slash (forgot to catch it earlier, sorry)
and you should be set (or at least more set than before surely.)
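
Putting the fixes from this sub-thread together, a minimal sketch of the
corrected client-side configuration (the JobTracker port is assumed to be 9001
here; the thread itself never states it, so treat that as a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CorrectedConfSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://glados:9000");  // NameNode URI
        conf.set("mapred.job.tracker", "glados:9001");      // host:port, no trailing slash
        conf.set("mapred.jar", "NShot.jar");                // jar to ship with the job

        Job job = new Job(conf, "Generated data");
        job.setJarByClass(CorrectedConfSketch.class);
        // ... remaining job setup as in the quoted code below ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}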

On Tue, May 31, 2011 at 10:51 PM, Steve Lewis lordjoe2...@gmail.com wrote:
 0.20.2 - we have been avoiding 0.21 because it is not terribly stable and
 made some MAJOR changes to
 critical classes

 When I say

          Configuration conf = new Configuration();
          // cause output to go to the cluster
          conf.set("fs.default.name", "hdfs://glados:9000/");
     //     conf.set("mapreduce.jobtracker.address", "glados:9000/");
          conf.set("mapred.job.tracker", "glados:9000/");
          conf.set("mapred.jar", "NShot.jar");
          //  conf.set("fs.defaultFS", "hdfs://glados:9000/");
          String[] otherArgs = new GenericOptionsParser(conf,
  args).getRemainingArgs();
  //        if (otherArgs.length != 2) {
  //            System.err.println("Usage: wordcount <in> <out>");
  //            System.exit(2);
  //        }
          Job job = new Job(conf, "Generated data");
 I get
 Exception in thread "main" java.lang.IllegalArgumentException:
 java.net.URISyntaxException: Relative path in absolute URI: glados:9000
 at org.apache.hadoop.fs.Path.initialize(Path.java:140)
 at org.apache.hadoop.fs.Path.<init>(Path.java:126)
 at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:150)
 at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
 at org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1807)
 at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:423)
 at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:410)
 at org.apache.hadoop.mapreduce.Job.<init>(Job.java:50)
 at org.apache.hadoop.mapreduce.Job.<init>(Job.java:54)
 at org.systemsbiology.hadoopgenerated.NShotTest.main(NShotTest.java:188)
 Caused by: java.net.URISyntaxException: Relative path in absolute URI:
 glados:9000
 at java.net.URI.checkPath(URI.java:1787)
 at java.net.URI.<init>(URI.java:735)
 at org.apache.hadoop.fs.Path.initialize(Path.java:137)

 I promise to publish a working example if this ever works
 On Tue, May 31, 2011 at 10:02 AM, Harsh J ha...@cloudera.com wrote:

 Steve,

 On Tue, May 31, 2011 at 10:27 PM, Steve Lewis lordjoe2...@gmail.com
 wrote:
  My Reducer code says this:
  dkey, dValue, "os.arch", System.getProperty("os.arch"));
                  writeKeyValue(context,
  dkey, dValue, "os.name", System.getProperty("os.name"));
  if os.arch is Linux I am running on the cluster -
  if Windows I am running locally

 Correct, so it should be Linux since these are System properties, and
 if you're getting Windows it's probably running locally on your client
 box itself!

          conf.set("mapreduce.jobtracker.address", "glados:9000/");

 This here might be your problem. That form of property would only work
 with 0.21.x, while on 0.20.x if you do not set it as
 mapred.job.tracker then the local job runner takes over by default,
 thereby making this odd thing happen (that's my guess).

 What version of Hadoop are you using?

 --
 Harsh J



 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com






-- 
Harsh J


Re: Hadoop project - help needed

2011-05-31 Thread jagaran das
Hi,

To be very precise,
the input to the mapper should be something you want to filter, on the basis of
which you want to do the aggregation.
The Reducer is where you aggregate the output from the mapper.

Check the WordCount example in Hadoop; it can help you understand the basic
concepts.
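
For reference, a minimal WordCount-style sketch (standard new-API Hadoop
classes, trimmed for brevity) showing that mapper-filters / reducer-aggregates
split:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Mapper: break each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reducer: aggregate the per-word counts emitted by the mappers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}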

Cheers,
Jagaran 




From: parismav paok_gate...@hotmail.com
To: core-u...@hadoop.apache.org
Sent: Tue, 31 May, 2011 8:35:27 AM
Subject: Hadoop project - help needed


Hello dear forum,
I am working on a project on Apache Hadoop. I am totally new to this
software and I need some help understanding the basic features!

To sum up, for my project I have configured Hadoop so that it runs 3
datanodes on one machine.
The project's main goal is to use both the Flickr API (flickr.com) libraries
and the Hadoop libraries in Java, so that each of the 3 datanodes chooses a
Flickr group and returns photo info from that group.

In order to do that, I have 3 Flickr accounts, each with a different API
key.

I don't need any help on the Flickr side of the code, of course. But what I
don't understand is how to use the Mapper and Reducer parts of the code.
What input do I have to give the map() function?
Do I have to contain this whole info-downloading process in the map()
function?

In a few words, how do I convert my code so that it runs in a distributed
way on Hadoop?
Thank you!
-- 
View this message in context: 
http://old.nabble.com/Hadoop-project---help-needed-tp31741968p31741968.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

trying to select technology

2011-05-31 Thread cs230

Hello All,

I am planning to start a project where I have to do extensive storage of XML
and text files. On top of that, I have to implement an efficient algorithm for
searching over thousands or millions of files, and also build some indexes to
make searches faster next time.

I looked into the Oracle database but it delivers very poor results. Can I use
Hadoop for this? Which Hadoop project would be the best fit for this?

Is there anything from Google I can use?

Thanks a lot in advance.
-- 
View this message in context: 
http://old.nabble.com/trying-to-select-technology-tp31743063p31743063.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: trying to select technology

2011-05-31 Thread jagaran das
Think of Lucene and Apache SOLR

Cheers,
Jagaran 




From: cs230 chintanjs...@gmail.com
To: core-u...@hadoop.apache.org
Sent: Tue, 31 May, 2011 10:50:49 AM
Subject: trying to select technology


Hello All,

I am planning to start project where I have to do extensive storage of xml
and text files. On top of that I have to implement efficient algorithm for
searching over thousands or millions of files, and also do some indexes to
make search faster next time. 

I looked into Oracle database but it delivers very poor result. Can I use
Hadoop for this? Which Hadoop project would be best fit for this? 

Is there anything from Google I can use? 

Thanks a lot in advance.
-- 
View this message in context: 
http://old.nabble.com/trying-to-select-technology-tp31743063p31743063.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

Re: trying to select technology

2011-05-31 Thread Matthew Foley
Sounds like you're looking for a full-text inverted index.  Lucene is a good
open-source implementation of that.  I believe it has an option for storing the
original full text as well as the indexes.
--Matt

On May 31, 2011, at 10:50 AM, cs230 wrote:


Hello All,

I am planning to start project where I have to do extensive storage of xml
and text files. On top of that I have to implement efficient algorithm for
searching over thousands or millions of files, and also do some indexes to
make search faster next time. 

I looked into Oracle database but it delivers very poor result. Can I use
Hadoop for this? Which Hadoop project would be best fit for this? 

Is there anything from Google I can use? 

Thanks a lot in advance.
-- 
View this message in context: 
http://old.nabble.com/trying-to-select-technology-tp31743063p31743063.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: trying to select technology

2011-05-31 Thread Ted Dunning
To pile on, thousands or millions of documents are well within the range
that is well addressed by Lucene.

Solr may be an even better option than bare Lucene since it handles lots of
the boilerplate problems like document parsing and index update scheduling.

On Tue, May 31, 2011 at 11:56 AM, Matthew Foley ma...@yahoo-inc.com wrote:

 Sounds like you're looking for a full-text inverted index.  Lucene is a
 good opensource implementation of that.  I believe it has an option for
 storing the original full text as well as the indexes.
 --Matt

 On May 31, 2011, at 10:50 AM, cs230 wrote:


 Hello All,

 I am planning to start project where I have to do extensive storage of xml
 and text files. On top of that I have to implement efficient algorithm for
 searching over thousands or millions of files, and also do some indexes to
 make search faster next time.

 I looked into Oracle database but it delivers very poor result. Can I use
 Hadoop for this? Which Hadoop project would be best fit for this?

 Is there anything from Google I can use?

 Thanks a lot in advance.
 --
 View this message in context:
 http://old.nabble.com/trying-to-select-technology-tp31743063p31743063.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: No. of Map and reduce tasks

2011-05-31 Thread Mohit Anchlia
What if I had multiple files in the input directory? Would Hadoop then
fire parallel map tasks?
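
The suggestions quoted below (lowering the input split size, setting the
reducer count) map roughly onto the following new-API sketch; exact property
behaviour varies by Hadoop version, so treat this as an illustration rather
than a recipe:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TaskCountSketch {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "task count sketch");

        // The number of reduce tasks is directly under your control.
        job.setNumReduceTasks(4);

        // The number of map tasks is driven by the input splits. Lowering the
        // maximum split size (in bytes) forces more splits per file, e.g. ~250 KB.
        FileInputFormat.setMaxInputSplitSize(job, 250 * 1024L);

        return job;
    }
}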


On Thu, May 26, 2011 at 7:21 PM, jagaran das jagaran_...@yahoo.co.in wrote:
 If you give it really small files, then the benefit of Hadoop's big block
 size goes away.
 Instead, try merging the files.

 Hope that helps



 
 From: James Seigel ja...@tynt.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 6:04:07 PM
 Subject: Re: No. of Map and reduce tasks

 Set the input split size really low, and you might get something.

 I'd rather you fire up some *nix commands, pack that file together
 onto itself a bunch of times, then put it back into HDFS and let 'er
 rip

 Sent from my mobile. Please excuse the typos.

 On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I think I understand that from the last 2 replies :)  But my question is: can
 I change this configuration to, say, split the file into 250K chunks so that
 multiple mappers can be invoked?

 On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote:
 have more data for it to process :)


 On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote:

 I ran a simple pig script on this file:

 -rw-r--r-- 1 root root   208348 May 26 13:43 excite-small.log

 that orders the contents by name. But it only created one mapper. How
 can I change this to distribute across multiple machines?

 On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in
 wrote:
 Hi Mohit,

 No. of Maps - it depends on the total file size / block size.
 No. of Reducers - you can specify it.

 Regards,
 Jagaran



 
 From: Mohit Anchlia mohitanch...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 2:48:20 PM
 Subject: No. of Map and reduce tasks

 How can I tell how the map and reduce tasks were spread across the
 cluster? I looked at the jobtracker web page but can't find that info.

 Also, can I specify how many map or reduce tasks I want to be launched?

 From what I understand, it's based on the number of input files
 passed to Hadoop. So if I have 4 files there will be 4 map tasks
 launched, and the reducer is dependent on the HashPartitioner.






Re: Why don't my jobs get preempted?

2011-05-31 Thread Edward Capriolo
On Tue, May 31, 2011 at 2:50 PM, W.P. McNeill bill...@gmail.com wrote:

 I'm launching long-running tasks on a cluster running the Fair Scheduler.
  As I understand it, the Fair Scheduler is preemptive. What I expect to see
 is that my long-running jobs sometimes get killed to make room for other
 people's jobs. This never happens; instead, my long-running jobs hog mapper
 and reducer slots and starve other people out.

 Am I misunderstanding how the Fair Scheduler works?


Try adding

  <minSharePreemptionTimeout>120</minSharePreemptionTimeout>
  <fairSharePreemptionTimeout>180</fairSharePreemptionTimeout>

to one of your pools and see if that pool preempts the other pools.


Starting JobTracker Locally but binding to remote Address

2011-05-31 Thread Juan P.
Hi Guys,
I recently configured my cluster to have 2 VMs. I configured 1
machine (slave3) to be the namenode and another to be the
jobtracker (slave2). They both work as datanode/tasktracker as well.

Both configs have the following contents in their masters and slaves files:
slave2
slave3

Both machines have the following contents in their mapred-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>slave2:9001</value>
  </property>
</configuration>

Both machines have the following contents in their core-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://slave3:9000</value>
  </property>
</configuration>

When I log into the namenode and I run the start-all.sh script, everything
but the jobtracker starts. In the log files I get the following exception:

/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = slave3/10.20.11.112
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
2011-05-31 13:54:06,940 INFO org.apache.hadoop.mapred.JobTracker: Scheduler
configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
2011-05-31 13:54:07,086 FATAL org.apache.hadoop.mapred.JobTracker:
java.net.BindException: Problem binding to slave2/10.20.11.166:9001 : Cannot
assign requested address
        at org.apache.hadoop.ipc.Server.bind(Server.java:190)
        at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:253)
        at org.apache.hadoop.ipc.Server.<init>(Server.java:1026)
        at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:488)
        at org.apache.hadoop.ipc.RPC.getServer(RPC.java:450)
        at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1595)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)
        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)
Caused by: java.net.BindException: Cannot assign requested address
        at sun.nio.ch.Net.bind(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
        at org.apache.hadoop.ipc.Server.bind(Server.java:188)
        ... 8 more

2011-05-31 13:54:07,096 INFO org.apache.hadoop.mapred.JobTracker:
SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down JobTracker at slave3/10.20.11.112
************************************************************/


As I see it, from the lines

STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = slave3/10.20.11.112

the namenode (slave3) is trying to run the jobtracker locally, but when it
starts the jobtracker server it binds it to the slave2 address and of course
fails:

Problem binding to slave2/10.20.11.166:9001

What do you guys think could be going wrong?

Thanks!
Pony


Re: Starting JobTracker Locally but binding to remote Address

2011-05-31 Thread Konstantin Boudnik
This seems to be your problem, really...
  <name>mapred.job.tracker</name>
  <value>slave2:9001</value>

On Tue, May 31, 2011 at 06:07PM, Juan P. wrote:
 Hi Guys,
 I recently configured my cluster to have 2 VMs. I configured 1
 machine (slave3) to be the namenode and another to be the
 jobtracker (slave2). They both work as datanode/tasktracker as well.
 
 Both configs have the following contents in their masters and slaves file:
 *slave2*
 *slave3*
 
 Both machines have the following contents on their mapred-site.xml file:
 *?xml version=1.0?*
 *?xml-stylesheet type=text/xsl href=configuration.xsl?*
 *
 *
 *!-- Put site-specific property overrides in this file. --*
 *
 *
 *configuration*
 * property*
 * namemapred.job.tracker/name*
 * valueslave2:9001/value*
 * /property*
 */configuration*
 
 Both machines have the following contents on their core-site.xml file:
 *?xml version=1.0?*
 *?xml-stylesheet type=text/xsl href=configuration.xsl?*
 *
 *
 *!-- Put site-specific property overrides in this file. --*
 *
 *
 *configuration*
 * property*
 * namefs.default.name/name*
 * valuehdfs://slave3:9000/value*
 * /property*
 */configuration*
 
 When I log into the namenode and I run the start-all.sh script, everything
 but the jobtracker starts. In the log files I get the following exception:
 
 */*
 *STARTUP_MSG: Starting JobTracker*
 *STARTUP_MSG:   host = slave3/10.20.11.112*
 *STARTUP_MSG:   args = []*
 *STARTUP_MSG:   version = 0.20.2*
 *STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010*
 */*
 *2011-05-31 13:54:06,940 INFO org.apache.hadoop.mapred.JobTracker: Scheduler
 configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
 limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)*
 *2011-05-31 13:54:07,086 FATAL org.apache.hadoop.mapred.JobTracker:
 java.net.BindException: Problem binding to slave2/10.20.11.166:9001 : Cannot
 assign requested address*
 *at org.apache.hadoop.ipc.Server.bind(Server.java:190)*
 *at org.apache.hadoop.ipc.Server$Listener.init(Server.java:253)*
 *at org.apache.hadoop.ipc.Server.init(Server.java:1026)*
 *at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:488)*
 *at org.apache.hadoop.ipc.RPC.getServer(RPC.java:450)*
 *at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:1595)
 *
 *at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)*
 *at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)*
 *at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)*
 *Caused by: java.net.BindException: Cannot assign requested address*
 *at sun.nio.ch.Net.bind(Native Method)*
 *at
 sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)*
 *at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
 *
 *at org.apache.hadoop.ipc.Server.bind(Server.java:188)*
 *... 8 more*
 *
 *
 *2011-05-31 13:54:07,096 INFO org.apache.hadoop.mapred.JobTracker:
 SHUTDOWN_MSG:*
 */*
 *SHUTDOWN_MSG: Shutting down JobTracker at slave3/10.20.11.112*
 */*
 
 
 As I see it, from the lines
 
 *STARTUP_MSG: Starting JobTracker*
 *STARTUP_MSG:   host = slave3/10.20.11.112*
 
 the namenode (slave3) is trying to run the jobtracker locally but when it
 starts the jobtracker server it binds it to the slave2 address and of course
 fails:
 
 *Problem binding to slave2/10.20.11.166:9001*
 
 What do you guys think could be going wrong?
 
 Thanks!
 Pony


Re: Starting JobTracker Locally but binding to remote Address

2011-05-31 Thread Joey Echeverria
The problem is that start-all.sh isn't all that intelligent. The way
that start-all.sh works is by running start-dfs.sh and
start-mapred.sh. The start-mapred.sh script always starts a job
tracker on the local host and a task tracker on all of the hosts
listed in slaves (it uses SSH to do the remote execution). The
start-dfs.sh script always starts a name node on the local host, a
data node on all of the hosts listed in slaves, and a secondary name
node on all of the hosts listed in masters.

In your case, you'll want to run start-dfs.sh on slave3 and
start-mapred.sh on slave2.

-Joey

On Tue, May 31, 2011 at 5:07 PM, Juan P. gordoslo...@gmail.com wrote:
 Hi Guys,
 I recently configured my cluster to have 2 VMs. I configured 1
 machine (slave3) to be the namenode and another to be the
 jobtracker (slave2). They both work as datanode/tasktracker as well.

 Both configs have the following contents in their masters and slaves file:
 *slave2*
 *slave3*

 Both machines have the following contents on their mapred-site.xml file:
 *?xml version=1.0?*
 *?xml-stylesheet type=text/xsl href=configuration.xsl?*
 *
 *
 *!-- Put site-specific property overrides in this file. --*
 *
 *
 *configuration*
 * property*
 * namemapred.job.tracker/name*
 * valueslave2:9001/value*
 * /property*
 */configuration*

 Both machines have the following contents on their core-site.xml file:
 *?xml version=1.0?*
 *?xml-stylesheet type=text/xsl href=configuration.xsl?*
 *
 *
 *!-- Put site-specific property overrides in this file. --*
 *
 *
 *configuration*
 * property*
 * namefs.default.name/name*
 * valuehdfs://slave3:9000/value*
 * /property*
 */configuration*

 When I log into the namenode and I run the start-all.sh script, everything
 but the jobtracker starts. In the log files I get the following exception:

 */*
 *STARTUP_MSG: Starting JobTracker*
 *STARTUP_MSG:   host = slave3/10.20.11.112*
 *STARTUP_MSG:   args = []*
 *STARTUP_MSG:   version = 0.20.2*
 *STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010*
 */*
 *2011-05-31 13:54:06,940 INFO org.apache.hadoop.mapred.JobTracker: Scheduler
 configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
 limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)*
 *2011-05-31 13:54:07,086 FATAL org.apache.hadoop.mapred.JobTracker:
 java.net.BindException: Problem binding to slave2/10.20.11.166:9001 : Cannot
 assign requested address*
 *        at org.apache.hadoop.ipc.Server.bind(Server.java:190)*
 *        at org.apache.hadoop.ipc.Server$Listener.init(Server.java:253)*
 *        at org.apache.hadoop.ipc.Server.init(Server.java:1026)*
 *        at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:488)*
 *        at org.apache.hadoop.ipc.RPC.getServer(RPC.java:450)*
 *        at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:1595)
 *
 *        at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)*
 *        at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)*
 *        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)*
 *Caused by: java.net.BindException: Cannot assign requested address*
 *        at sun.nio.ch.Net.bind(Native Method)*
 *        at
 sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)*
 *        at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
 *
 *        at org.apache.hadoop.ipc.Server.bind(Server.java:188)*
 *        ... 8 more*
 *
 *
 *2011-05-31 13:54:07,096 INFO org.apache.hadoop.mapred.JobTracker:
 SHUTDOWN_MSG:*
 */*
 *SHUTDOWN_MSG: Shutting down JobTracker at slave3/10.20.11.112*
 */*


 As I see it, from the lines

 *STARTUP_MSG: Starting JobTracker*
 *STARTUP_MSG:   host = slave3/10.20.11.112*

 the namenode (slave3) is trying to run the jobtracker locally but when it
 starts the jobtracker server it binds it to the slave2 address and of course
 fails:

 *Problem binding to slave2/10.20.11.166:9001*

 What do you guys think could be going wrong?

 Thanks!
 Pony




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Starting JobTracker Locally but binding to remote Address

2011-05-31 Thread gordoslocos
Eeeeh, why? Isn't that the config for the jobtracker? slave2 has been defined
in my /etc/hosts file.
Should those lines not be in both nodes?

Thanks for helping!
Pony

On 31/05/2011, at 18:12, Konstantin Boudnik c...@apache.org wrote:

 This seems to be your problem, really...
  <name>mapred.job.tracker</name>
  <value>slave2:9001</value>
 
 On Tue, May 31, 2011 at 06:07PM, Juan P. wrote:
 Hi Guys,
 I recently configured my cluster to have 2 VMs. I configured 1
 machine (slave3) to be the namenode and another to be the
 jobtracker (slave2). They both work as datanode/tasktracker as well.
 
 Both configs have the following contents in their masters and slaves file:
 *slave2*
 *slave3*
 
 Both machines have the following contents on their mapred-site.xml file:
 *?xml version=1.0?*
 *?xml-stylesheet type=text/xsl href=configuration.xsl?*
 *
 *
 *!-- Put site-specific property overrides in this file. --*
 *
 *
 *configuration*
 * property*
 * namemapred.job.tracker/name*
 * valueslave2:9001/value*
 * /property*
 */configuration*
 
 Both machines have the following contents on their core-site.xml file:
 *?xml version=1.0?*
 *?xml-stylesheet type=text/xsl href=configuration.xsl?*
 *
 *
 *!-- Put site-specific property overrides in this file. --*
 *
 *
 *configuration*
 * property*
 * namefs.default.name/name*
 * valuehdfs://slave3:9000/value*
 * /property*
 */configuration*
 
 When I log into the namenode and I run the start-all.sh script, everything
 but the jobtracker starts. In the log files I get the following exception:
 
 */*
 *STARTUP_MSG: Starting JobTracker*
 *STARTUP_MSG:   host = slave3/10.20.11.112*
 *STARTUP_MSG:   args = []*
 *STARTUP_MSG:   version = 0.20.2*
 *STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010*
 */*
 *2011-05-31 13:54:06,940 INFO org.apache.hadoop.mapred.JobTracker: Scheduler
 configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
 limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)*
 *2011-05-31 13:54:07,086 FATAL org.apache.hadoop.mapred.JobTracker:
 java.net.BindException: Problem binding to slave2/10.20.11.166:9001 : Cannot
 assign requested address*
 *at org.apache.hadoop.ipc.Server.bind(Server.java:190)*
 *at org.apache.hadoop.ipc.Server$Listener.init(Server.java:253)*
 *at org.apache.hadoop.ipc.Server.init(Server.java:1026)*
 *at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:488)*
 *at org.apache.hadoop.ipc.RPC.getServer(RPC.java:450)*
 *at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:1595)
 *
 *at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)*
 *at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)*
 *at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)*
 *Caused by: java.net.BindException: Cannot assign requested address*
 *at sun.nio.ch.Net.bind(Native Method)*
 *at
 sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)*
 *at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
 *
 *at org.apache.hadoop.ipc.Server.bind(Server.java:188)*
 *... 8 more*
 *
 *
 *2011-05-31 13:54:07,096 INFO org.apache.hadoop.mapred.JobTracker:
 SHUTDOWN_MSG:*
 */*
 *SHUTDOWN_MSG: Shutting down JobTracker at slave3/10.20.11.112*
 */*
 
 
 As I see it, from the lines
 
 *STARTUP_MSG: Starting JobTracker*
 *STARTUP_MSG:   host = slave3/10.20.11.112*
 
 the namenode (slave3) is trying to run the jobtracker locally but when it
 starts the jobtracker server it binds it to the slave2 address and of course
 fails:
 
 *Problem binding to slave2/10.20.11.166:9001*
 
 What do you guys think could be going wrong?
 
 Thanks!
 Pony


Re: Starting JobTracker Locally but binding to remote Address

2011-05-31 Thread gordoslocos
:D I'll give that a try first thing in the morning! Thanks a lot, Joey!!

Sent from my iPhone

On 31/05/2011, at 18:18, Joey Echeverria j...@cloudera.com wrote:

 The problem is that start-all.sh isn't all that intelligent. The way
 that start-all.sh works is by running start-dfs.sh and
 start-mapred.sh. The start-mapred.sh script always starts a job
 tracker on the local host and a task tracker on all of the hosts
 listed in slaves (it uses SSH to do the remote execution). The
 start-dfs.sh script always starts a name node on the local host, a
 data node on all of the hosts listed in slaves, and a secondary name
 node on all of the hosts listed in masters.
 
 In your case, you'll want to run start-dfs.sh on slave3 and
 start-mapred.sh on slave2.
 
 -Joey
 
 On Tue, May 31, 2011 at 5:07 PM, Juan P. gordoslo...@gmail.com wrote:
 Hi Guys,
 I recently configured my cluster to have 2 VMs. I configured 1
 machine (slave3) to be the namenode and another to be the
 jobtracker (slave2). They both work as datanode/tasktracker as well.
 
 Both configs have the following contents in their masters and slaves file:
 *slave2*
 *slave3*
 
 Both machines have the following contents on their mapred-site.xml file:
 *?xml version=1.0?*
 *?xml-stylesheet type=text/xsl href=configuration.xsl?*
 *
 *
 *!-- Put site-specific property overrides in this file. --*
 *
 *
 *configuration*
 * property*
 * namemapred.job.tracker/name*
 * valueslave2:9001/value*
 * /property*
 */configuration*
 
 Both machines have the following contents on their core-site.xml file:
 *?xml version=1.0?*
 *?xml-stylesheet type=text/xsl href=configuration.xsl?*
 *
 *
 *!-- Put site-specific property overrides in this file. --*
 *
 *
 *configuration*
 * property*
 * namefs.default.name/name*
 * valuehdfs://slave3:9000/value*
 * /property*
 */configuration*
 
 When I log into the namenode and I run the start-all.sh script, everything
 but the jobtracker starts. In the log files I get the following exception:
 
 */*
 *STARTUP_MSG: Starting JobTracker*
 *STARTUP_MSG:   host = slave3/10.20.11.112*
 *STARTUP_MSG:   args = []*
 *STARTUP_MSG:   version = 0.20.2*
 *STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010*
 */*
 *2011-05-31 13:54:06,940 INFO org.apache.hadoop.mapred.JobTracker: Scheduler
 configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
 limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)*
 *2011-05-31 13:54:07,086 FATAL org.apache.hadoop.mapred.JobTracker:
 java.net.BindException: Problem binding to slave2/10.20.11.166:9001 : Cannot
 assign requested address*
 *at org.apache.hadoop.ipc.Server.bind(Server.java:190)*
 *at org.apache.hadoop.ipc.Server$Listener.init(Server.java:253)*
 *at org.apache.hadoop.ipc.Server.init(Server.java:1026)*
 *at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:488)*
 *at org.apache.hadoop.ipc.RPC.getServer(RPC.java:450)*
 *at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:1595)
 *
 *at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)*
 *at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)*
 *at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)*
 *Caused by: java.net.BindException: Cannot assign requested address*
 *at sun.nio.ch.Net.bind(Native Method)*
 *at
 sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)*
 *at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
 *
 *at org.apache.hadoop.ipc.Server.bind(Server.java:188)*
 *... 8 more*
 *
 *
 *2011-05-31 13:54:07,096 INFO org.apache.hadoop.mapred.JobTracker:
 SHUTDOWN_MSG:*
 */*
 *SHUTDOWN_MSG: Shutting down JobTracker at slave3/10.20.11.112*
 */*
 
 
 As I see it, from the lines
 
 *STARTUP_MSG: Starting JobTracker*
 *STARTUP_MSG:   host = slave3/10.20.11.112*
 
 the namenode (slave3) is trying to run the jobtracker locally but when it
 starts the jobtracker server it binds it to the slave2 address and of course
 fails:
 
 *Problem binding to slave2/10.20.11.166:9001*
 
 What do you guys think could be going wrong?
 
 Thanks!
 Pony
 
 
 
 
 -- 
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434


Re: Starting JobTracker Locally but binding to remote Address

2011-05-31 Thread Konstantin Boudnik
On Tue, May 31, 2011 at 06:21PM, gordoslocos wrote:
 Eeeeh, why? Isn't that the config for the jobtracker? slave2 has been
 defined in my /etc/hosts file.
 Should those lines not be in both nodes?

Indeed, but you are running the MR start script on slave3, meaning that the JT
will be started on slave3 whatever the configuration says: start-mapred.sh
isn't that smart and doesn't check your configs.

Cos

 Thanks for helping!
 Pony
 
 On 31/05/2011, at 18:12, Konstantin Boudnik c...@apache.org wrote:
 
  This seems to be your problem, really...
  <name>mapred.job.tracker</name>
  <value>slave2:9001</value>
  
  On Tue, May 31, 2011 at 06:07PM, Juan P. wrote:
  Hi Guys,
  I recently configured my cluster to have 2 VMs. I configured 1
  machine (slave3) to be the namenode and another to be the
  jobtracker (slave2). They both work as datanode/tasktracker as well.
  
  Both configs have the following contents in their masters and slaves file:
  *slave2*
  *slave3*
  
  Both machines have the following contents on their mapred-site.xml file:
  *?xml version=1.0?*
  *?xml-stylesheet type=text/xsl href=configuration.xsl?*
  *
  *
  *!-- Put site-specific property overrides in this file. --*
  *
  *
  *configuration*
  * property*
  * namemapred.job.tracker/name*
  * valueslave2:9001/value*
  * /property*
  */configuration*
  
  Both machines have the following contents on their core-site.xml file:
  *?xml version=1.0?*
  *?xml-stylesheet type=text/xsl href=configuration.xsl?*
  *
  *
  *!-- Put site-specific property overrides in this file. --*
  *
  *
  *configuration*
  * property*
  * namefs.default.name/name*
  * valuehdfs://slave3:9000/value*
  * /property*
  */configuration*
  
  When I log into the namenode and I run the start-all.sh script, everything
  but the jobtracker starts. In the log files I get the following exception:
  
  */*
  *STARTUP_MSG: Starting JobTracker*
  *STARTUP_MSG:   host = slave3/10.20.11.112*
  *STARTUP_MSG:   args = []*
  *STARTUP_MSG:   version = 0.20.2*
  *STARTUP_MSG:   build =
  https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
  911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010*
  */*
  *2011-05-31 13:54:06,940 INFO org.apache.hadoop.mapred.JobTracker: 
  Scheduler
  configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
  limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)*
  *2011-05-31 13:54:07,086 FATAL org.apache.hadoop.mapred.JobTracker:
  java.net.BindException: Problem binding to slave2/10.20.11.166:9001 : 
  Cannot
  assign requested address*
  *at org.apache.hadoop.ipc.Server.bind(Server.java:190)*
  *at org.apache.hadoop.ipc.Server$Listener.init(Server.java:253)*
  *at org.apache.hadoop.ipc.Server.init(Server.java:1026)*
  *at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:488)*
  *at org.apache.hadoop.ipc.RPC.getServer(RPC.java:450)*
  *at 
  org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:1595)
  *
  *at
  org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)*
  *at
  org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:175)*
  *at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3702)*
  *Caused by: java.net.BindException: Cannot assign requested address*
  *at sun.nio.ch.Net.bind(Native Method)*
  *at
  sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)*
  *at 
  sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
  *
  *at org.apache.hadoop.ipc.Server.bind(Server.java:188)*
  *... 8 more*
  *
  *
  *2011-05-31 13:54:07,096 INFO org.apache.hadoop.mapred.JobTracker:
  SHUTDOWN_MSG:*
  */*
  *SHUTDOWN_MSG: Shutting down JobTracker at slave3/10.20.11.112*
  */*
  
  
  As I see it, from the lines
  
  *STARTUP_MSG: Starting JobTracker*
  *STARTUP_MSG:   host = slave3/10.20.11.112*
  
  the namenode (slave3) is trying to run the jobtracker locally but when it
  starts the jobtracker server it binds it to the slave2 address and of 
  course
  fails:
  
  *Problem binding to slave2/10.20.11.166:9001*
  
  What do you guys think could be going wrong?
  
  Thanks!
  Pony


copyToLocal (from Amazon AWS)

2011-05-31 Thread neeral beladia
Hi,

I am not sure if this question has been asked. It's more of a hadoop fs
question. I am trying to execute the following hadoop fs command:

hadoop fs -copyToLocal s3n://<Access Key>:<Secret Key>@<bucket
name>/file.txt /home/hadoop/workspace/file.txt

When I execute this command directly from the Terminal shell, it works
perfectly fine; however, the above command from code doesn't execute. In fact,
it says:

Exception in thread "main" copyToLocal: null

Please note I am using Runtime.getRuntime().exec(cmdStr), where cmdStr is the
above hadoop command. Also, please note that hadoop fs -cp or hadoop fs -rmr
commands work fine with the source and destination both being Amazon AWS
locations.
In the above command (hadoop fs -copyToLocal) the destination is a local
location on my machine (Ubuntu installed).

Your help would be greatly appreciated.

Thanks,

Neeral

Re: Why don't my jobs get preempted?

2011-05-31 Thread Matei Zaharia
Preemption is only available in Hadoop 0.20+ or in distributions of Hadoop that 
have applied that patch, such as Cloudera's distribution. If you are running 
one of these, check out 
http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html for 
information on how to enable preemption.

Matei

On May 31, 2011, at 12:20 PM, Edward Capriolo wrote:

 On Tue, May 31, 2011 at 2:50 PM, W.P. McNeill bill...@gmail.com wrote:
 
 I'm launching long-running tasks on a cluster running the Fair Scheduler.
 As I understand it, the Fair Scheduler is preemptive. What I expect to see
 is that my long-running jobs sometimes get killed to make room for other
  people's jobs. This never happens; instead, my long-running jobs hog mapper
 and reducer slots and starve other people out.
 
 Am I misunderstanding how the Fair Scheduler works?
 
 
 Try adding
 
    <minSharePreemptionTimeout>120</minSharePreemptionTimeout>
    <fairSharePreemptionTimeout>180</fairSharePreemptionTimeout>
 
 To one of your pools and see if that pool pre-empts other pools



Re: Why don't my jobs get preempted?

2011-05-31 Thread Matei Zaharia
Sorry, I meant 0.21+, not 0.20+ for the Apache releases.

Matei

On May 31, 2011, at 4:05 PM, Matei Zaharia wrote:

 Preemption is only available in Hadoop 0.20+ or in distributions of Hadoop 
 that have applied that patch, such as Cloudera's distribution. If you are 
 running one of these, check out 
 http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html for 
 information on how to enable preemption.
 
 Matei
 
 On May 31, 2011, at 12:20 PM, Edward Capriolo wrote:
 
 On Tue, May 31, 2011 at 2:50 PM, W.P. McNeill bill...@gmail.com wrote:
 
 I'm launching long-running tasks on a cluster running the Fair Scheduler.
 As I understand it, the Fair Scheduler is preemptive. What I expect to see
 is that my long-running jobs sometimes get killed to make room for other
  people's jobs. This never happens; instead, my long-running jobs hog mapper
 and reducer slots and starve other people out.
 
 Am I misunderstanding how the Fair Scheduler works?
 
 
 Try adding
 
    <minSharePreemptionTimeout>120</minSharePreemptionTimeout>
    <fairSharePreemptionTimeout>180</fairSharePreemptionTimeout>
 
 To one of your pools and see if that pool pre-empts other pools
 



Re: copyToLocal (from Amazon AWS)

2011-05-31 Thread Mapred Learn
Try using the complete path to where your hadoop binary is present, e.g.
/usr/bin/hadoop instead of hadoop...
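
As an aside, a rough sketch (untested; the bucket, keys, and paths are
placeholders) of doing the same copy through the Hadoop FileSystem API instead
of shelling out to the hadoop binary, which sidesteps PATH and quoting issues
with Runtime.exec():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3CopyToLocalSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; normally these live in core-site.xml.
        conf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY");

        Path src = new Path("s3n://bucket-name/file.txt");
        Path dst = new Path("/home/hadoop/workspace/file.txt");

        // Open the S3 filesystem and copy the object to local disk.
        FileSystem s3 = FileSystem.get(src.toUri(), conf);
        s3.copyToLocalFile(src, dst);
    }
}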



On Tue, May 31, 2011 at 3:56 PM, neeral beladia neeral_bela...@yahoo.com wrote:

 Hi,

 I am not sure if this question has been asked. Its more of a  hadoop fs
 question. I am trying to execute the following hadoop fs  command :

 hadoop fs -copyToLocal s3n://Access  Key:Secret Key@bucket
 name/file.txt /home/hadoop/workspace/file.txt

 When I execute this command  directly from the Terminal shell, it works
 perfectly fine, however the  above command from code doesn't execute. In
 fact,
 it says :

 Exception in thread main copyToLocal: null

 Please  note I am using Runtime.getRunTime().exec(cmdStr), where cmdStr is
 the
 above hadoop command. Also, please note that hadoop fs -cp or hadoop fs
  -rmr
 commands work fine with source and destination being both Amazon  AWS
 locations.
 In the above command (hadoop fs -copyToLocal)  the destination is local
 location
 to my machine(Ubuntu installed).

 Your help would be greatly appreciated.

 Thanks,

 Neeral


Re: copyToLocal (from Amazon AWS)

2011-05-31 Thread Mapred Learn
Oops, reading it again, the command itself is working.
What is the exact string that you have in cmdStr?




On Tue, May 31, 2011 at 4:51 PM, Mapred Learn mapred.le...@gmail.com wrote:

 try using complete path for where you hadoop binary is present. For eg
 /usr/bin/hadoop instead of hadoop...



 On Tue, May 31, 2011 at 3:56 PM, neeral beladia 
  neeral_bela...@yahoo.com wrote:

 Hi,

 I am not sure if this question has been asked. Its more of a  hadoop fs
 question. I am trying to execute the following hadoop fs  command :

 hadoop fs -copyToLocal s3n://Access  Key:Secret Key@bucket
 name/file.txt /home/hadoop/workspace/file.txt

 When I execute this command  directly from the Terminal shell, it works
 perfectly fine, however the  above command from code doesn't execute. In
 fact,
 it says :

 Exception in thread main copyToLocal: null

 Please  note I am using Runtime.getRunTime().exec(cmdStr), where cmdStr is
 the
 above hadoop command. Also, please note that hadoop fs -cp or hadoop fs
  -rmr
 commands work fine with source and destination being both Amazon  AWS
 locations.
 In the above command (hadoop fs -copyToLocal)  the destination is local
 location
 to my machine(Ubuntu installed).

 Your help would be greatly appreciated.

 Thanks,

 Neeral





Re: trying to select technology

2011-05-31 Thread Jane Chen
Hi,

I think you should check out MarkLogic, a product with database and search 
capabilities especially designed for XML and unstructured data.  We also allow 
you to run Hadoop MapReduce jobs on top of data stored in MarkLogic.

For more information on MarkLogic, please check out: 
http://www.marklogic.com/products/overview.html

Thanks,
Jane

--- On Tue, 5/31/11, cs230 chintanjs...@gmail.com wrote:

 From: cs230 chintanjs...@gmail.com
 Subject: trying to select technology
 To: core-u...@hadoop.apache.org
 Date: Tuesday, May 31, 2011, 10:50 AM
 
 Hello All,
 
 I am planning to start project where I have to do extensive
 storage of xml
 and text files. On top of that I have to implement
 efficient algorithm for
 searching over thousands or millions of files, and also do
 some indexes to
 make search faster next time. 
 
 I looked into Oracle database but it delivers very poor
 result. Can I use
 Hadoop for this? Which Hadoop project would be best fit for
 this? 
 
 Is there anything from Google I can use? 
 
 Thanks a lot in advance.
 -- 
 View this message in context: 
 http://old.nabble.com/trying-to-select-technology-tp31743063p31743063.html
 Sent from the Hadoop core-user mailing list archive at
 Nabble.com.
 
 


DistributedCache - getLocalCacheFiles method returns null

2011-05-31 Thread neeral beladia
Hi,

I have a file on Amazon AWS under:

s3n://<Access Key>:<Secret Key>@<Bucket Name>/file.txt

I want this file to be accessible by the slave nodes via the DistributedCache.

I put the following after the job configuration statements in the driver
program:

DistributedCache.addCacheFile(new Path("s3n://<Access Key>:<Secret Key>@<Bucket
Name>/file.txt").toUri(), job.getConfiguration());

Also, in my setup method in the mapper class, I have the statement below:

Path[] cacheFiles =
DistributedCache.getLocalCacheFiles(context.getConfiguration());

cacheFiles is getting assigned null.

Could you please let me know what I am doing wrong here? The file does exist
on S3.

Thanks,

Neeral


Re: trying to select technology

2011-05-31 Thread medcl

My suggestion:
ElasticSearch: http://elasticsearch.org


-----Original Message-----
From: Jane Chen

Sent: Wednesday, June 01, 2011 12:19 PM
To: core-u...@hadoop.apache.org ; common-user@hadoop.apache.org
Subject: Re: trying to select technology

Hi,

I think you should check out MarkLogic, a product with database and search 
capabilities especially designed for XML and unstructured data.  We also 
allow you to run Hadoop MapReduce jobs on top of data stored in MarkLogic.


For more information on MarkLogic, please check out:
http://www.marklogic.com/products/overview.html

Thanks,
Jane

--- On Tue, 5/31/11, cs230 chintanjs...@gmail.com wrote:


From: cs230 chintanjs...@gmail.com
Subject: trying to select technology
To: core-u...@hadoop.apache.org
Date: Tuesday, May 31, 2011, 10:50 AM

Hello All,

I am planning to start project where I have to do extensive
storage of xml
and text files. On top of that I have to implement
efficient algorithm for
searching over thousands or millions of files, and also do
some indexes to
make search faster next time.

I looked into Oracle database but it delivers very poor
result. Can I use
Hadoop for this? Which Hadoop project would be best fit for
this?

Is there anything from Google I can use?

Thanks a lot in advance.
--
View this message in context: 
http://old.nabble.com/trying-to-select-technology-tp31743063p31743063.html

Sent from the Hadoop core-user mailing list archive at
Nabble.com.