Using different file systems for Map Reduce job input and output

2008-10-06 Thread Naama Kraus
Hi,

I wanted to know if it is possible to use different file systems for Map
Reduce job input and output.
That is, have an M/R job's input reside on one file system and the M/R output be
written to another file system (e.g. input on HDFS, output on KFS; input on
HDFS, output on the local file system; or anything else ...).

Is it possible to somehow specify that through
FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath() ?
Or by any other mechanism ?

Thanks, Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales. (Albert
Einstein)


Re: Using different file systems for Map Reduce job input and output

2008-10-06 Thread Amareshwari Sriramadasu

Hi Naama,

Yes. It is possible to specify this using the APIs

FileInputFormat#setInputPaths() and FileOutputFormat#setOutputPath().

You can specify the FileSystem URI in the path.
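For example, a minimal sketch using the 0.18-era org.apache.hadoop.mapred API (class name, namenode host, and paths below are just placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CrossFsJobSetup {
  public static JobConf configure() {
    JobConf conf = new JobConf(CrossFsJobSetup.class);

    // Input read from HDFS (namenode host/port are placeholders).
    FileInputFormat.setInputPaths(conf,
        new Path("hdfs://namenode:9000/user/naama/input"));

    // Output written to the local file system; a kfs:// or s3:// URI
    // would work the same way.
    FileOutputFormat.setOutputPath(conf,
        new Path("file:///tmp/job-output"));

    return conf;
  }
}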

Thanks,
Amareshwari
Naama Kraus wrote:

Hi,

I wanted to know if it is possible to use different file systems for Map
Reduce job input and output.
I.e. have a M/R job input reside on one file system and the M/R output be
written to another file system (e.g. input on HDFS, output on KFS. Input on
HDFS output on local file system, or anything else ...).

Is it possible to somehow specify that through
FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath() ?
Or by any other mechanism ?

Thanks, Naama

  




A scalable gallery with hadoop?

2008-10-06 Thread Alberto Cusinato
Hi, I am a new user. I need to develop a huge media gallery. My requirements in a
nutshell are high scalability in the number of users, reliability of
users' data (photos, videos, docs, etc. uploaded by users), and an internal
search engine.
I've seen some posts about the applicability of Hadoop on web apps, mainly
with negative response (ie:
http://www.nabble.com/Hadoop-also-applicable-in-a-web-app-environment--to18836915.html#a18836915
 )
I know about Amazon AWS and I think that maybe S3+EC2 could be a solution
(even if I still don't know how to integrate the search engine), but I would
like to keep open the option of using my own hardware in the future.
I've seen that Hadoop provides an API for using S3 in place of HDFS, and so I
thought this (the Hadoop framework) was the right layer on which to base my app
(allowing me to store data locally or on S3 without changing the application
layer).
Now that I've configured a clustered environment with Hadoop, I'm still not
so sure about that.
I am a perfect newbie on this topic, so any suggestion is welcome! (Maybe
the right framework is HBase?)

Thanks,
Alberto.


Re: Using different file systems for Map Reduce job input and output

2008-10-06 Thread Naama Kraus
Thanks ! Naama

On Mon, Oct 6, 2008 at 10:27 AM, Amareshwari Sriramadasu 
[EMAIL PROTECTED] wrote:

 Hi Naama,

 Yes. It is possible to specify using the apis

 FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath().
 You can specify the FileSystem uri for the path.

 Thanks,
 Amareshwari

 Naama Kraus wrote:

 Hi,

 I wanted to know if it is possible to use different file systems for Map
 Reduce job input and output.
 I.e. have a M/R job input reside on one file system and the M/R output be
 written to another file system (e.g. input on HDFS, output on KFS. Input
 on
 HDFS output on local file system, or anything else ...).

 Is it possible to somehow specify that through
 FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath() ?
 Or by any other mechanism ?

 Thanks, Naama







-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales. (Albert
Einstein)


Re: Hadoop and security.

2008-10-06 Thread Steve Loughran

Dmitry Pushkarev wrote:
Dear hadoop users, 

 


I'm lucky to work in an academic environment where information security is not
an issue. However, I'm sure that most Hadoop users aren't so lucky.

 


Here is the question: how secure is Hadoop? (or, let's say, how foolproof?)


Right now Hadoop is about as secure as NFS. When deployed onto private 
datacentres with good physical security and well set up networks, you 
can control who gets at the data. Without that, you are sharing your 
state with anyone who can issue HTTP and Hadoop IPC requests.




 


Here is the answer:
http://www.google.com/search?client=opera&rls=en&q=Hadoop+Map/Reduce+Administration&sourceid=opera&ie=utf-8&oe=utf-8
not quite.



see also http://www.google.com/search?q=axis+happiness+page ; pages 
that we add for the benefit of the ops team end up sneaking out onto the big 
net.


 


What we're seeing here is an open Hadoop cluster, where anyone capable of
installing Hadoop and changing his username to 'webcrawl' can use the
cluster and read its data, even though the firewall is perfectly installed and
ports like ssh are filtered to outsiders. After you've played enough with the
data, you notice that you can submit jobs as well, and these jobs can
execute shell commands. Which is very, very sad.

 


In my view, this significantly limits distributed Hadoop applications, where
part of your cluster may reside on EC2 or another distant datacenter, since
you always need to have certain ports open to an array of IP addresses (if
your instances are dynamic), which isn't acceptable if anyone from that IP
range can connect to your cluster.


Well, maybe that's a fault of EC2's architecture, in which a deployment 
request doesn't include a declaration of the network configuration?




 


Can we propose that the developers introduce some basic user management and
access controls, to help Hadoop take one step further towards being a
production-quality system?




Being an open source project, you can do more than propose: you can help 
build some basic user-management and access controls. As to production 
quality, it is ready for production, albeit in locked-down datacentres, 
which is the primary deployment infrastructure of many of the active 
developers. As in most community-contributed open source projects, if 
you have specific needs beyond what the active developers need, you end 
up implementing them yourself.


The big issue with security is that it is all or nothing. Right now it 
is blatantly insecure, so you should not be surprised that anyone has 
access to your files. To actually lock it down, you would need to 
authenticate and possibly encrypt all communications; this adds a lot of 
overhead, which is why it will be avoided in the big datacentres. You 
also need to go to a lot of effort to make sure it is secure across the 
board, with no JSP pages providing accidental privilege escalation, and no 
API calls letting you see stuff you shouldn't. It's not like a normal 
feature defect where you can say "don't do that"; it's not so easy to 
validate using functional tests that test the expected uses of the code. 
This is why securing an application is such a hard thing to do.




--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Hadoop and security.

2008-10-06 Thread Edward Capriolo
You bring up some valid points. This would be a great topic for a
white paper. The first line of defense should be to apply inbound and
outbound iptables rules. Only source IPs that have a direct need to
interact with the cluster should be allowed to. The same is true with
the web access: only a range of source IPs should be allowed to
access the web interfaces. You can do this through SSH tunneling.

Preventing exec commands can be handled with the security manager and
the sandbox. I was thinking of only allowing the execution of signed jars
myself, but I never implemented it.
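As a rough illustration of the security-manager idea (purely a sketch, not anything Hadoop ships; a real sandbox would constrain far more than exec):

// Illustrative only: a SecurityManager whose checkExec() vetoes any attempt
// by task code to spawn an external process via Runtime.exec()/ProcessBuilder.
public class NoExecSecurityManager extends SecurityManager {
    @Override
    public void checkExec(String cmd) {
        // Refuse every attempt to launch an external command.
        throw new SecurityException("exec of '" + cmd + "' is not permitted");
    }

    // Kept permissive so the sketch stays small; a real sandbox would also
    // restrict file and network access here.
    @Override
    public void checkPermission(java.security.Permission perm) {
        // allow everything else in this sketch
    }
}

It would be installed once in the task JVM, e.g. System.setSecurityManager(new NoExecSecurityManager());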


Re: architecture diagram

2008-10-06 Thread Terrence A. Pietrondi
Can you explain "The location of these splits is semi-arbitrary"? What if the 
example was...

AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH


Does this mean the split might be between CCC such that it results in AAA|BBB|C 
and C|DDD for the first line? Is there a way to control this behavior to split 
on my delimiter?


Terrence A. Pietrondi


--- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Sunday, October 5, 2008, 9:26 PM
 Let's say you have one very large input file of the
 form:
 
 A|B|C|D
 E|F|G|H
 ...
 |1|2|3|4
 
 This input file will be broken up into N pieces, where N is
 the number of
 mappers that run.  The location of these splits is
 semi-arbitrary.  This
 means that unless you have one mapper, you won't be
 able to see the entire
 contents of a column in your mapper.  Given that you would
 need one mapper
 to be able to see the entirety of a column, you've now
 essentially reduced
 your problem to a single machine.
 
 You may want to play with the following idea: collect key
 = column_number
 and value = column_contents in your map step.  This
 means that you would be
 able to see the entirety of a column in your reduce step,
 though you're
 still faced with the tasks of shuffling and re-pivoting.
 
 Does this clear up your confusion?  Let me know if
 you'd like me to clarify
 more.
 
 Alex
 
 On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
 [EMAIL PROTECTED]
  wrote:
 
  I am not sure why this doesn't fit, maybe you can
 help me understand. Your
  previous comment was...
 
  The reason I'm making this claim is because
 in order to do the pivot
  operation you must know about every row. Your input
 files will be split at
  semi-arbitrary places, essentially making it
 impossible for each mapper to
  know every single row.
 
  Are you saying that my row segments might not actually
 be the entire row so
  I will get a bad key index? If so, would the row
 segments be determined? I
  based my initial work off of the word count example,
 where the lines are
  tokenized. Does this mean in this example the row
 tokens may not be the
  complete row?
 
  Thanks.
 
  Terrence A. Pietrondi
 
 
  --- On Fri, 10/3/08, Alex Loddengaard
 [EMAIL PROTECTED] wrote:
 
   From: Alex Loddengaard
 [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Friday, October 3, 2008, 7:14 PM
   The approach that you've described does not
 fit well in
   to the MapReduce
   paradigm.  You may want to consider randomizing
 your data
   in a different
   way.
  
   Unfortunately some things can't be solved
 well with
   MapReduce, and I think
   this is one of them.
  
   Can someone else say more?
  
   Alex
  
   On Fri, Oct 3, 2008 at 8:15 AM, Terrence A.
 Pietrondi
   [EMAIL PROTECTED]
wrote:
  
Sorry for the confusion, I did make some
 typos. My
   example should have
looked like...
   
 A|B|C
 D|E|G

 pivots too...

 D|A
 E|B
 G|C

 Then for each row, shuffle the contents
 around
   randomly...

 D|A
 B|E
 C|G

 Then pivot the data back...

 A|E|G
 D|B|C
   
The general goal is to shuffle the elements
 in each
   column in the input
data. Meaning, the ordering of the elements
 in each
   column will not be the
same as in input.
   
If you look at the initial input and compare
 to the
   final output, you'll
see that during the shuffling, B and E are
 swapped,
   and G and C are swapped,
while A and D were shuffled back into their
   originating positions in the
column.
   
Once again, sorry for the typos and
 confusion.
   
Terrence A. Pietrondi
   
--- On Fri, 10/3/08, Alex Loddengaard
   [EMAIL PROTECTED] wrote:
   
 From: Alex Loddengaard
   [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Friday, October 3, 2008, 11:01 AM
 Can you confirm that the example
 you've
   presented is
 accurate?  I think you
 may have made some typos, because the
 letter
   G
 isn't in the final result;
 I also think your first pivot
 accidentally
   swapped C and G.
  I'm having a
 hard time understanding what you want
 to do,
   because it
 seems like your
 operations differ from your example.

 With that said, at first glance, this
 problem may
   not fit
 well in to the
 MapReduce paradigm.  The reason I'm
 making
   this claim
 is because in order to
 do the pivot operation you must know
 about every
   row.  Your
 input files will
 be split at semi-arbitrary places,
 essentially
   making it
 impossible for each
 mapper to know every single row.  There
 may be a
   way to do
 this by
 collecting, in your map step, key =
 column
   number (0,
 1, 2, etc) and value
 = (A, B, C, etc), though you 

Re: Hadoop and security.

2008-10-06 Thread Allen Wittenauer



On 10/6/08 6:39 AM, Steve Loughran [EMAIL PROTECTED] wrote:

 Edward Capriolo wrote:
 You bring up some valid points. This would be a great topic for a
 white paper. 
 
 -a wiki page would be a start too

I was thinking about doing a "Deploying Hadoop Securely" talk for ApacheCon EU,
as by that time some of the basic Kerberos stuff should be in
place... This whole conversation is starting to reinforce the idea.




Re: Hadoop and security.

2008-10-06 Thread Steve Loughran

Edward Capriolo wrote:

You bring up some valid points. This would be a great topic for a
white paper. 


-a wiki page would be a start too


The first line of defense should be to apply inbound and

outbound iptables rules. Only source IPs that have a direct need to
interact with the cluster should be allowed to. The same is true with
the   web access. Only a range of source IP's should be allowed to
access the web interfaces. You can do this through SSH tunneling.

Preventing exec commands can be handled with the security manager and
the sandbox. I was thinking to only allow the execution of signed jars
myself but I never implemented it.



--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Hadoop and security.

2008-10-06 Thread Steve Loughran

Allen Wittenauer wrote:



On 10/6/08 6:39 AM, Steve Loughran [EMAIL PROTECTED] wrote:


Edward Capriolo wrote:

You bring up some valid points. This would be a great topic for a
white paper. 

-a wiki page would be a start too


I was thinking about doing Deploying Hadoop Securely for a ApacheCon EU
talk, as by that time, some of the basic Kerberos stuff should be in
place... This whole conversation is starting to reinforce the idea




-Start with an ApacheCon US fast-feather talk on the current state of 
affairs: a Hadoop cluster is as secure as a farm of machines running 
NFS. Just to let people know where things stand.


for the EU one, I will probably put in for one on deploying/managing 
using our toolset.


I'm also thinking of a talk, "datamining a city", that looks at what data 
sources a city is already instrumented with, if only you could get at 
them. The hard part is getting at them. I have my eye on our local speed 
camera/red-light cameras, which track the speed of every vehicle passing 
and the time of day; you could build up a map of traffic velocity, where the 
jams are, when, etc. Getting the machine-parsed number-plate data would 
be even more interesting, but governments tend to restrict that data to 
state security, rather than useful things like analysing and predicting 
traffic flow.


Re: architecture diagram

2008-10-06 Thread Alex Loddengaard
As far as I know, splits will never be made within a line, only between
rows.  To answer your question about ways to control the splits, see below:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
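One heavy-handed way to make sure every line of a file stays with a single mapper (a sketch against the old mapred API, at the cost of losing per-file parallelism) is to disable splitting altogether:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: an input format that never splits a file, so each file is
// consumed whole by exactly one mapper.
public class UnsplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}

It would be registered with jobConf.setInputFormat(UnsplittableTextInputFormat.class); the obvious trade-off is that a single large file no longer gets parallel mappers.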


Alex

On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 Can you explain The location of these splits is semi-arbitrary? What if
 the example was...

 AAA|BBB|CCC|DDD
 EEE|FFF|GGG|HHH


 Does this mean the split might be between CCC such that it results in
 AAA|BBB|C and C|DDD for the first line? Is there a way to control this
 behavior to split on my delimiter?


 Terrence A. Pietrondi


 --- On Sun, 10/5/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Sunday, October 5, 2008, 9:26 PM
  Let's say you have one very large input file of the
  form:
 
  A|B|C|D
  E|F|G|H
  ...
  |1|2|3|4
 
  This input file will be broken up into N pieces, where N is
  the number of
  mappers that run.  The location of these splits is
  semi-arbitrary.  This
  means that unless you have one mapper, you won't be
  able to see the entire
  contents of a column in your mapper.  Given that you would
  need one mapper
  to be able to see the entirety of a column, you've now
  essentially reduced
  your problem to a single machine.
 
  You may want to play with the following idea: collect key
  = column_number
  and value = column_contents in your map step.  This
  means that you would be
  able to see the entirety of a column in your reduce step,
  though you're
  still faced with the tasks of shuffling and re-pivoting.
 
  Does this clear up your confusion?  Let me know if
  you'd like me to clarify
  more.
 
  Alex
 
  On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   I am not sure why this doesn't fit, maybe you can
  help me understand. Your
   previous comment was...
  
   The reason I'm making this claim is because
  in order to do the pivot
   operation you must know about every row. Your input
  files will be split at
   semi-arbitrary places, essentially making it
  impossible for each mapper to
   know every single row.
  
   Are you saying that my row segments might not actually
  be the entire row so
   I will get a bad key index? If so, would the row
  segments be determined? I
   based my initial work off of the word count example,
  where the lines are
   tokenized. Does this mean in this example the row
  tokens may not be the
   complete row?
  
   Thanks.
  
   Terrence A. Pietrondi
  
  
   --- On Fri, 10/3/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Friday, October 3, 2008, 7:14 PM
The approach that you've described does not
  fit well in
to the MapReduce
paradigm.  You may want to consider randomizing
  your data
in a different
way.
   
Unfortunately some things can't be solved
  well with
MapReduce, and I think
this is one of them.
   
Can someone else say more?
   
Alex
   
On Fri, Oct 3, 2008 at 8:15 AM, Terrence A.
  Pietrondi
[EMAIL PROTECTED]
 wrote:
   
 Sorry for the confusion, I did make some
  typos. My
example should have
 looked like...

  A|B|C
  D|E|G
 
  pivots too...
 
  D|A
  E|B
  G|C
 
  Then for each row, shuffle the contents
  around
randomly...
 
  D|A
  B|E
  C|G
 
  Then pivot the data back...
 
  A|E|G
  D|B|C

 The general goal is to shuffle the elements
  in each
column in the input
 data. Meaning, the ordering of the elements
  in each
column will not be the
 same as in input.

 If you look at the initial input and compare
  to the
final output, you'll
 see that during the shuffling, B and E are
  swapped,
and G and C are swapped,
 while A and D were shuffled back into their
originating positions in the
 column.

 Once again, sorry for the typos and
  confusion.

 Terrence A. Pietrondi

 --- On Fri, 10/3/08, Alex Loddengaard
[EMAIL PROTECTED] wrote:

  From: Alex Loddengaard
[EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Friday, October 3, 2008, 11:01 AM
  Can you confirm that the example
  you've
presented is
  accurate?  I think you
  may have made some typos, because the
  letter
G
  isn't in the final result;
  I also think your first pivot
  accidentally
swapped C and G.
   I'm having a
  hard time understanding what you want
  to do,
because it
  seems like your
  operations differ from your example.
 
  With 

nagios to monitor hadoop datanodes!

2008-10-06 Thread Gerardo Velez
Hi Everyone!


I would like to implement Nagios health monitoring of a Hadoop grid.

Do some of you have experience here? Do you have any approach or advice I
could use?

At this time I've only been playing with the JSP files that Hadoop has
integrated into it, so I'm not sure whether it is a good idea to have
Nagios request monitoring info from these JSPs.


Thanks in advance!


-- Gerardo


Searching Lucene Index built using Hadoop

2008-10-06 Thread Saranath

I'm trying to index a large dataset using Hadoop+Lucene. I used the example
under hadoop/trunk/src/contrib/index/ for indexing. I'm unable to find a way
to search the index that was successfully built.

I tried copying over the index to one machine and merging them using
IndexWriter.addIndexesNoOptimize().
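For reference, that merge step might look roughly like this (a sketch assuming Lucene 2.x, with the shard directories already copied out of HDFS to local disk; all paths are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeShards {
  public static void main(String[] args) throws Exception {
    // Target index on local disk; 'true' creates it fresh.
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/local/merged-index"),
        new StandardAnalyzer(), true);

    // Shard directories previously copied out of HDFS (placeholder paths).
    Directory[] shards = new Directory[] {
        FSDirectory.getDirectory("/local/shard-00000"),
        FSDirectory.getDirectory("/local/shard-00001")
    };

    writer.addIndexesNoOptimize(shards);
    writer.optimize();
    writer.close();
  }
}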

I would like to hear your input on the best way to index+search large datasets.

Thanks,
Saranath
-- 
View this message in context: 
http://www.nabble.com/Searching-Lucene-Index-built-using-Hadoop-tp19842438p19842438.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: How to GET row name/column name in HBase using JAVA API

2008-10-06 Thread Jean-Daniel Cryans
Please use the HBase mailing list for HBase-related questions:
http://hadoop.apache.org/hbase/mailing_lists.html#Users

Regards your question, have you looked at
http://wiki.apache.org/hadoop/Hbase/HbaseRest ?

J-D

On Mon, Oct 6, 2008 at 12:05 AM, Trinh Tuan Cuong [EMAIL PROTECTED]
 wrote:

 Hello guys,



 I'm trying to use Java to manipulate HBase using its API. At the moment, I'm
 trying to do some simple CRUD activities with it, and here a little problem
 arises.



 What I am trying to do is: given an existing table, display/get the existing
 row names (in order to update and manipulate data) and the column names (as
 String? or bytes?). I can see that HStoreKey in the HBase API allows getting
 the row values, but it doesn't specify which table(s) it is working on.



 So if any of you could please show me how to get the rows and column family
 names of a given table, I would be thankful.



 Best Regards,



 Trịnh Tuấn Cường



 Luvina Software Company

 Website : www.luvina.net



 Address : 1001 Hoang Quoc Viet Street

 Email : [EMAIL PROTECTED],[EMAIL PROTECTED]

 Mobile : 097 4574 457






Re: Searching Lucene Index built using Hadoop

2008-10-06 Thread Stefan Groschupf

Hi,
you might find http://katta.wiki.sourceforge.net/ interesting. If you
have any Katta-related questions, please use the Katta mailing list.

Stefan

~~~
101tec Inc., Menlo Park, California
web:  http://www.101tec.com
blog: http://www.find23.net



On Oct 6, 2008, at 10:26 AM, Saranath wrote:



I'm trying to index a large dataset using Hadoop+Lucene. I used the  
example
under hadoop/trunk/src/conrib/index/ for indexing. I'm unable to  
find a way

to search the index that was successfully built.

I tried copying over the index to one machine and merging them using
IndexWriter.addIndexesNoOptimize().

I would like hear your input on the best way to index+search large  
datasets.


Thanks,
Saranath
--
View this message in context: 
http://www.nabble.com/Searching-Lucene-Index-built-using-Hadoop-tp19842438p19842438.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.






Weird problem running wordcount example from within Eclipse

2008-10-06 Thread Ski Gh3
Hi all,

I have a weird problem regarding running the wordcount example from eclipse.

I was able to run the wordcount example from the command line like:
$ ...MyHadoop/bin/hadoop jar ../MyHadoop/hadoop-xx-examples.jar wordcount
myinputdir myoutputdir

However, if I try to run the wordcount program from Eclipse (supplying the same
two args: myinputdir myoutputdir),
I get the following error message:

Exception in thread "main" java.lang.RuntimeException: java.io.IOException:
No FileSystem for scheme: file
at
org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:356)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:331)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:304)
at org.apache.hadoop.examples.WordCount.run(WordCount.java:149)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:161)
Caused by: java.io.IOException: No FileSystem for scheme: file
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1277)
at org.apache.hadoop.fs.FileSystem.access$1(FileSystem.java:1273)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108)
at
org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:352)
... 5 more

It seems that from within Eclipse, the program does not know how to interpret
myinputdir as a Hadoop path?

Can someone please tell me how I can fix this?

Thanks a lot!!!


Questions regarding adding resource via Configuration

2008-10-06 Thread Tarandeep Singh
Hi,

I have a configuration file (similar to hadoop-site.xml) and I want to
include this file as a resource while running Map-Reduce jobs. Similarly, I
want to add a jar file that is required by Mappers and Reducers

ToolRunner.run(...) allows me to do this easily; my question is, can I add
these files permanently? I am running a lot of different Map-Reduce jobs in
a loop, so is there a way I can add these files once so that subsequent jobs
need not add them?

Also, if I don't implement Tool and don't use ToolRunner.run, but call
Configuration.addResource(), will the parameters defined in my
configuration file be available in mappers and reducers?

Thanks,
Taran


Re: Turning off FileSystem statistics during MapReduce

2008-10-06 Thread Nathan Marz
We see this on Maps and only on incrementBytesRead (not on  
incrementBytesWritten). It is on HDFS where we are seeing the time  
spent. It seems that this is because incrementBytesRead is called  
every time a record is read, while incrementBytesWritten is only  
called when a buffer is spilled. We would benefit a lot from being  
able to turn this off.




On Oct 3, 2008, at 6:19 PM, Arun C Murthy wrote:


Nathan,

On Oct 3, 2008, at 5:18 PM, Nathan Marz wrote:


Hello,

We have been doing some profiling of our MapReduce jobs, and we are  
seeing about 20% of the time of our jobs is spent calling  
FileSystem$Statistics.incrementBytesRead when we interact with  
the FileSystem. Is there a way to turn this stats-collection off?




This is interesting... could you provide more details? Are you  
seeing this on Maps or Reduces? Which FileSystem exhibited this, i.e.  
HDFS or LocalFS? Any details about your application?


To answer your original question - no, there isn't a way to disable  
this. However, if this turns out to be a systemic problem we  
definitely should consider having an option to allow users to switch  
it off.


So any information you can provide helps - thanks!

Arun



Thanks,
Nathan Marz
Rapleaf







Add jar file via -libjars - giving errors

2008-10-06 Thread Tarandeep Singh
Hi,

I want to add a jar file (that is required by mappers and reducers) to the
classpath. Initially I had copied the jar file to all the slave nodes in the
$HADOOP_HOME/lib directory and it was working fine.

However when I tried the libjars option to add jar files -

$HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar


I got this error-

java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder

Can someone please tell me what needs to be fixed here ?

Thanks,
Taran


Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Mahadev Konar
Hi Tarandeep,
 the -libjars option does not add the jar on the client side. There is an
open JIRA for that (I don't remember which one)...

You have to add the jar to the

HADOOP_CLASSPATH on the client side so that it gets picked up on the client
side as well.


mahadev


On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:

 Hi,
 
 I want to add a jar file (that is required by mappers and reducers) to the
 classpath. Initially I had copied the jar file to all the slave nodes in the
 $HADOOP_HOME/lib directory and it was working fine.
 
 However when I tried the libjars option to add jar files -
 
 $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar
 
 
 I got this error-
 
 java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
 
 Can someone please tell me what needs to be fixed here ?
 
 Thanks,
 Taran



Re: architecture diagram

2008-10-06 Thread Terrence A. Pietrondi
So looking at the following mapper...

http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

On line 32, you can see the row split via a delimiter. On line 43, you can see 
that the field index (the column index) is the map key, and the map value is 
the field contents. How is this incorrect? I think this follows your earlier 
suggestion of:

You may want to play with the following idea: collect key = column_number and 
value = column_contents in your map step.
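For readers who don't follow the link, a rough sketch of such a mapper (hypothetical class and field names against the old mapred API; the actual PivotMapper may differ) is:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: emit (column index, field value) for every delimited field of a line.
public class ColumnPivotMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split("\\|");  // '|' is the assumed delimiter
    for (int i = 0; i < fields.length; i++) {
      // key = column number, value = column contents
      output.collect(new Text(Integer.toString(i)), new Text(fields[i]));
    }
  }
}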

Terrence A. Pietrondi


--- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

 From: Alex Loddengaard [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Monday, October 6, 2008, 12:55 PM
 As far as I know, splits will never be made within a line,
 only between
 rows.  To answer your question about ways to control the
 splits, see below:
 
 http://wiki.apache.org/hadoop/HowManyMapsAndReduces
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
 
 
 Alex
 
 On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
 [EMAIL PROTECTED]
  wrote:
 
  Can you explain The location of these splits is
 semi-arbitrary? What if
  the example was...
 
  AAA|BBB|CCC|DDD
  EEE|FFF|GGG|HHH
 
 
  Does this mean the split might be between CCC such
 that it results in
  AAA|BBB|C and C|DDD for the first line? Is there a way
 to control this
  behavior to split on my delimiter?
 
 
  Terrence A. Pietrondi
 
 
  --- On Sun, 10/5/08, Alex Loddengaard
 [EMAIL PROTECTED] wrote:
 
   From: Alex Loddengaard
 [EMAIL PROTECTED]
   Subject: Re: architecture diagram
   To: core-user@hadoop.apache.org
   Date: Sunday, October 5, 2008, 9:26 PM
   Let's say you have one very large input file
 of the
   form:
  
   A|B|C|D
   E|F|G|H
   ...
   |1|2|3|4
  
   This input file will be broken up into N pieces,
 where N is
   the number of
   mappers that run.  The location of these splits
 is
   semi-arbitrary.  This
   means that unless you have one mapper, you
 won't be
   able to see the entire
   contents of a column in your mapper.  Given that
 you would
   need one mapper
   to be able to see the entirety of a column,
 you've now
   essentially reduced
   your problem to a single machine.
  
   You may want to play with the following idea:
 collect key
   = column_number
   and value = column_contents in your map step.
  This
   means that you would be
   able to see the entirety of a column in your
 reduce step,
   though you're
   still faced with the tasks of shuffling and
 re-pivoting.
  
   Does this clear up your confusion?  Let me know
 if
   you'd like me to clarify
   more.
  
   Alex
  
   On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
 Pietrondi
   [EMAIL PROTECTED]
wrote:
  
I am not sure why this doesn't fit,
 maybe you can
   help me understand. Your
previous comment was...
   
The reason I'm making this claim
 is because
   in order to do the pivot
operation you must know about every row.
 Your input
   files will be split at
semi-arbitrary places, essentially making it
   impossible for each mapper to
know every single row.
   
Are you saying that my row segments might
 not actually
   be the entire row so
I will get a bad key index? If so, would the
 row
   segments be determined? I
based my initial work off of the word count
 example,
   where the lines are
tokenized. Does this mean in this example
 the row
   tokens may not be the
complete row?
   
Thanks.
   
Terrence A. Pietrondi
   
   
--- On Fri, 10/3/08, Alex Loddengaard
   [EMAIL PROTECTED] wrote:
   
 From: Alex Loddengaard
   [EMAIL PROTECTED]
 Subject: Re: architecture diagram
 To: core-user@hadoop.apache.org
 Date: Friday, October 3, 2008, 7:14 PM
 The approach that you've described
 does not
   fit well in
 to the MapReduce
 paradigm.  You may want to consider
 randomizing
   your data
 in a different
 way.

 Unfortunately some things can't be
 solved
   well with
 MapReduce, and I think
 this is one of them.

 Can someone else say more?

 Alex

 On Fri, Oct 3, 2008 at 8:15 AM,
 Terrence A.
   Pietrondi
 [EMAIL PROTECTED]
  wrote:

  Sorry for the confusion, I did
 make some
   typos. My
 example should have
  looked like...
 
   A|B|C
   D|E|G
  
   pivots too...
  
   D|A
   E|B
   G|C
  
   Then for each row, shuffle
 the contents
   around
 randomly...
  
   D|A
   B|E
   C|G
  
   Then pivot the data back...
  
   A|E|G
   D|B|C
 
  The general goal is to shuffle the
 elements
   in each
 column in the input
  data. Meaning, the ordering of the
 elements
   in each
 column will not be the
  same as in input.
 
  If you look at the initial input
 and compare
   

Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Tarandeep Singh
Thanks Mahadev for the reply.
So that means I have to copy my jar file into the $HADOOP_HOME/lib folder on
all slave machines, like before.

One more question: I am adding a conf file (just like hadoop-site.xml) via the
-conf option and I am able to query parameters in mappers/reducers. But is
there a way I can query the parameters in my job driver class?

public class jobDriver extends Configured
{
   someMethod( )
   {
  ToolRunner.run( new MyJob( ), commandLineArgs);
  // I want to query parameters present in my conf file here
   }
}

public class MyJob extends Configured implements Tool
{
}

Thanks,
Taran

On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar [EMAIL PROTECTED] wrote:

 Hi Tarandeep,
  the -libjars option does not add the jar on the client side. There is an
 open JIRA for that (I don't remember which one)...

 You have to add the jar to the

 HADOOP_CLASSPATH on the client side so that it gets picked up on the client
 side as well.


 mahadev


 On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:

  Hi,
 
  I want to add a jar file (that is required by mappers and reducers) to
 the
  classpath. Initially I had copied the jar file to all the slave nodes in
 the
  $HADOOP_HOME/lib directory and it was working fine.
 
  However when I tried the libjars option to add jar files -
 
  $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars
 jdom.jar
 
 
  I got this error-
 
  java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
 
  Can someone please tell me what needs to be fixed here ?
 
  Thanks,
  Taran




Re: architecture diagram

2008-10-06 Thread Alex Loddengaard
This mapper does follow my original suggestion, though I'm not familiar with
how the delimiter works in this example.  Anyone else?

Alex

On Mon, Oct 6, 2008 at 2:55 PM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 So looking at the following mapper...


 http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

 On line 32, you can see the row split via a delimiter. On line 43, you can
 see that the field index (the column index) is the map key, and the map
 value is the field contents. How is this incorrect? I think this follows
 your earlier suggestion of:

 You may want to play with the following idea: collect key = column_number
 and value = column_contents in your map step.

 Terrence A. Pietrondi


 --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Monday, October 6, 2008, 12:55 PM
  As far as I know, splits will never be made within a line,
  only between
  rows.  To answer your question about ways to control the
  splits, see below:
 
  http://wiki.apache.org/hadoop/HowManyMapsAndReduces
  
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
  
 
  Alex
 
  On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   Can you explain The location of these splits is
  semi-arbitrary? What if
   the example was...
  
   AAA|BBB|CCC|DDD
   EEE|FFF|GGG|HHH
  
  
   Does this mean the split might be between CCC such
  that it results in
   AAA|BBB|C and C|DDD for the first line? Is there a way
  to control this
   behavior to split on my delimiter?
  
  
   Terrence A. Pietrondi
  
  
   --- On Sun, 10/5/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Sunday, October 5, 2008, 9:26 PM
Let's say you have one very large input file
  of the
form:
   
A|B|C|D
E|F|G|H
...
|1|2|3|4
   
This input file will be broken up into N pieces,
  where N is
the number of
mappers that run.  The location of these splits
  is
semi-arbitrary.  This
means that unless you have one mapper, you
  won't be
able to see the entire
contents of a column in your mapper.  Given that
  you would
need one mapper
to be able to see the entirety of a column,
  you've now
essentially reduced
your problem to a single machine.
   
You may want to play with the following idea:
  collect key
= column_number
and value = column_contents in your map step.
   This
means that you would be
able to see the entirety of a column in your
  reduce step,
though you're
still faced with the tasks of shuffling and
  re-pivoting.
   
Does this clear up your confusion?  Let me know
  if
you'd like me to clarify
more.
   
Alex
   
On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
  Pietrondi
[EMAIL PROTECTED]
 wrote:
   
 I am not sure why this doesn't fit,
  maybe you can
help me understand. Your
 previous comment was...

 The reason I'm making this claim
  is because
in order to do the pivot
 operation you must know about every row.
  Your input
files will be split at
 semi-arbitrary places, essentially making it
impossible for each mapper to
 know every single row.

 Are you saying that my row segments might
  not actually
be the entire row so
 I will get a bad key index? If so, would the
  row
segments be determined? I
 based my initial work off of the word count
  example,
where the lines are
 tokenized. Does this mean in this example
  the row
tokens may not be the
 complete row?

 Thanks.

 Terrence A. Pietrondi


 --- On Fri, 10/3/08, Alex Loddengaard
[EMAIL PROTECTED] wrote:

  From: Alex Loddengaard
[EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Friday, October 3, 2008, 7:14 PM
  The approach that you've described
  does not
fit well in
  to the MapReduce
  paradigm.  You may want to consider
  randomizing
your data
  in a different
  way.
 
  Unfortunately some things can't be
  solved
well with
  MapReduce, and I think
  this is one of them.
 
  Can someone else say more?
 
  Alex
 
  On Fri, Oct 3, 2008 at 8:15 AM,
  Terrence A.
Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   Sorry for the confusion, I did
  make some
typos. My
  example should have
   looked like...
  
A|B|C
D|E|G
   
pivots too...
   
D|A
E|B
G|C
   
Then for each row, shuffle
  the contents

Re: is 12 minutes ok for dfs chown -R on 45000 files ?

2008-10-06 Thread Allen Wittenauer



On 10/2/08 11:33 PM, Frank Singleton [EMAIL PROTECTED] wrote:

 Just to clarify, this is for when the chown will modify all files owner
 attributes
 
 eg: toggle all from frank:frank to hadoop:hadoop (see below)

When we converted from 0.15 to 0.16, we chown'ed all of our files.  The
local dev team wrote the code in
https://issues.apache.org/jira/browse/HADOOP-3052 , but it wasn't committed
as a standard feature as they viewed this as a one off. :(

Needless to say, running a large chown as a MR job should be
significantly faster.



Why is super user privilege required for FS statistics?

2008-10-06 Thread Brian Bockelman

Hey all,

I noticed something really funny about fuse-dfs: because super-user  
privileges are required to run the getStats function in  
FSNamesystem.java, my file systems show up as having 16 exabytes total  
and 0 bytes free.  If I mount fuse-dfs as root, then I get the correct  
results from df.


Is this an oversight?  Is there any good reason I shouldn't file a bug  
to make the getStats command (which only returns the used, free, and  
total space in the file system) not require superuser privilege?


Brian



Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?

2008-10-06 Thread Andy Li
Dears,

Sorry, I did not mean to cross-post, but the previous article was
accidentally posted to the HBase user list. I would like to bring it back
to the Hadoop user list since it is confusing me a lot and it is mainly MapReduce
related.

Currently running version hadoop-0.18.1 on 25 nodes.  Map and Reduce Task
Capacity is 92.  When I do this in my MapReduce program:

= SAMPLE CODE =
JobConf jconf = new JobConf(conf, TestTask.class);
jconf.setJobName("my.test.TestTask");
jconf.setOutputKeyClass(Text.class);
jconf.setOutputValueClass(Text.class);
jconf.setOutputFormat(TextOutputFormat.class);
jconf.setMapperClass(MyMapper.class);
jconf.setCombinerClass(MyReducer.class);
jconf.setReducerClass(MyReducer.class);
jconf.setInputFormat(TextInputFormat.class);
try {
jconf.setNumMapTasks(5);
jconf.setNumReduceTasks(3);
JobClient.runJob(jconf);
} catch (Exception e) {
e.printStackTrace();
}
= = =

When I run the job, I'm always getting 300 mappers and 1 reducer on the
JobTracker webpage running on the default port 50030.
No matter how I configure the numbers in the setNumMapTasks and
setNumReduceTasks methods, I get the same result.
Does anyone know why this is happening?
Am I missing something or misunderstanding something in the picture?  =(

Here's a reference to the parameters we have overridden in hadoop-site.xml.
===
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>

other parameters are default from hadoop-default.xml.

Any idea how this is happening?

Any inputs are appreciated.

Thanks,
-Andy


Re: architecture diagram

2008-10-06 Thread Samuel Guo
I think what Alex talked about as 'split' is the MapReduce system's action.
What you talked about as 'split' is your mapper's action.

I guess that your map/reduce application uses *TextInputFormat* to read
your input file.

Your input file will first be split into a few splits. Each split looks
something like (filename, offset, length). What Alex said about 'The location of
these splits is semi-arbitrary' means that a file split's offset into your
input file is semi-arbitrary. Am I right, Alex?
Then *TextInputFormat* will translate these file splits into a sequence of
lines, where the offset is treated as the key and the line is treated as the value.

Because these file splits are made by offset, some lines in your file may be
split across different file splits. The *LineRecordReader* used by
*TextInputFormat* skips the half-baked (partial) line at the start of a file
split, to make sure that every mapper gets complete lines, one by one.

For examples:

a file as below:

AAA BBB CCC DDD
EEE FFF GGG HHH
AAA BBB CCC DDD


it may be split into two file splits (we assume that there are two
mappers):
split one:

AAA BBB CCC

split two:
DDD
EEE FFF GGG HHH
AAA BBB CCC DDD


Take split two as an example:
TextInputFormat will use LineRecordReader to translate split two into a
sequence of <offset, line> pairs, and it will skip the first half-baked line
"DDD", so the sequence will be:
<offset1, EEE FFF GGG HHH>
<offset2, AAA BBB CCC DDD>


Then what to do with the lines depends on your job.


On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi [EMAIL PROTECTED]
 wrote:

 So looking at the following mapper...


 http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

 On line 32, you can see the row split via a delimiter. On line 43, you can
 see that the field index (the column index) is the map key, and the map
 value is the field contents. How is this incorrect? I think this follows
 your earlier suggestion of:

 You may want to play with the following idea: collect key = column_number
 and value = column_contents in your map step.

 Terrence A. Pietrondi


 --- On Mon, 10/6/08, Alex Loddengaard [EMAIL PROTECTED] wrote:

  From: Alex Loddengaard [EMAIL PROTECTED]
  Subject: Re: architecture diagram
  To: core-user@hadoop.apache.org
  Date: Monday, October 6, 2008, 12:55 PM
  As far as I know, splits will never be made within a line,
  only between
  rows.  To answer your question about ways to control the
  splits, see below:
 
  http://wiki.apache.org/hadoop/HowManyMapsAndReduces
  
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
  
 
  Alex
 
  On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
  [EMAIL PROTECTED]
   wrote:
 
   Can you explain The location of these splits is
  semi-arbitrary? What if
   the example was...
  
   AAA|BBB|CCC|DDD
   EEE|FFF|GGG|HHH
  
  
   Does this mean the split might be between CCC such
  that it results in
   AAA|BBB|C and C|DDD for the first line? Is there a way
  to control this
   behavior to split on my delimiter?
  
  
   Terrence A. Pietrondi
  
  
   --- On Sun, 10/5/08, Alex Loddengaard
  [EMAIL PROTECTED] wrote:
  
From: Alex Loddengaard
  [EMAIL PROTECTED]
Subject: Re: architecture diagram
To: core-user@hadoop.apache.org
Date: Sunday, October 5, 2008, 9:26 PM
Let's say you have one very large input file
  of the
form:
   
A|B|C|D
E|F|G|H
...
|1|2|3|4
   
This input file will be broken up into N pieces,
  where N is
the number of
mappers that run.  The location of these splits
  is
semi-arbitrary.  This
means that unless you have one mapper, you
  won't be
able to see the entire
contents of a column in your mapper.  Given that
  you would
need one mapper
to be able to see the entirety of a column,
  you've now
essentially reduced
your problem to a single machine.
   
You may want to play with the following idea:
  collect key
= column_number
and value = column_contents in your map step.
   This
means that you would be
able to see the entirety of a column in your
  reduce step,
though you're
still faced with the tasks of shuffling and
  re-pivoting.
   
Does this clear up your confusion?  Let me know
  if
you'd like me to clarify
more.
   
Alex
   
On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
  Pietrondi
[EMAIL PROTECTED]
 wrote:
   
 I am not sure why this doesn't fit,
  maybe you can
help me understand. Your
 previous comment was...

 The reason I'm making this claim
  is because
in order to do the pivot
 operation you must know about every row.
  Your input
files will be split at
 semi-arbitrary places, essentially making it
impossible for each mapper to
 know every single row.

 Are you saying that my row segments might
  not actually
be the entire row so
 I will get 

Re: Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?

2008-10-06 Thread Samuel Guo
The number of mappers depends on your InputFormat.
The default InputFormat tries to treat every file block of a file as an InputSplit,
and you will get the same number of mappers as the number of your
InputSplits.
Try configuring mapred.min.split.size to reduce the number of your mappers
if you want to.

And I don't know why your reducer count is just one. Does anyone know?
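As a sketch, those knobs would be set on the JobConf roughly like this (the split-size value is just an example):

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeExample {
  public static JobConf tune(JobConf jconf) {
    // Ask for splits of at least 256 MB so fewer map tasks are created;
    // the real number still depends on the InputFormat and block layout.
    jconf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

    // The requested number of reduce tasks for the job.
    jconf.setNumReduceTasks(3);
    return jconf;
  }
}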

On Tue, Oct 7, 2008 at 9:06 AM, Andy Li [EMAIL PROTECTED] wrote:

 Dears,

 Sorry, I did not mean to cross post.  But the previous article was
 accidentally posted to the HBase user list.  I would like to bring it back
 to the Hadoop user since it is confusing me a lot and it is mainly
 MapReduce
 related.

 Currently running version hadoop-0.18.1 on 25 nodes.  Map and Reduce Task
 Capacity is 92.  When I do this in my MapReduce program:

 = SAMPLE CODE =
JobConf jconf = new JobConf(conf, TestTask.class);
 jconf.setJobName("my.test.TestTask");
jconf.setOutputKeyClass(Text.class);
jconf.setOutputValueClass(Text.class);
jconf.setOutputFormat(TextOutputFormat.class);
jconf.setMapperClass(MyMapper.class);
jconf.setCombinerClass(MyReducer.class);
jconf.setReducerClass(MyReducer.class);
jconf.setInputFormat(TextInputFormat.class);
try {
jconf.setNumMapTasks(5);
jconf.setNumReduceTasks(3);
JobClient.runJob(jconf);
} catch (Exception e) {
e.printStackTrace();
}
 = = =

 When I run the job, I'm always getting 300 mappers and 1 reducers from the
 JobTracker webpage running on the default port 50030.
 No matter how I configure the numbers in methods setNumMapTasks and
 setNumReduceTasks, I get the same result.
 Does anyone know why this is happening?
 Am I missing something or misunderstand something in the picture?  =(

 Here's a reference to the parameters we have overridden in hadoop-site.xml.
 ===
 <property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
 </property>

 <property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
 </property>
 
 other parameters are default from hadoop-default.xml.

 Any idea how this is happening?

 Any inputs are appreciated.

 Thanks,
 -Andy



Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Taeho Kang
Adding your jar files in the $HADOOP_HOME/lib folder works, but you would
have to restart all your tasktrackers to have your jar files loaded.

If you repackage your map-reduce jar file (e.g. hadoop-0.18.0-examples.jar)
with your jar file and run your job with the newly repackaged jar file, it
would work, too.

On Tue, Oct 7, 2008 at 6:55 AM, Tarandeep Singh [EMAIL PROTECTED] wrote:

 thanks Mahadev for the reply.
 So that means I have to copy my jar file in the $HADOOP_HOME/lib folder on
 all slave machines like before.

 One more question- I am adding a conf file (just like HADOOP_SITE.xml) via
 -conf option and I am able to query parameters in mapper/reducers. But is
 there a way I can query the parameters in my job driver class -

 public class jobDriver extends Configured
 {
   someMethod( )
   {
  ToolRunner.run( new MyJob( ), commandLineArgs);
  // I want to query parameters present in my conf file here
   }
 }

 public class MyJob extends Configured implements Tool
 {
 }

 Thanks,
 Taran

 On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar [EMAIL PROTECTED]
 wrote:

  Hi Tarandeep,
   the -libjars option does not add the jar on the client side. There is an
  open JIRA for that (I don't remember which one)...

  You have to add the jar to the

  HADOOP_CLASSPATH on the client side so that it gets picked up on the client
  side as well.
 
 
  mahadev
 
 
  On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:
 
   Hi,
  
   I want to add a jar file (that is required by mappers and reducers) to
  the
   classpath. Initially I had copied the jar file to all the slave nodes
 in
  the
   $HADOOP_HOME/lib directory and it was working fine.
  
   However when I tried the libjars option to add jar files -
  
   $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars
  jdom.jar
  
  
   I got this error-
  
   java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
  
   Can someone please tell me what needs to be fixed here ?
  
   Thanks,
   Taran
 
 



Re: nagios to monitor hadoop datanodes!

2008-10-06 Thread Taeho Kang
The easiest approach I can think of is to write a simple Nagios plugin that
checks whether the datanode JVM process is alive. Or you may
write a Nagios plugin that checks for error or warning messages in the datanode
logs. (I am sure you can find quite a few log-checking Nagios plugins on
nagiosplugin.org.)
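As an illustration, a minimal liveness check of this kind (hypothetical class name; 50075 is the default datanode web port) could be as simple as:

import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of a Nagios-style check: verify that the datanode web port accepts
// connections. Exit code 0 = OK, 2 = CRITICAL, per the Nagios plugin convention.
public class CheckDataNode {
    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 50075;
        try {
            Socket s = new Socket();
            s.connect(new InetSocketAddress(host, port), 5000); // 5s timeout
            s.close();
            System.out.println("DATANODE OK - " + host + ":" + port + " is reachable");
            System.exit(0);
        } catch (Exception e) {
            System.out.println("DATANODE CRITICAL - " + host + ":" + port + ": " + e.getMessage());
            System.exit(2);
        }
    }
}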

If you are unsure of how to write a Nagios plugin, I suggest you read the
article "Leverage Nagios with plug-ins you write"
(http://www.ibm.com/developerworks/aix/library/au-nagios/), as it has good
explanations and examples of how to write a Nagios plugin.

Or if you've got time to burn, you might want to read the Nagios documentation,
too.

Let me know if you need help on this matter.

/Taeho



On Tue, Oct 7, 2008 at 2:05 AM, Gerardo Velez [EMAIL PROTECTED]wrote:

 Hi Everyone!


 I would like to implement Nagios health monitoring of a Hadoop grid.

 Some of you have some experience here, do you hace any approach or advice I
 could use.

 At this time I've been only playing with jsp's files that hadoop has
 integrated into it. so I;m not sure if it could be a good idea that
 nagios monitor request info to these jsp?


 Thanks in advance!


 -- Gerardo



Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Mahadev Konar
You can just add the jar to the env variable HADOOP_CLASSPATH.

If using bash,

just do this:

export HADOOP_CLASSPATH=<path to your classpath on the client>

And then use the -libjars option.

mahadev


On 10/6/08 2:55 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:

 thanks Mahadev for the reply.
 So that means I have to copy my jar file in the $HADOOP_HOME/lib folder on
 all slave machines like before.
 
 One more question- I am adding a conf file (just like HADOOP_SITE.xml) via
 -conf option and I am able to query parameters in mapper/reducers. But is
 there a way I can query the parameters in my job driver class -
 
 public class jobDriver extends Configured
 {
someMethod( )
{
   ToolRunner.run( new MyJob( ), commandLineArgs);
   // I want to query parameters present in my conf file here
}
 }
 
 public class MyJob extends Configured implements Tool
 {
 }
 
 Thanks,
 Taran
 
 On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar [EMAIL PROTECTED] wrote:
 
 Hi Tarandeep,
  the -libjars option does not add the jar on the client side. There is an
 open JIRA for that (I don't remember which one)...

 You have to add the jar to the

 HADOOP_CLASSPATH on the client side so that it gets picked up on the client
 side as well.
 
 
 mahadev
 
 
 On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:
 
 Hi,
 
 I want to add a jar file (that is required by mappers and reducers) to
 the
 classpath. Initially I had copied the jar file to all the slave nodes in
 the
 $HADOOP_HOME/lib directory and it was working fine.
 
 However when I tried the libjars option to add jar files -
 
 $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars
 jdom.jar
 
 
 I got this error-
 
 java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
 
 Can someone please tell me what needs to be fixed here ?
 
 Thanks,
 Taran
 
 



Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Amareshwari Sriramadasu

Hi,

From 0.19, the jars added using -libjars are available on the client 
classpath also, fixed by HADOOP-3570.


Thanks
Amareshwari

Mahadev Konar wrote:

Hi Tarandeep,
 the -libjars option does not add the jar on the client side. There is an
open JIRA for that (I don't remember which one)...

You have to add the jar to the

HADOOP_CLASSPATH on the client side so that it gets picked up on the client
side as well.


mahadev


On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:

  

Hi,

I want to add a jar file (that is required by mappers and reducers) to the
classpath. Initially I had copied the jar file to all the slave nodes in the
$HADOOP_HOME/lib directory and it was working fine.

However when I tried the libjars option to add jar files -

$HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar


I got this error-

java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder

Can someone please tell me what needs to be fixed here ?

Thanks,
Taran