How to configure SWIM

2012-03-01 Thread Arvind
Hi all,
Can anybody help me configure SWIM -- Statistical Workload Injector for
MapReduce -- on my Hadoop cluster?



Re: Browse the filesystem weblink broken after upgrade to 1.0.0: HTTP 404 Problem accessing /browseDirectory.jsp

2012-03-01 Thread madhu phatak
On Wed, Feb 29, 2012 at 11:34 PM, W.P. McNeill bill...@gmail.com wrote:

 I can perform HDFS operations from the command line, like hadoop fs -ls
 /. Doesn't that mean that the datanode is up?


  No. That is just a metadata lookup, which is served by the Namenode. Try to
cat some file with hadoop fs -cat. If you are able to get data, then the
datanode should be up. Also make sure that HDFS is not in safe mode. To
turn off safe mode use the hdfs command hadoop dfsadmin -safemode leave and
then restart the jobtracker and tasktracker.
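
For reference, the same check can be done programmatically. This is only an
illustrative sketch (the class name is made up and the path comes from the
command line); reading actual bytes forces contact with a datanode, while a
plain listing only talks to the namenode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DatanodeReadCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Open an existing HDFS file passed on the command line and read a block of it.
    FSDataInputStream in = fs.open(new Path(args[0]));
    byte[] buf = new byte[4096];
    int n = in.read(buf);   // succeeds only if a datanode can serve the block
    System.out.println("Read " + n + " bytes; at least one datanode is serving data.");
    in.close();
  }
}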



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Distributed Indexing on MapReduce

2012-03-01 Thread Frank Scholten
Hi all,

I am looking into reusing some existing code for distributed indexing
to test a Mahout tool I am working on
https://issues.apache.org/jira/browse/MAHOUT-944

What I want is to index the Apache Public Mail Archives dataset (200G)
via MapReduce on Hadoop.

I have been going through the Nutch and contrib/index code and from my
understanding I have to:

* Create an InputFormat / RecordReader / InputSplit class for
splitting the e-mails across mappers
* Create a Mapper which emits the e-mails as key value pairs
* Create a Reducer which indexes the e-mails on the local filesystem
(or straight to HDFS?)
* Copy these indexes from local filesystem to HDFS. In the same Reducer?

I am unsure about the final steps: how to get to the end result, a
bunch of index shards on HDFS. It seems
that each Reducer needs to be aware of a directory it eventually
writes to on HDFS, but I don't see how to get each reducer to copy its
shard to HDFS.
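
One way that last copy step could look is sketched below. This is only an
illustration (the class name, key/value types, local directory and HDFS paths
are all assumptions, and the actual Lucene indexing in reduce() is omitted):
each reducer builds its index locally and then copies it to a per-task shard
directory on HDFS in cleanup().

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IndexShardReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

  // reduce() (omitted here) would add documents to a local Lucene index in this directory
  private final File localIndexDir = new File("/tmp/local-index");

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // One shard directory per reduce task, named after the task id.
    int shard = context.getTaskAttemptID().getTaskID().getId();
    fs.copyFromLocalFile(new Path(localIndexDir.getAbsolutePath()),
                         new Path("/indexes/shard-" + shard));
  }
}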

How do I set this up?

Cheers,

Frank


Re: Streaming Hadoop using C

2012-03-01 Thread Charles Earl
How was your experience of starfish?
C
On Mar 1, 2012, at 12:35 AM, Mark question wrote:

 Thank you for your time and suggestions, I've already tried starfish, but
 not jmap. I'll check it out.
 Thanks again,
 Mark
 
 On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.comwrote:
 
 I assume you have also just tried running locally and using the jdk
 performance tools (e.g. jmap) to gain insight by configuring hadoop to run
 absolute minimum number of tasks?
 Perhaps the discussion
 
 http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
 might be relevant?
 On Feb 29, 2012, at 3:53 PM, Mark question wrote:
 
  I've used Hadoop profiling (.prof) to show the stack trace, but it was hard
  to follow. I used jConsole locally, since I couldn't find a way to set a port
  number for child processes when running them remotely. Linux commands
  (top, /proc) showed me that the virtual memory is almost twice my
  physical memory, which means swapping is happening, which is what I'm trying to
  avoid.
 
 So basically, is there a way to assign a port to child processes to
 monitor
 them remotely (asked before by Xun) or would you recommend another
 monitoring tool?
 
 Thank you,
 Mark
 
 
 On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++
 application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
 Thanks Charles .. I'm running Hadoop for research to perform duplicate
 detection methods. To go deeper, I need to understand what's slowing my
 program, which usually starts with analyzing memory to predict best
 input
 size for map task. So you're saying piping can help me control memory
 even
 though it's running on VM eventually?
 
 Thanks,
 Mark
 
 On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl 
 charles.ce...@gmail.com
 wrote:
 
 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the
 level
 of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
 Hi guys, thought I should ask this before I use it ... will using C
 over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned
 into
 bytecode, but I need more control on memory which obviously is hard
 for
 me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over
 hadoop.
 Thank you,
 Mark
 
 
 
 
 
 



Re: Should splittable Gzip be a core hadoop feature?

2012-03-01 Thread Michel Segel

 I do agree that a GitHub project is the way to go unless you could convince 
Cloudera, Hortonworks or MapR to pick it up and support it. They have enough 
committers.

Is this potentially worthwhile? Maybe; it depends on how the cluster is 
integrated into the overall environment. Companies that have standardized on 
using gzip would find it useful.



Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 29, 2012, at 3:17 PM, Niels Basjes ni...@basjes.nl wrote:

 Hi,
 
 On Wed, Feb 29, 2012 at 19:13, Robert Evans ev...@yahoo-inc.com wrote:
 
 
 What I really want to know is how well does this new CompressionCodec
 perform in comparison to the regular gzip codec in
 
 various different conditions and what type of impact does it have on
 network traffic and datanode load.  My gut feeling is that
 
 the speedup is going to be relatively small except when there is a lot of
 computation happening in the mapper
 
 
  I agree, I made the same assessment.
  In the javadoc I wrote, under "When is this useful?":
 *Assume you have a heavy map phase for which the input is a 1GiB Apache
 httpd logfile. Now assume this map takes 60 minutes of CPU time to run.*
 
 
 and the added load and network traffic outweighs the speedup in most
 cases,
 
 
  No, the trick to solve that one is to upload the gzipped files with an HDFS
  blocksize equal to (or 1 byte larger than) the file size.
  This setting will help in speeding up gzipped input files in any situation
  (no more network overhead).
  From there, the HDFS replication factor of the file dictates the
  optimal number of splits for this codec.
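
  A minimal sketch of that upload trick using the FileSystem API follows; the
  paths are made up, and the create() overload shown is just one way to pass a
  per-file block size. Passing -D dfs.block.size=<bytes> to hadoop fs -put
  should achieve the same thing from the shell.

  import java.io.File;
  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class GzipUpload {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      File local = new File("/tmp/access_log.gz");   // illustrative local path
      // Block size must cover the whole file; round up to a 512-byte multiple
      // so the checksum-chunk check is satisfied.
      long blockSize = ((local.length() / 512) + 1) * 512;
      FSDataOutputStream out = fs.create(new Path("/logs/access_log.gz"), true,
          conf.getInt("io.file.buffer.size", 4096),
          fs.getDefaultReplication(), blockSize);
      // Copy the local file into the single-block HDFS file and close both streams.
      IOUtils.copyBytes(new FileInputStream(local), out, conf, true);
    }
  }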
 
 
 but like all performance on a complex system gut feelings are
 
 almost worthless and hard numbers are what is needed to make a judgment
 call.
 
 
 Yes
 
 
 Niels, I assume you have tested this on your cluster(s).  Can you share
 with us some of the numbers?
 
 
  No, I haven't tested it beyond a multi-core system.
  The simple reason for that is that when this was under review last summer,
  the whole Yarn thing happened
  and I was unable to run it at all for a long time.
  I only got it running again last December, when the restructuring of the
  source tree was mostly done.

  At this moment I'm building an experimentation setup at work that can be
  used for various things.
 Given the current state of Hadoop 2.0 I think it's time to produce some
 actual results.
 
 -- 
 Best regards / Met vriendelijke groeten,
 
 Niels Basjes


Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Merto Mertek
From the fairscheduler docs I assume the following should work:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

which means that the default pool will be the group of the user that has
submitted the job. In your case I think that allocations.xml is correct. If
you want to explicitly assign a job to a specific pool from your
allocations.xml file, you can do it as follows:

Configuration conf3 = conf;
conf3.set("pool.name", "pool3"); // conf.set("property.name", "value")

Let me know if it works..


On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote:

 How can I set the fair scheduler such that all jobs submitted from a
 particular user group go to a pool with the group name?

 I have setup fair scheduler and I have two users: A and B (belonging to the
 user group hadoop)

 When these users submit hadoop jobs, the jobs from A go to a pool named A
 and the jobs from B go to a pool named B.
  I want them to go to a pool with their group name, so I tried adding the
 following to mapred-site.xml:

 <property>
   <name>mapred.fairscheduler.poolnameproperty</name>
   <value>group.name</value>
 </property>

 But instead the jobs now go to the default pool.
 I want the jobs submitted by A and B to go to the pool named hadoop. How
 do I do that?
 Also, how can I explicitly assign a job to any specified pool?

 I have set the allocation file (fair-scheduler.xml) like this:

 <allocations>
   <pool name="hadoop">
     <minMaps>1</minMaps>
     <minReduces>1</minReduces>
     <maxMaps>3</maxMaps>
     <maxReduces>3</maxReduces>
   </pool>
   <userMaxJobsDefault>5</userMaxJobsDefault>
 </allocations>

 Any help is greatly appreciated.
 Thanks,
 Austin



RE: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Dave Shine
I've just started playing with the Fair Scheduler.  To specify the pool at job 
submission time you set the mapred.fairscheduler.pool property on the Job 
Conf to the name of the pool you want the job to use.
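
In a job driver that could look roughly like the sketch below; the driver
class and the pool name "hadoop" are placeholders, not something from this
thread:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitToPool {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SubmitToPool.class);
    job.setJobName("my job");
    // Ask the Fair Scheduler to place this job in the "hadoop" pool.
    job.set("mapred.fairscheduler.pool", "hadoop");
    // ... set mapper/reducer classes, input and output paths here ...
    JobClient.runJob(job);
  }
}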

Dave


-Original Message-
From: Merto Mertek [mailto:masmer...@gmail.com]
Sent: Thursday, March 01, 2012 9:33 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool

From the fairscheduler docs I assume the following should work:

property
 namemapred.fairscheduler.poolnameproperty/name
   valuepool.name/value
/property

property
  namepool.name/name
  value${mapreduce.job.group.name}/value
/property

which means that the default pool will be the group of the user that has 
submitted the job. In your case I think that allocations.xml is correct. If you 
want to explicitly define a job to specific pool from your allocation.xml file 
you can define it as follows:

Configuration conf3 = conf;
conf3.set(pool.name, pool3); // conf.set(propriety.name, value)

Let me know if it works..


On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote:

 How can I set the fair scheduler such that all jobs submitted from a
 particular user group go to a pool with the group name?

 I have setup fair scheduler and I have two users: A and B (belonging
 to the user group hadoop)

 When these users submit hadoop jobs, the jobs from A got to a pool
 named A and the jobs from B go to a pool named B.
  I want them to go to a pool with their group name, So I tried adding
 the following to mapred-site.xml:

 property
  namemapred.fairscheduler.poolnameproperty/name
 valuegroup.name/value
 /property

 But instead the jobs now go to the default pool.
 I want the jobs submitted by A and B to go to the pool named hadoop.
 How do I do that?
 also how can I explicity set a job to any specified pool?

 I have set the allocation file (fair-scheduler.xml) like this:

 allocations
  pool name=hadoop
minMaps1/minMaps
minReduces1/minReduces
maxMaps3/maxMaps
maxReduces3/maxReduces
  /pool
  userMaxJobsDefault5/userMaxJobsDefault
 /allocations

 Any help is greatly appreciated.
 Thanks,
 Austin


The information contained in this email message is considered confidential and 
proprietary to the sender and is intended solely for review and use by the 
named recipient. Any unauthorized review, use or distribution is strictly 
prohibited. If you have received this message in error, please advise the 
sender by reply email and delete the message.


Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Austin Chungath
Thanks,
I will be trying the suggestions and will get back to you soon.

On Thu, Mar 1, 2012 at 8:09 PM, Dave Shine 
dave.sh...@channelintelligence.com wrote:

 I've just started playing with the Fair Scheduler.  To specify the pool at
 job submission time you set the mapred.fairscheduler.pool property on the
 Job Conf to the name of the pool you want the job to use.

 Dave


 -Original Message-
 From: Merto Mertek [mailto:masmer...@gmail.com]
 Sent: Thursday, March 01, 2012 9:33 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool

 From the fairscheduler docs I assume the following should work:

 property
  namemapred.fairscheduler.poolnameproperty/name
   valuepool.name/value
 /property

 property
  namepool.name/name
  value${mapreduce.job.group.name}/value
 /property

 which means that the default pool will be the group of the user that has
 submitted the job. In your case I think that allocations.xml is correct. If
 you want to explicitly define a job to specific pool from your
 allocation.xml file you can define it as follows:

 Configuration conf3 = conf;
 conf3.set(pool.name, pool3); // conf.set(propriety.name, value)

 Let me know if it works..


 On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote:

  How can I set the fair scheduler such that all jobs submitted from a
  particular user group go to a pool with the group name?
 
  I have setup fair scheduler and I have two users: A and B (belonging
  to the user group hadoop)
 
  When these users submit hadoop jobs, the jobs from A got to a pool
  named A and the jobs from B go to a pool named B.
   I want them to go to a pool with their group name, So I tried adding
  the following to mapred-site.xml:
 
  property
   namemapred.fairscheduler.poolnameproperty/name
  valuegroup.name/value
  /property
 
  But instead the jobs now go to the default pool.
  I want the jobs submitted by A and B to go to the pool named hadoop.
  How do I do that?
  also how can I explicity set a job to any specified pool?
 
  I have set the allocation file (fair-scheduler.xml) like this:
 
  allocations
   pool name=hadoop
 minMaps1/minMaps
 minReduces1/minReduces
 maxMaps3/maxMaps
 maxReduces3/maxReduces
   /pool
   userMaxJobsDefault5/userMaxJobsDefault
  /allocations
 
  Any help is greatly appreciated.
  Thanks,
  Austin
 

 The information contained in this email message is considered confidential
 and proprietary to the sender and is intended solely for review and use by
 the named recipient. Any unauthorized review, use or distribution is
 strictly prohibited. If you have received this message in error, please
 advise the sender by reply email and delete the message.



Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Austin Chungath
Hi,
I tried what you had said. I added the following to mapred-site.xml:


<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

Funny enough, it created a pool with the literal name ${mapreduce.job.group.name},
so I tried ${mapred.job.group.name} and ${group.name}, all to the same
effect.

But when I used ${user.name} it worked, and it created a pool with the user
name.



On Thu, Mar 1, 2012 at 8:03 PM, Merto Mertek masmer...@gmail.com wrote:

 From the fairscheduler docs I assume the following should work:

 property
  namemapred.fairscheduler.poolnameproperty/name
   valuepool.name/value
 /property

 property
  namepool.name/name
  value${mapreduce.job.group.name}/value
 /property

 which means that the default pool will be the group of the user that has
 submitted the job. In your case I think that allocations.xml is correct. If
 you want to explicitly define a job to specific pool from your
 allocation.xml file you can define it as follows:

 Configuration conf3 = conf;
 conf3.set(pool.name, pool3); // conf.set(propriety.name, value)

 Let me know if it works..


 On 29 February 2012 14:18, Austin Chungath austi...@gmail.com wrote:

  How can I set the fair scheduler such that all jobs submitted from a
  particular user group go to a pool with the group name?
 
  I have setup fair scheduler and I have two users: A and B (belonging to
 the
  user group hadoop)
 
  When these users submit hadoop jobs, the jobs from A got to a pool named
 A
  and the jobs from B go to a pool named B.
   I want them to go to a pool with their group name, So I tried adding the
  following to mapred-site.xml:
 
  property
   namemapred.fairscheduler.poolnameproperty/name
  valuegroup.name/value
  /property
 
  But instead the jobs now go to the default pool.
  I want the jobs submitted by A and B to go to the pool named hadoop.
 How
  do I do that?
  also how can I explicity set a job to any specified pool?
 
  I have set the allocation file (fair-scheduler.xml) like this:
 
  allocations
   pool name=hadoop
 minMaps1/minMaps
 minReduces1/minReduces
 maxMaps3/maxMaps
 maxReduces3/maxReduces
   /pool
   userMaxJobsDefault5/userMaxJobsDefault
  /allocations
 
  Any help is greatly appreciated.
  Thanks,
  Austin
 



kill -QUIT

2012-03-01 Thread Mohit Anchlia
When I try kill -QUIT for a job it doesn't send the stack trace to the log
files. Does anyone know why, or whether I am doing something wrong?

I find the job using ps -ef | grep attempt. I then go to
logs/userLogs/jobid/attemptid/


High quality hadoop logo?

2012-03-01 Thread Keith Wiley
Is there a high quality version of the hadoop logo anywhere?  Even the graphic 
presented on the Apache page itself suffers from dreadful jpeg artifacting.  A 
google image search didn't inspire much hope on this issue (they all have the 
same low-quality jpeg appearance).  I'm looking for good graphics for slides, 
presentations, publications, etc.

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com

You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley




Re: High quality hadoop logo?

2012-03-01 Thread Keith Wiley
Sorry, false alarm.  I was looking at the popup thumbnails in google image 
search.  If I click all the way through, there are some high quality versions 
available.  Why is the version on the Apache site (and the Wikipedia page) so 
poor?

On Mar 1, 2012, at 14:09 , Keith Wiley wrote:

 Is there a high quality version of the hadoop logo anywhere?  Even the 
 graphic presented on the Apache page itself suffers from dreadful jpeg 
 artifacting.  A google image search didn't inspire much hope on this issue 
 (they all have the same low-quality jpeg appearance).  I'm looking for good 
 graphics for slides, presentations, publications, etc.
 
 Thanks.
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 You can scratch an itch, but you can't itch a scratch. Furthermore, an itch 
 can
 itch but a scratch can't scratch. Finally, a scratch can itch, but an itch 
 can't
 scratch. All together this implies: He scratched the itch from the scratch 
 that
 itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley
 
 



Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com

Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy.
   --  Edwin A. Abbott, Flatland




Re: High quality hadoop logo?

2012-03-01 Thread Owen O'Malley
On Thu, Mar 1, 2012 at 2:14 PM, Keith Wiley kwi...@keithwiley.com wrote:
 Sorry, false alarm.  I was looking at the popup thumbnails in google image 
 search.  If I click all the way through, there are some high quality
 versions available.  Why is the version on the Apache site (and the Wikipedia 
 page) so poor?

The high resolution images are in subversion:

http://svn.apache.org/repos/asf/hadoop/logos/

-- Owen


Re: Streaming Hadoop using C

2012-03-01 Thread Mark question
Starfish worked great for wordcount .. I didn't run it on my application
because I have only map tasks.

Mark

On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl charles.ce...@gmail.comwrote:

 How was your experience of starfish?
 C
 On Mar 1, 2012, at 12:35 AM, Mark question wrote:

  Thank you for your time and suggestions, I've already tried starfish, but
  not jmap. I'll check it out.
  Thanks again,
  Mark
 
  On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.com
 wrote:
 
  I assume you have also just tried running locally and using the jdk
  performance tools (e.g. jmap) to gain insight by configuring hadoop to
 run
  absolute minimum number of tasks?
  Perhaps the discussion
 
 
 http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
  might be relevant?
  On Feb 29, 2012, at 3:53 PM, Mark question wrote:
 
  I've used hadoop profiling (.prof) to show the stack trace but it was
  hard
  to follow. jConsole locally since I couldn't find a way to set a port
  number to child processes when running them remotely. Linux commands
  (top,/proc), showed me that the virtual memory is almost twice as my
  physical which means swapping is happening which is what I'm trying to
  avoid.
 
  So basically, is there a way to assign a port to child processes to
  monitor
  them remotely (asked before by Xun) or would you recommend another
  monitoring tool?
 
  Thank you,
  Mark
 
 
  On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl 
 charles.ce...@gmail.com
  wrote:
 
  Mark,
  So if I understand, it is more the memory management that you are
  interested in, rather than a need to run an existing C or C++
  application
  in MapReduce platform?
  Have you done profiling of the application?
  C
  On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
  Thanks Charles .. I'm running Hadoop for research to perform
 duplicate
  detection methods. To go deeper, I need to understand what's slowing
 my
  program, which usually starts with analyzing memory to predict best
  input
  size for map task. So you're saying piping can help me control memory
  even
  though it's running on VM eventually?
 
  Thanks,
  Mark
 
  On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl 
  charles.ce...@gmail.com
  wrote:
 
  Mark,
  Both streaming and pipes allow this, perhaps more so pipes at the
  level
  of
  the mapreduce task. Can you provide more details on the application?
  On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
  Hi guys, thought I should ask this before I use it ... will using C
  over
  Hadoop give me the usual C memory management? For example,
 malloc() ,
  sizeof() ? My guess is no since this all will eventually be turned
  into
  bytecode, but I need more control on memory which obviously is hard
  for
  me
  to do with Java.
 
  Let me know of any advantages you know about streaming in C over
  hadoop.
  Thank you,
  Mark
 
 
 
 
 
 




Re: High quality hadoop logo?

2012-03-01 Thread Keith Wiley
Excellent!

Thank you.

Sent from my phone, please excuse my brevity.
Keith Wiley, kwi...@keithwiley.com, http://keithwiley.com


Owen O'Malley omal...@apache.org wrote:

On Thu, Mar 1, 2012 at 2:14 PM, Keith Wiley kwi...@keithwiley.com wrote:
 Sorry, false alarm.  I was looking at the popup thumbnails in google image 
 search.  If I click all the way through, there are some high quality
 versions available.  Why is the version on the Apache site (and the Wikipedia 
 page) so poor?

The high resolution images are in subversion:

http://svn.apache.org/repos/asf/hadoop/logos/

-- Owen



Adding nodes

2012-03-01 Thread Mohit Anchlia
Is this the right procedure to add nodes? I took some from hadoop wiki FAQ:

http://wiki.apache.org/hadoop/FAQ

1. Update conf/slaves
2. On the slave nodes, start the datanode and tasktracker
3. Run hadoop balancer

Do I also need to run dfsadmin -refreshNodes?


Re: Adding nodes

2012-03-01 Thread Joey Echeverria
You only have to refresh nodes if you're making use of an allows file. 

Sent from my iPhone

On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:

 Is this the right procedure to add nodes? I took some from hadoop wiki FAQ:
 
 http://wiki.apache.org/hadoop/FAQ
 
 1. Update conf/slave
 2. on the slave nodes start datanode and tasktracker
 3. hadoop balancer
 
 Do I also need to run dfsadmin -refreshnodes?


Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote:

 You only have to refresh nodes if you're making use of an allows file.

 Thanks. Does it mean that when the tasktracker/datanode starts up it
communicates with the namenode using the masters file?

Sent from my iPhone

 On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:

  Is this the right procedure to add nodes? I took some from hadoop wiki
 FAQ:
 
  http://wiki.apache.org/hadoop/FAQ
 
  1. Update conf/slave
  2. on the slave nodes start datanode and tasktracker
  3. hadoop balancer
 
  Do I also need to run dfsadmin -refreshnodes?



Re: Adding nodes

2012-03-01 Thread Joey Echeverria
Not quite. Datanodes get the namenode host from fs.default.name in 
core-site.xml. Task trackers find the job tracker from the mapred.job.tracker 
setting in mapred-site.xml. 

Sent from my iPhone

On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:

 On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote:
 
 You only have to refresh nodes if you're making use of an allows file.
 
 Thanks does it mean that when tasktracker/datanode starts up it
 communicates with namenode using master file?
 
 Sent from my iPhone
 
 On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:
 
 Is this the right procedure to add nodes? I took some from hadoop wiki
 FAQ:
 
 http://wiki.apache.org/hadoop/FAQ
 
 1. Update conf/slave
 2. on the slave nodes start datanode and tasktracker
 3. hadoop balancer
 
 Do I also need to run dfsadmin -refreshnodes?
 


Re: Adding nodes

2012-03-01 Thread Raj Vishwanathan
The masters and slaves files, if I remember correctly, are used to start the 
correct daemons on the correct nodes from the master node.


Raj



 From: Joey Echeverria j...@cloudera.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org 
Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org 
Sent: Thursday, March 1, 2012 4:57 PM
Subject: Re: Adding nodes
 
Not quite. Datanodes get the namenode host from fs.defalt.name in 
core-site.xml. Task trackers find the job tracker from the mapred.job.tracker 
setting in mapred-site.xml. 

Sent from my iPhone

On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:

 On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote:
 
 You only have to refresh nodes if you're making use of an allows file.
 
 Thanks does it mean that when tasktracker/datanode starts up it
 communicates with namenode using master file?
 
 Sent from my iPhone
 
 On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:
 
 Is this the right procedure to add nodes? I took some from hadoop wiki
 FAQ:
 
 http://wiki.apache.org/hadoop/FAQ
 
 1. Update conf/slave
 2. on the slave nodes start datanode and tasktracker
 3. hadoop balancer
 
 Do I also need to run dfsadmin -refreshnodes?
 




Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote:

 Not quite. Datanodes get the namenode host from fs.defalt.name in
 core-site.xml. Task trackers find the job tracker from the
 mapred.job.tracker setting in mapred-site.xml.


I actually meant to ask: how does the namenode/jobtracker know there is a new
node in the cluster? Is it initiated by the namenode when the slaves file is
edited, or by the tasktracker when the tasktracker is started?


 Sent from my iPhone

 On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:

  On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com
 wrote:
 
  You only have to refresh nodes if you're making use of an allows file.
 
  Thanks does it mean that when tasktracker/datanode starts up it
  communicates with namenode using master file?
 
  Sent from my iPhone
 
  On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:
 
  Is this the right procedure to add nodes? I took some from hadoop wiki
  FAQ:
 
  http://wiki.apache.org/hadoop/FAQ
 
  1. Update conf/slave
  2. on the slave nodes start datanode and tasktracker
  3. hadoop balancer
 
  Do I also need to run dfsadmin -refreshnodes?
 



Re: Adding nodes

2012-03-01 Thread anil gupta
What Joey said is correct for Cloudera's distribution. I am not confident
about the same for other distributions, as I haven't tried them.

Thanks,
Anil

On Thu, Mar 1, 2012 at 5:10 PM, Raj Vishwanathan rajv...@yahoo.com wrote:

 The master and slave files, if I remember correctly are used to start the
 correct daemons on the correct nodes from the master node.


 Raj


 
  From: Joey Echeverria j...@cloudera.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Thursday, March 1, 2012 4:57 PM
 Subject: Re: Adding nodes
 
 Not quite. Datanodes get the namenode host from fs.defalt.name in
 core-site.xml. Task trackers find the job tracker from the
 mapred.job.tracker setting in mapred-site.xml.
 
 Sent from my iPhone
 
 On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:
 
  On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com
 wrote:
 
  You only have to refresh nodes if you're making use of an allows file.
 
  Thanks does it mean that when tasktracker/datanode starts up it
  communicates with namenode using master file?
 
  Sent from my iPhone
 
  On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  Is this the right procedure to add nodes? I took some from hadoop wiki
  FAQ:
 
  http://wiki.apache.org/hadoop/FAQ
 
  1. Update conf/slave
  2. on the slave nodes start datanode and tasktracker
  3. hadoop balancer
 
  Do I also need to run dfsadmin -refreshnodes?
 
 
 
 




-- 
Thanks  Regards,
Anil Gupta


Re: Adding nodes

2012-03-01 Thread Raj Vishwanathan
What Joey said is correct for both the Apache and Cloudera distros. The DN/TT
daemons will connect to the NN/JT using the config files. The masters and slaves
files are used for starting the correct daemons.




 From: anil gupta anilg...@buffalo.edu
To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com 
Sent: Thursday, March 1, 2012 5:42 PM
Subject: Re: Adding nodes
 
Whatever Joey said is correct for Cloudera's distribution. For same, I am
not confident about other distribution as i haven't tried them.

Thanks,
Anil

On Thu, Mar 1, 2012 at 5:10 PM, Raj Vishwanathan rajv...@yahoo.com wrote:

 The master and slave files, if I remember correctly are used to start the
 correct daemons on the correct nodes from the master node.


 Raj


 
  From: Joey Echeverria j...@cloudera.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Thursday, March 1, 2012 4:57 PM
 Subject: Re: Adding nodes
 
 Not quite. Datanodes get the namenode host from fs.defalt.name in
 core-site.xml. Task trackers find the job tracker from the
 mapred.job.tracker setting in mapred-site.xml.
 
 Sent from my iPhone
 
 On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:
 
  On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com
 wrote:
 
  You only have to refresh nodes if you're making use of an allows file.
 
  Thanks does it mean that when tasktracker/datanode starts up it
  communicates with namenode using master file?
 
  Sent from my iPhone
 
  On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  Is this the right procedure to add nodes? I took some from hadoop wiki
  FAQ:
 
  http://wiki.apache.org/hadoop/FAQ
 
  1. Update conf/slave
  2. on the slave nodes start datanode and tasktracker
  3. hadoop balancer
 
  Do I also need to run dfsadmin -refreshnodes?
 
 
 
 




-- 
Thanks  Regards,
Anil Gupta




Re: Adding nodes

2012-03-01 Thread Arpit Gupta
It is initiated by the slave.

If you have defined files to state which slaves can talk to the namenode
(using config dfs.hosts) and which hosts cannot (using property
dfs.hosts.exclude) then you would need to edit these files and issue the
refresh command.

On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote:

 On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote:

  Not quite. Datanodes get the namenode host from fs.defalt.name in
  core-site.xml. Task trackers find the job tracker from the
  mapred.job.tracker setting in mapred-site.xml.

 I actually meant to ask how does namenode/jobtracker know there is a new
 node in the cluster. Is it initiated by namenode when slave file is edited?
 Or is it initiated by tasktracker when tasktracker is started?

  Sent from my iPhone

  On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:

   On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote:

    You only have to refresh nodes if you're making use of an allows file.

   Thanks does it mean that when tasktracker/datanode starts up it
   communicates with namenode using master file?

    Sent from my iPhone

    On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:

     Is this the right procedure to add nodes? I took some from hadoop wiki
     FAQ:

     http://wiki.apache.org/hadoop/FAQ

     1. Update conf/slave
     2. on the slave nodes start datanode and tasktracker
     3. hadoop balancer

     Do I also need to run dfsadmin -refreshnodes?

--
Arpit
Hortonworks, Inc.
email: ar...@hortonworks.com



Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
Thanks all for the answers!!

On Thu, Mar 1, 2012 at 5:52 PM, Arpit Gupta ar...@hortonworks.com wrote:

 It is initiated by the slave.

 If you have defined files to state which slaves can talk to the namenode
 (using config dfs.hosts) and which hosts cannot (using
 property dfs.hosts.exclude) then you would need to edit these files and
 issue the refresh command.


  On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote:

  On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com
 wrote:

 Not quite. Datanodes get the namenode host from fs.defalt.name in

 core-site.xml. Task trackers find the job tracker from the

 mapred.job.tracker setting in mapred-site.xml.



 I actually meant to ask how does namenode/jobtracker know there is a new
 node in the cluster. Is it initiated by namenode when slave file is edited?
 Or is it initiated by tasktracker when tasktracker is started?


 Sent from my iPhone


 On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:


  On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com

 wrote:


  You only have to refresh nodes if you're making use of an allows file.


  Thanks does it mean that when tasktracker/datanode starts up it

  communicates with namenode using master file?


  Sent from my iPhone


  On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:


   Is this the right procedure to add nodes? I took some from hadoop wiki

  FAQ:


   http://wiki.apache.org/hadoop/FAQ


   1. Update conf/slave

   2. on the slave nodes start datanode and tasktracker

   3. hadoop balancer


   Do I also need to run dfsadmin -refreshnodes?





 --
 Arpit
 Hortonworks, Inc.
 email: ar...@hortonworks.com

 http://www.hadoopsummit.org/
  http://www.hadoopsummit.org/
 http://www.hadoopsummit.org/



Re: Adding nodes

2012-03-01 Thread George Datskos

Mohit,

New datanodes will connect to the namenode, so that's how the namenode 
knows.  Just make sure the datanodes have the correct {fs.default.dir} 
in their hdfs-site.xml and then start them.  The namenode can, however, 
choose to reject a datanode if you are using the {dfs.hosts} and 
{dfs.hosts.exclude} settings in the namenode's hdfs-site.xml.


The namenode doesn't actually care about the slaves file.  It's only 
used by the start/stop scripts.



On 2012/03/02 10:35, Mohit Anchlia wrote:

I actually meant to ask how does namenode/jobtracker know there is a new
node in the cluster. Is it initiated by namenode when slave file is edited?
Or is it initiated by tasktracker when tasktracker is started?






Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
Tried, but I am still getting the error with 0.4.15. Really lost with this.
My hadoop release is 0.20.2 from more than a year ago. Could this be related
to the problem?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792484.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Harsh J
Marc,

Was the lzo libs on your server upgraded to a higher version recently?

Also, when you deployed a built copy of 0.4.15, did you ensure you
replaced the older native libs for hadoop-lzo as well?

On Fri, Mar 2, 2012 at 9:05 AM, Marc Sturlese marc.sturl...@gmail.com wrote:
 Tried but still getting the error 0.4.15. Really lost with this.
 My hadoop release is 0.20.2 from more than a year ago. Could this be related
 to the problem?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792484.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



-- 
Harsh J


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
Yes, the steps I followed were:
1- Install lzo 2.06 on a machine with the same kernel as my nodes.
2- Compile hadoop-lzo 0.4.15 there (in /lib I replaced the cdh3u3 jar with my
Hadoop 0.20.2 release).
3- Replace hadoop-lzo-0.4.9.jar with the newly compiled hadoop-lzo-0.4.15.jar in
the hadoop lib directory of all my nodes and the master.
4- Put the generated native files in the native lib directory of all the nodes
and the master.
5- In my job jar, replace the library hadoop-lzo-0.4.9.jar with
hadoop-lzo-0.4.15.jar.

And sometimes when a job is running I get this (4 times, so the job gets killed):

...org.apache.hadoop.mapred.ReduceTask: Shuffling 3188320 bytes (1025174 raw
bytes) into RAM from attempt_201202291221_1501_m_000480_0
2012-03-02 02:32:55,496 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201202291221_1501_r_000105_0: Failed fetch #1 from
attempt_201202291221_1501_m_46_0
2012-03-02 02:32:55,496 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201202291221_1501_r_000105_0 adding host hadoop-01.backend to
penalty box, next contact in 4 seconds
2012-03-02 02:32:55,496 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201202291221_1501_r_000105_0: Got 1 map-outputs from previous
failures
2012-03-02 02:32:55,497 FATAL org.apache.hadoop.mapred.TaskRunner:
attempt_201202291221_1501_r_000105_0 : Map output copy failure :
java.lang.InternalError: lzo1x_decompress returned: -8
at 
com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native
Method)
at
com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:305)
at
org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:76)
at
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1553)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1432)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1285)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1216)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792505.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
I used to have 2.05, but now, as I said, I have installed 2.06.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792511.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Joey Echeverria
I know this doesn't fix lzo, but have you considered Snappy for the
intermediate output compression? It gets similar compression ratios
and compress/decompress speed, but arguably has better Hadoop
integration.
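
For reference, switching the intermediate (map output) compression to Snappy
would look roughly like this; the property names are the 0.20/1.x ones, and it
assumes a Hadoop build that actually ships SnappyCodec with the native Snappy
library installed:

import org.apache.hadoop.mapred.JobConf;

public class SnappyMapOutput {
  public static void main(String[] args) {
    JobConf job = new JobConf(SnappyMapOutput.class);
    // Compress intermediate map output with Snappy instead of LZO.
    job.setBoolean("mapred.compress.map.output", true);
    job.set("mapred.map.output.compression.codec",
            "org.apache.hadoop.io.compress.SnappyCodec");
    // ... configure mapper/reducer and submit the job as usual ...
  }
}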

-Joey

On Thu, Mar 1, 2012 at 10:01 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
 I use to have 2.05 but now as I said I installed 2.06

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792511.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
Absolutely. In case I don't find the root of the problem soon I'll definitely
try it.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792531.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Subir S
Hello Folks,

Are there any pointers to such comparisons between Apache Pig and Hadoop
Streaming Map Reduce jobs?

Also, there was a claim in our company that Pig performs better than Map
Reduce jobs. Is this true? Are there any such benchmarks available?

Thanks, Subir


Re: Where Is DataJoinMapperBase?

2012-03-01 Thread madhu phatak
Hi,
 Please look inside the $HADOOP_HOME/contrib/datajoin folder of the 0.20.2
release. You will find the jar there.

On Sat, Feb 11, 2012 at 1:09 AM, Bing Li lbl...@gmail.com wrote:

 Hi, all,

 I am starting to learn advanced Map/Reduce. However, I cannot find the
 class DataJoinMapperBase in my downloaded Hadoop 1.0.0 and 0.20.2. So I
 searched on the Web and get the following link.

 http://www.java2s.com/Code/Jar/h/Downloadhadoop0201datajoinjar.htm

 From the link I got the package, hadoop-0.20.1-datajoin.jar. My question is
 why the package is not included in Hadoop 1.0.0 and 0.20.2. Is this the correct
 way to get it?

 Thanks so much!

 Best regards,
 Bing




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: DFSIO

2012-03-01 Thread madhu phatak
Hi,
 Only HDFS should be enough.

On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do than...@cs.wisc.edu wrote:

 hi all,

 in order to run DFSIO in my cluster,
 do i need to run JobTracker, and TaskTracker,
 or just running HDFS is enough?

 Many thanks,
 Thanh




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: DFSIO

2012-03-01 Thread Harsh J
Madhu,

That is incorrect. TestDFSIO is a MapReduce job and you need HDFS+MR
setup to use it.

On Fri, Mar 2, 2012 at 11:07 AM, madhu phatak phatak@gmail.com wrote:
 Hi,
  Only HDFS should be enough.

 On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do than...@cs.wisc.edu wrote:

 hi all,

 in order to run DFSIO in my cluster,
 do i need to run JobTracker, and TaskTracker,
 or just running HDFS is enough?

 Many thanks,
 Thanh




 --
 Join me at http://hadoopworkshop.eventbrite.com/



-- 
Harsh J


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Jie Li
Considering Pig essentially translates scripts into Map Reduce jobs, one
can always write Map Reduce jobs that are at least as good as the ones Pig
generates. You can refer to the Pig experience paper to see the overhead Pig
introduces, though it has been improving all the time.

By the way, if you really care about performance, how you configure Hadoop and
Pig also plays an important role.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish

On Thu, Mar 1, 2012 at 11:48 PM, Subir S subir.sasiku...@gmail.com wrote:

 Hello Folks,

 Are there any pointers to such comparisons between Apache Pig and Hadoop
 Streaming Map Reduce jobs?

 Also there was a claim in our company that Pig performs better than Map
 Reduce jobs? Is this true? Are there any such benchmarks available

 Thanks, Subir



Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Harsh J
On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com wrote:
 Hello Folks,

 Are there any pointers to such comparisons between Apache Pig and Hadoop
 Streaming Map Reduce jobs?

I do not see why you seek to compare these two. Pig offers a language
that lets you write data-flow operations and runs these statements as
a series of MR jobs for you automatically (making it a great tool for
getting data processing done really quickly, without bothering with
code), while streaming is something you use to write non-Java, simple
MR jobs. Both have their own purposes.

 Also there was a claim in our company that Pig performs better than Map
 Reduce jobs? Is this true? Are there any such benchmarks available

Pig _runs_ MR jobs. It does do job design (and some data)
optimizations based on your queries, which is what may give it an edge
over designing elaborate flows of plain MR jobs with tools like
Oozie/JobControl (which take more time to build). But regardless, Pig
only makes it easy to do the same thing with Pig Latin statements for
you.

-- 
Harsh J


Re: DFSIO

2012-03-01 Thread madhu phatak
Hi Harsh,
 Sorry, I read DFSIO as DFS Input/Output, which I thought meant reading and
writing using the HDFS API :)

On Fri, Mar 2, 2012 at 12:32 PM, Harsh J ha...@cloudera.com wrote:

 Madhu,

 That is incorrect. TestDFSIO is a MapReduce job and you need HDFS+MR
 setup to use it.

 On Fri, Mar 2, 2012 at 11:07 AM, madhu phatak phatak@gmail.com
 wrote:
  Hi,
   Only HDFS should be enough.
 
  On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do than...@cs.wisc.edu wrote:
 
  hi all,
 
  in order to run DFSIO in my cluster,
  do i need to run JobTracker, and TaskTracker,
  or just running HDFS is enough?
 
  Many thanks,
  Thanh
 
 
 
 
  --
  Join me at http://hadoopworkshop.eventbrite.com/



 --
 Harsh J




-- 
Join me at http://hadoopworkshop.eventbrite.com/