Re: Ways to improve job cleanup speed

2012-02-24 Thread Sameer Farooqui
Hi Tanya,

What version of Hadoop are you running?

Is this a 1-node cluster running in pseudo-distributed mode with 1 physical
spinning hard drive?

How much intermediate data is being emitted from the Map phase?

How many mappers and reducers total is the job running?
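
If you end up experimenting with the reducer count, here is a minimal sketch
(class name, paths, and values are placeholders, not taken from your job) of
setting it per job with the 0.20-era API. One caveat: in 0.20 the job cleanup
runs as a single cleanup task, so extra reducers will not necessarily shorten
it.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CleanupTimingTest {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CleanupTimingTest.class);
    conf.setJobName("cleanup-timing-test");
    // Placeholder value; the old API falls back to the identity
    // mapper/reducer when none are set, which is enough for a timing test.
    conf.setNumReduceTasks(8);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // Blocks until the whole job, including its single cleanup task,
    // finishes, so the cleanup wall-clock time is easy to observe.
    JobClient.runJob(conf);
  }
}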


--
Sameer Farooqui
Systems Architect / Hortonworks




On Thu, Feb 23, 2012 at 7:08 AM, tanyasch ta...@tickel.net wrote:


 Hi, I'm running a job that completes in about 90 seconds, but takes about
 10-15 minutes to run cleanup. I'm looking for ways to affect or even
 monitor the cleanup time. I'd even like advice about whether this is more
 of a setup issue (like where I'm storing files, with Accumulo and Hadoop
 temporary and log files all writing to the same disk because our cluster is
 tiny), a job issue (can I throw more reducers at it? the brief
 description of the OutputCommitter says it uses available reducers for
 cleanup), or a programming issue (in that case I'd post a different
 question).

 Basically, I want to know whether the first way to go at this is
 reconfiguring the cluster, or whether I should be programming my way out of
 this. Thanks.

 --
 View this message in context:
 http://old.nabble.com/Ways-to-improve-job-cleanup-speed-tp33377374p33377374.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Streaming job hanging

2012-02-24 Thread Sameer Farooqui
Hi Mohit,

Can you provide some more info about the job you're trying to run? What
version of Hadoop are you using? What language is the Hadoop streaming job
written in? Have you been able to run any Hadoop streaming
jobs successfully in this cluster? I'm wondering whether all Hadoop streaming
jobs fail, or just this one.

Instead of running this on a file with possibly 551 blocks, can you try to
run it on a small file with like 1 or 2 blocks and see if it runs
successfully?

When I ran a Hadoop streaming job with Python, on a few small files (1-2
MB), the job ran pretty quickly in 77 seconds (for the Map+Reduce phases):


packageJobJar: [/home/hduser/mapper.py, /home/hduser/reducer.py,
/mnt/hadoop/tmp/hadoop-unjar5368493284653516019/] []
/tmp/streamjob8122180536767888261.jar tmpDir=null
11/09/06 23:38:04 INFO mapred.FileInputFormat: Total input paths to process
: 3
11/09/06 23:38:05 INFO streaming.StreamJob: getLocalDirs():
[/mnt/hadoop/tmp/mapred/local]
11/09/06 23:38:05 INFO streaming.StreamJob: Running job:
job_201109062238_0001
11/09/06 23:38:05 INFO streaming.StreamJob: To kill this job, run:
11/09/06 23:38:05 INFO streaming.StreamJob:
/usr/local/hadoop/bin/../bin/hadoop job
 -Dmapred.job.tracker=localhost:54311 -kill job_201109062238_0001
11/09/06 23:38:05 INFO streaming.StreamJob: Tracking URL:
http://localhost:50030/jobdetails.jsp?jobid=job_201109062238_0001
11/09/06 23:38:06 INFO streaming.StreamJob:  map 0%  reduce 0%
11/09/06 23:38:26 INFO streaming.StreamJob:  map 32%  reduce 0%
11/09/06 23:38:29 INFO streaming.StreamJob:  map 39%  reduce 0%
11/09/06 23:38:32 INFO streaming.StreamJob:  map 48%  reduce 0%
11/09/06 23:38:35 INFO streaming.StreamJob:  map 50%  reduce 0%
11/09/06 23:38:50 INFO streaming.StreamJob:  map 75%  reduce 0%
11/09/06 23:38:53 INFO streaming.StreamJob:  map 100%  reduce 0%
11/09/06 23:38:56 INFO streaming.StreamJob:  map 100%  reduce 17%
11/09/06 23:39:08 INFO streaming.StreamJob:  map 100%  reduce 67%
11/09/06 23:39:12 INFO streaming.StreamJob:  map 100%  reduce 76%
11/09/06 23:39:14 INFO streaming.StreamJob:  map 100%  reduce 86%
11/09/06 23:39:17 INFO streaming.StreamJob:  map 100%  reduce 96%
11/09/06 23:39:23 INFO streaming.StreamJob:  map 100%  reduce 100%
11/09/06 23:39:29 INFO streaming.StreamJob: Job complete:
job_201109062238_0001
11/09/06 23:39:29 INFO streaming.StreamJob: Output:
/hduser/wordcount_python-output


--
Sameer Farooqui
Systems Architect / Hortonworks




On Wed, Feb 22, 2012 at 8:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 Streaming job just seems to be hanging

 12/02/22 17:35:50 INFO streaming.StreamJob: map 0% reduce 0%

 -

 On the admin page I see that it created 551 input splits. Could someone
 suggest a way to find out what might be causing it to hang? I increased
 io.sort.mb to 200 MB.

 I am using 5 data nodes with 12 CPUs and 96 GB RAM.



BZip2 Splittable?

2012-02-24 Thread Daniel Baptista
Hi All,

I have a cluster of 6 datanodes, all running Hadoop version 0.20.2 (r911707),
that takes a series of bzip2-compressed text files as input.

I have read conflicting articles regarding whether or not Hadoop can split
these bzip2 files; can anyone give me a definite answer?

Thanks in advance, Dan.


Re: BZip2 Splittable?

2012-02-24 Thread Rohit Bakhshi
Hi Daniel, 

Bzip2 compression codec allows for splittable files.

According to this Hadoop JIRA improvement, splitting of bzip2 compressed files 
in Hadoop jobs is supported:
https://issues.apache.org/jira/browse/HADOOP-4012
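
If you want to verify this from code, here is a minimal sketch. It assumes a
0.21+ Hadoop on the classpath, where HADOOP-4012 introduced the
SplittableCompressionCodec interface that BZip2Codec implements:

import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class BZip2SplitCheck {
  public static void main(String[] args) {
    CompressionCodec codec = new BZip2Codec();
    // Prints true on 0.21+; on 0.20.2 the interface does not exist at all.
    System.out.println("bzip2 splittable: "
        + (codec instanceof SplittableCompressionCodec));
  }
}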

-- 
Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)




On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote:

 Hi All,
 
 I have a cluster of 6 datanodes, all running hadoop version 0.20.2, r911707 
 that take a series of bzip2 compressed text files as input.
 
 I have read conflicting articles regarding whether or not hadoop can split 
 these bzip2 files, can anyone give me a definite answer?
 
 Thanks in advance, Dan.



Re: BZip2 Splittable?

2012-02-24 Thread Rohit Bakhshi
Daniel, 

I just noticed your Hadoop version - 0.20.2.

The JIRA fix below is for Hadoop 0.21.0, which is a different version. So it 
may not be supported on your version of Hadoop. 

-- 
Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)




On Friday, February 24, 2012 at 7:49 AM, Rohit Bakhshi wrote:

 Hi Daniel, 
 
 Bzip2 compression codec allows for splittable files.
 
 According to this Hadoop JIRA improvement, splitting of bzip2 compressed 
 files in Hadoop jobs is supported:
 https://issues.apache.org/jira/browse/HADOOP-4012
 
 -- 
 Rohit Bakhshi
 www.hortonworks.com (http://www.hortonworks.com/)
 
 
 
 
 On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote:
 
  Hi All,
  
  I have a cluster of 6 datanodes, all running hadoop version 0.20.2, r911707 
  that take a series of bzip2 compressed text files as input.
  
  I have read conflicting articles regarding whether or not hadoop can split 
  these bzip2 files, can anyone give me a definite answer?
  
  Thanks in advance, Dan.
 



RE: BZip2 Splittable?

2012-02-24 Thread Daniel Baptista
Hi Rohit, thanks for the response, this is pretty much as I expected and 
hopefully adds weight to my other thoughts...

Could this mean that all my datanodes are being sent all of the data, or that
only one datanode is executing the job?

Thanks again, Dan.

-Original Message-
From: Rohit Bakhshi [mailto:ro...@hortonworks.com] 
Sent: 24 February 2012 15:54
To: common-user@hadoop.apache.org
Subject: Re: BZip2 Splittable?

Daniel, 

I just noticed your Hadoop version - 0.20.2.

The JIRA fix below is for Hadoop 0.21.0, which is a different version. So it 
may not be supported on your version of Hadoop. 

-- 
Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)




On Friday, February 24, 2012 at 7:49 AM, Rohit Bakhshi wrote:

 Hi Daniel, 
 
 Bzip2 compression codec allows for splittable files.
 
 According to this Hadoop JIRA improvement, splitting of bzip2 compressed 
 files in Hadoop jobs is supported:
 https://issues.apache.org/jira/browse/HADOOP-4012
 
 -- 
 Rohit Bakhshi
 www.hortonworks.com (http://www.hortonworks.com/)
 
 
 
 
 On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote:
 
  Hi All,
  
  I have a cluster of 6 datanodes, all running hadoop version 0.20.2, r911707 
  that take a series of bzip2 compressed text files as input.
  
  I have read conflicting articles regarding whether or not hadoop can split 
  these bzip2 files, can anyone give me a definite answer?
  
   Thanks in advance, Dan.
 






RE: BZip2 Splittable?

2012-02-24 Thread Tim Broberg
Support starts in 0.21, yes. It will soon be backported and available in 1.1.0. 
A patch to 1.0.0 to enable bzip2 splittability is here, 
https://issues.apache.org/jira/browse/HADOOP-7823, if you feel up to patching 
and rebuilding.

- Tim.

From: Rohit Bakhshi [ro...@hortonworks.com]
Sent: Friday, February 24, 2012 7:53 AM
To: common-user@hadoop.apache.org
Subject: Re: BZip2 Splittable?

Daniel,

I just noticed your Hadoop version - 0.20.2.

The JIRA fix below is for Hadoop 0.21.0, which is a different version. So it 
may not be supported on your version of Hadoop.

--
Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)




On Friday, February 24, 2012 at 7:49 AM, Rohit Bakhshi wrote:

 Hi Daniel,

 Bzip2 compression codec allows for splittable files.

 According to this Hadoop JIRA improvement, splitting of bzip2 compressed 
 files in Hadoop jobs is supported:
 https://issues.apache.org/jira/browse/HADOOP-4012

 --
 Rohit Bakhshi
 www.hortonworks.com (http://www.hortonworks.com/)




 On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote:

  Hi All,
 
  I have a cluster of 6 datanodes, all running hadoop version 0.20.2, r911707 
  that take a series of bzip2 compressed text files as input.
 
  I have read conflicting articles regarding whether or not hadoop can split 
  these bzip2 files, can anyone give me a definite answer?
 
  Thanks in advance, Dan.




Re: BZip2 Splittable?

2012-02-24 Thread John Heidemann
On Fri, 24 Feb 2012 15:43:10 GMT, Daniel Baptista wrote: 
Hi All,

I have a cluster of 6 datanodes, all running hadoop version 0.20.2, r911707 
that take a series of bzip2 compressed text files as input.

I have read conflicting articles regarding whether or not hadoop can split 
these bzip2 files, can anyone give me a definite answer?

Thanks in advance, Dan.

Support for bzip2 splitting was only added in 0.21.0; see 
https://issues.apache.org/jira/browse/MAPREDUCE-830

You need to roll forward (or backport the patch) if you want bzip2
splitting.

(And since 1.0.0 is a fork from 0.20-security, it also lacks bzip2
splitting, AFAIK.  Hopefully some future 1.x will pick up more of the
0.21 features.)

   -John Heidemann


RE: Experience with Hadoop in production

2012-02-24 Thread GOEKE, MATTHEW (AG/1000)
I would add that it also depends on how thoroughly you have vetted your use
cases. If you have already ironed out how ad-hoc access works, Kerberos vs.
firewall and network segmentation, how code submission works, procedures for
various operational issues, backup of your data, etc. (the list is a couple
hundred bullets long at minimum...) on your current cluster, then there might be
little need for that support. However, if you are still hoping to figure that
stuff out, you could potentially be in a world of hurt when you attempt the
transition with just your own staff. It also helps to have outside advice in
certain situations, to resolve cross-department conflicts over how the cluster
will be implemented :)

Matt

-Original Message-
From: Mike Lyon [mailto:mike.l...@gmail.com] 
Sent: Thursday, February 23, 2012 2:33 PM
To: common-user@hadoop.apache.org
Subject: Re: Experience with Hadoop in production

Just be sure you have that corporate card available 24x7 when you need
to call support ;)

Sent from my iPhone

On Feb 23, 2012, at 10:30, Serge Blazhievsky
serge.blazhiyevs...@nice.com wrote:

 What I have seen companies do often is use the free version of a
 commercial vendor's distribution and only buy their support if there are
 major problems that they cannot solve on their own.


 That way you get a free distribution and the assurance that you have
 support if something goes wrong.


 Serge

 On 2/23/12 10:42 AM, Jamack, Peter pjam...@consilium1.com wrote:

 A lot of it depends on your staff and their experiences.
 Maybe they don't have Hadoop experience, but if they were involved with large
 databases, data warehouses, etc., they can utilize their skills and experience
 and provide a lot of help.
 If you have Linux admins, system admins, and network admins with years of
 experience, they will be a goldmine. At the other end, database
 developers who know SQL, programmers who know Java, and so on can really
 help staff up your 'big data' team. Having a few people who know ETL would
 be great too.

 The biggest problem I've run into seems to be how big the Hadoop
 project/team is or is not. Sometimes it's just an 'experimental'
 department, and therefore half the people are only 25-50 percent available
 to help out. And if they aren't really that knowledgeable about Hadoop,
 it tends to be one of those not-enough-time-in-the-day scenarios. And
 the few people dedicated to the Hadoop project(s) will get the brunt of
 the work.

 It's like any ecosystem. To do it right, you might need system/network
 admins, a storage person who actually knows how to set up the proper storage
 architecture, maybe a security expert, a few programmers, and a few data
 people. If you're combining analytics, that's another group. Of course,
 most companies outside the Googles and Facebooks of the world will have a
 few people dedicated to Hadoop. Which means you need somebody who knows
 storage, knows networking, knows Linux, knows how to be a system admin,
 knows security, and maybe other things (AKA if you have a firewall issue,
 somebody needs to figure out ways to make it work through or around), and
 then you need some programmers who either know MapReduce or can pretty much
 figure it out because they've done Java for years.

 Peter J

 On 2/23/12 10:17 AM, Pavel Frolov pfro...@gmail.com wrote:

 Hi,

 We are going into 24x7 production soon and we are considering whether we
 need vendor support or not.  We use a free vendor distribution of Cluster
 Provisioning + Hadoop + HBase and looked at their Enterprise version but
 it
 is very expensive for the value it provides (additional functionality +
 support), given that we¹ve already ironed out many of our performance and
 tuning issues on our own and with generous help from the community (e.g.
 all of you).

 So, I wanted to run it through the community to see if anybody can share
 their experience of running a Hadoop cluster (50+ nodes with Apache
 releases or Vendor distributions) in production, with in-house support
 only, and how difficult it was. How many people were involved, etc.

 Regards,
 Pavel





Re: BZip2 Splittable?

2012-02-24 Thread Rohit
Hi Daniel,  

Because your MapReduce jobs will not split bzip2 files, each entire bzip2 file 
will be processed by one Map task. Thus, if your job takes multiple bzip2 text 
files as the input, then you'll have as many Map tasks as you have files 
running in parallel.

The Map tasks will be run by your TaskTrackers. Usually the cluster setup has
the DataNode and the TaskTracker processes running on the same machines - so
with 6 data nodes, you have 6 TaskTrackers.
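
If you want to confirm the split count on your version, here is a minimal
sketch (the input path is a placeholder) that prints how many splits the
old-API TextInputFormat computes; on 0.20.2 you should see one split per
bzip2 file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class CountSplits {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    // e.g. the directory holding the bzip2 input files
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    TextInputFormat in = new TextInputFormat();
    in.configure(conf); // sets up the codec factory used by isSplitable()
    InputSplit[] splits = in.getSplits(conf, 1);
    System.out.println("splits = " + splits.length);
  }
}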

Hope that answers your question.


Rohit Bakhshi



www.hortonworks.com (http://www.hortonworks.com/)



On Friday, February 24, 2012 at 7:59 AM, Daniel Baptista wrote:  
 Hi Rohit, thanks for the response, this is pretty much as I expected and 
 hopefully adds weight to my other thoughts...
  
 Could this mean that all my datanodes are being sent all of the data, or that
 only one datanode is executing the job?

 Thanks again, Dan.
  
 -Original Message-
 From: Rohit Bakhshi [mailto:ro...@hortonworks.com]  
 Sent: 24 February 2012 15:54
 To: common-user@hadoop.apache.org (mailto:common-user@hadoop.apache.org)
 Subject: Re: BZip2 Splittable?
  
 Daniel,  
  
 I just noticed your Hadoop version - 0.20.2.
  
 The JIRA fix below is for Hadoop 0.21.0, which is a different version. So it 
 may not be supported on your version of Hadoop.  
  
 --  
 Rohit Bakhshi
 www.hortonworks.com (http://www.hortonworks.com/)
  
  
  
  
 On Friday, February 24, 2012 at 7:49 AM, Rohit Bakhshi wrote:
  
  Hi Daniel,  
   
  Bzip2 compression codec allows for splittable files.
   
  According to this Hadoop JIRA improvement, splitting of bzip2 compressed 
  files in Hadoop jobs is supported:
  https://issues.apache.org/jira/browse/HADOOP-4012
   
  --  
  Rohit Bakhshi
  www.hortonworks.com (http://www.hortonworks.com/)
   
   
   
   
  On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote:
   
   Hi All,

   I have a cluster of 6 datanodes, all running hadoop version 0.20.2, 
   r911707 that take a series of bzip2 compressed text files as input.

   I have read conflicting articles regarding whether or not hadoop can 
   split these bzip2 files, can anyone give me a definite answer?

   Thanks in advance, Dan.
  
  
 
  



Re: BZip2 Splittable?

2012-02-24 Thread Srinivas Surasani
@Daniel,

If you want to process bz2 files in parallel (more than one mapper/reducer),
you can go for Pig.

See below.

Pig has inbuilt support for processing .bz2 files in parallel (.gz support
is coming soon). If the input file name extension is .bz2, Pig decompresses
the file on the fly and passes the decompressed input stream to your load
function.

Regards,


On Fri, Feb 24, 2012 at 2:59 PM, Rohit ro...@hortonworks.com wrote:

 Hi Daniel,

 Because your MapReduce jobs will not split bzip2 files, each entire bzip2
 file will be processed by one Map task. Thus, if your job takes multiple
 bzip2 text files as the input, then you'll have as many Map tasks as you
 have files running in parallel.

 The Map tasks will be run by your TaskTrackers. Usually the cluster setup
 has the DataNode and the TaskTracker processes running on the same
 machines - so with 6 data nodes, you have 6 TaskTrackers.

 Hope that answers your question.


 Rohit Bakhshi



 www.hortonworks.com (http://www.hortonworks.com/)



 On Friday, February 24, 2012 at 7:59 AM, Daniel Baptista wrote:
  Hi Rohit, thanks for the response, this is pretty much as I expected and
 hopefully adds weight to my other thoughts...
 
  Could this mean that all my datanodes are being sent all of the data or
 that only one datanode is executing the job?

  Thanks again, Dan.
 
  -Original Message-
  From: Rohit Bakhshi [mailto:ro...@hortonworks.com]
  Sent: 24 February 2012 15:54
  To: common-user@hadoop.apache.org (mailto:common-user@hadoop.apache.org)
  Subject: Re: BZip2 Splittable?
 
  Daniel,
 
  I just noticed your Hadoop version - 0.20.2.
 
  The JIRA fix below is for Hadoop 0.21.0, which is a different version.
 So it may not be supported on your version of Hadoop.
 
  --
  Rohit Bakhshi
  www.hortonworks.com (http://www.hortonworks.com/)
 
 
 
 
  On Friday, February 24, 2012 at 7:49 AM, Rohit Bakhshi wrote:
 
   Hi Daniel,
  
   Bzip2 compression codec allows for splittable files.
  
   According to this Hadoop JIRA improvement, splitting of bzip2
 compressed files in Hadoop jobs is supported:
   https://issues.apache.org/jira/browse/HADOOP-4012
  
   --
   Rohit Bakhshi
   www.hortonworks.com (http://www.hortonworks.com/)
  
  
  
  
   On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote:
  
Hi All,
   
I have a cluster of 6 datanodes, all running hadoop version 0.20.2,
 r911707 that take a series of bzip2 compressed text files as input.
   
I have read conflicting articles regarding whether or not hadoop can
 split these bzip2 files, can anyone give me a definite answer?
   
Thanks in advance, Dan.
 
 
  
 




-- 
Regards,
-- Srinivas
srini...@cloudwick.com


PathFilter File Glob

2012-02-24 Thread Heeg, Simon
Hello,

I would like to use a PathFilter to filter, with a regular expression, the
files read by the TextInputFormat, but I don't know how to apply the filter;
I cannot find a setter. Unfortunately Google was not my friend on this issue,
and The Definitive Guide does not help much. I am using Hadoop 0.20.2-cdh3u3.
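
One direction I am looking at, in case it is the right one: the old mapred
API seems to expose a static setter, FileInputFormat.setInputPathFilter().
Below is a minimal sketch of how I assume it is wired up (the regex and class
name are mine, not from any documentation). The filter apparently also gets
applied to directories during listing, so an inclusion pattern has to let
them through explicitly:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexPathFilter implements PathFilter {
  // Hypothetical pattern: keep only files ending in .log
  private static final Pattern PATTERN = Pattern.compile(".*\\.log$");

  public boolean accept(Path path) {
    try {
      // Directories must be accepted, or their contents are never listed.
      if (path.getFileSystem(new Configuration()).getFileStatus(path).isDir()) {
        return true;
      }
    } catch (IOException e) {
      // fall through to the name test
    }
    return PATTERN.matcher(path.getName()).matches();
  }
}

// In the job driver:
// FileInputFormat.setInputPathFilter(conf, RegexPathFilter.class);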

Please Help!

Kind regards
Simon

Deutsche Telekom AG
Products & Innovation
Simon Heeg
Werkstudent
T-Online-Allee 1, 64295 Darmstadt
+49 6151 680-7835 (Tel.)
E-Mail: s.h...@telekom.de
www.telekom.com



MapReduce tuning

2012-02-24 Thread Mohit Anchlia
I am looking at some Hadoop tuning parameters like io.sort.mb,
mapred.child.java.opts, etc.

- Where do I look up the current settings?
- Are these settings configured cluster-wide or per job? (see the sketch below)
- What's the best way to investigate the causes of slow performance?
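
For reference, a minimal sketch of what I mean by per-job settings, using the
old JobConf API (values are placeholders); my understanding is that
cluster-wide defaults come from mapred-site.xml and can be overridden like
this per job unless the cluster marks them final:

import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setInt("io.sort.mb", 200);                  // map-side sort buffer, in MB
    conf.set("mapred.child.java.opts", "-Xmx1024m"); // task JVM options
    // The effective values for a submitted job also show up in its job.xml
    // on the JobTracker web UI.
    System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
  }
}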


Re: Consistent register getProtocolVersion error due to Duplicate metricsName:getProtocolVersion during cluster startup -- then various other errors during job execution

2012-02-24 Thread Ali S Kureishy
Hi again,

Would you be able to make any suggestions on the issue below?

Thanks in advance...

Safdar
On Feb 21, 2012 12:04 PM, Ali S Kureishy safdar.kurei...@gmail.com
wrote:

 Hi,

 I've got a pseudo-distributed Hadoop (v0.20.2) setup with 1 machine (with
 Ubuntu 10.04 LTS) running all the hadoop processes (NN + SNN + JT + TT +
 DN). I've also configured the files under conf/ so that the master is
 referred to by its actual machine name (in this case, bali), instead of
 localhost (however, the issue below is seen regardless). I was able to
 successfully format the HDFS (by running hadoop namenode -format). However,
 right after I deploy the cluster using bin/start-all.sh, I see the
 following error in the NameNode's log file. It is an INFO-level message, but
 I believe it is the root cause behind various other errors I am encountering
 when executing actual Hadoop jobs. (For instance, at one point I see errors
 that the datanode and namenode were communicating using different protocol
 versions ... 3 vs 6 etc.). Anyway, here is the initial error:

 2012-02-21 09:01:42,015 INFO org.apache.hadoop.ipc.Server: Error
 register getProtocolVersion
 java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
 at org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:53)
 at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
 at org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
 I’ve scoured the web searching for other instances of this error, but none
 of the hits were helpful, nor relevant to my setup. My hunch is that this
 is preventing the cluster from correctly initializing. I would have
 switched to a later version of Hadoop, but the Nutch v1.4 distribution I’m
 trying to run on top of Hadoop is, AFAIK, only compatible with Hadoop
 v0.20. I have included with this email all my hadoop config files
 (config.rar), in case you need to take a quick look. Below is my /etc/hosts
 configuration, in case the issue is with that. I believe this is a
 hadoop-specific issue, and not related to Nutch, hence am posting to the
 hadoop mailing list.

 ETC/HOSTS:
 127.0.0.1   localhost
 #127.0.1.1  bali

 # The following lines are desirable for IPv6 capable hosts
 ::1 localhost ip6-localhost ip6-loopback
 fe00::0 ip6-localnet
 ff00::0 ip6-mcastprefix
 ff02::1 ip6-allnodes
 ff02::2 ip6-allrouters

 192.168.1.21 bali

 FILE-SYSTEM layout:
 Here's my filesystem layout. I've got all my hadoop configs pointing to
 folders under a root folder called /private/user/hadoop, with the
 following permissions.

 ls -l /private/user/
 total 4
 drwxrwxrwx 7 user alt 4096 Feb 21 09:06 hadoop

 ls -l /private/user/hadoop/
 total 20
 drwxr-xr-x 5 user alt 4096 Feb 21 09:01 data
 drwxr-xr-x 3 user alt 4096 Feb 21 09:07 mapred
 drwxr-xr-x 4 user alt 4096 Feb 21 08:59 name
 drwxr-xr-x 2 user alt 4096 Feb 21 08:59 pids
 drwxr-xr-x 3 user alt 4096 Feb 21 09:01 tmp

 Shortly after the getProtocolVersion error above, I start seeing these
 errors in the namenode log:
 2012-02-21 09:06:47,895 WARN org.mortbay.log: /getimage:
 java.io.IOException: GetImage failed. java.io.IOException: Server returned
 HTTP response code: 503 for URL:
 http://192.168.1.21:50090/getimage?getimage=1
 at
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
 at
 org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:151)

 at
 org.apache.hadoop.hdfs.server.namenode.GetImageServlet.doGet(GetImageServlet.java:58)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
 at
 org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
 at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
 at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at
 org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
 at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at