HBase and ghost regionservers

2012-08-16 Thread Håvard Wahl Kongsgård
Hi, when I start my HBase cluster, HBase sometimes lists 'ghost
regionservers' without any regions:

kongs2.medisin.ntnu.no:60020 1345115409411
requests=0, regions=0, usedHeap=0, maxHeap=0

netstat does not list any service listening on port 60020.

If I start a region server locally, I get one real service and the ghost
server:

kongs2.medisin.ntnu.no:60020 1345119112497
requests=0, regions=64, usedHeap=141, maxHeap=1487
kongs2.medisin.ntnu.no:60020 1345119112497
requests=0, regions=0, usedHeap=0, maxHeap=0

Is it possible to manually blacklist or remove a regionserver via the
hbase shell?

I use the latest HBase from Cloudera.
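
Something like the following (an untested sketch, assuming a 0.92-era CDH4
client API; the class name is just for illustration) should list which
regionservers the master currently considers live. The number after
host:port in the listings above appears to be the server's start code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ListLiveRegionServers {
    public static void main(String[] args) throws Exception {
        // reads hbase-site.xml from the classpath for the quorum address
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // a ghost entry should show up here even though nothing is
        // listening on its port
        for (ServerName sn : admin.getClusterStatus().getServers()) {
            System.out.println(sn.getHostname() + ":" + sn.getPort()
                    + " " + sn.getStartcode());
        }
    }
}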

-Håvard


Relevance of mapreduce.* configuration properties for MR V2

2012-08-16 Thread mg

Hi,

I am currently trying to tune a CDH 4.0.1 cluster running HDFS, YARN, 
and HBase managed by Cloudera Manager 4.0.3 (Free Edition).


In CM, there are a number of options for setting mapreduce.* 
configuration properties on the YARN client page.


Some of the explanations in the GUI still refer to JobTracker and 
TaskTracker, e.g.,

- mapreduce.jobtracker.handler.count,
- mapreduce.tasktracker.map.tasks.maximum,
- mapreduce.tasktracker.reduce.tasks.maximum

I wonder whether these, and a number of other mapreduce.* properties 
(e.g., mapreduce.job.reduces), are actually observed by the MR2 
ApplicationMaster.
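
A quick (untested) way to at least see what the client side resolves for
these properties is to load them through the old-API JobConf, which pulls
in mapred-default.xml and mapred-site.xml; whether the ApplicationMaster
then honors them is exactly the open question (class name is mine):

import org.apache.hadoop.mapred.JobConf;

public class PropCheck {
    public static void main(String[] args) {
        // JobConf's static initializer adds mapred-default.xml and
        // mapred-site.xml as resources, unlike a plain Configuration
        JobConf conf = new JobConf();
        for (String key : new String[] {
                "mapreduce.job.reduces",
                "mapreduce.tasktracker.map.tasks.maximum",
                "mapreduce.tasktracker.reduce.tasks.maximum" }) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}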


Can anyone clarify this, or point me to the relevant documentation?

Thanks,
Martin


Hadoop idioms for reporting cluster and counter stats.

2012-08-16 Thread Jay Vyas
Hi guys: I want to start automating the output of counter stats, cluster
size, etc. at the end of the main MapReduce jobs that we run.  Is there
a simple way to do this?

Here is my current thought:

1) Run all jobs from a driver class (we already do this).

2) At the end of each job, intercept the global counters and write them out
to a text file, presumably on the local fs (see the sketch after this list).

3) Export those files from the local filesystem.

4) Maybe the NameNode also has access to such data, perhaps via an API
(clearly, the Hadoop web UI gets this data from somewhere, e.g., in the
cluster summary header).


-- 
Jay Vyas
MMSB/UCHC


Re: Number of Maps running more than expected

2012-08-16 Thread Bertrand Dechoux
Also, could you tell us more about your task statuses?
You might also have failed tasks...


Bertrand

On Thu, Aug 16, 2012 at 11:01 PM, Bertrand Dechoux decho...@gmail.com wrote:

 Well, there is speculative execution too.

 http://developer.yahoo.com/hadoop/tutorial/module4.html

 *Speculative execution:* One problem with the Hadoop system is that by
 dividing the tasks across many nodes, it is possible for a few slow nodes
 to rate-limit the rest of the program. For example if one node has a slow
 disk controller, then it may be reading its input at only 10% the speed of
 all the other nodes. So when 99 map tasks are already complete, the system
 is still waiting for the final map task to check in, which takes much
 longer than all the other nodes.
 By forcing tasks to run in isolation from one another, individual tasks
 do not know *where* their inputs come from. Tasks trust the Hadoop
 platform to just deliver the appropriate input. Therefore, the same input
 can be processed *multiple times in parallel*, to exploit differences in
 machine capabilities. As most of the tasks in a job are coming to a close,
 the Hadoop platform will schedule redundant copies of the remaining tasks
 across several nodes which do not have other work to perform. This process
 is known as *speculative execution*. When tasks complete, they announce
 this fact to the JobTracker. Whichever copy of a task finishes first
 becomes the definitive copy. If other copies were executing speculatively,
 Hadoop tells the TaskTrackers to abandon the tasks and discard their
 outputs. The Reducers then receive their inputs from whichever Mapper
 completed successfully, first.
 Speculative execution is enabled by default. You can disable speculative
 execution for the mappers and reducers by setting the
 mapred.map.tasks.speculative.execution and
 mapred.reduce.tasks.speculative.execution JobConf options to false,
 respectively.
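
 For the old API, those two switches look like this (a minimal, untested
 sketch; the class name is just for illustration):

 import org.apache.hadoop.mapred.JobConf;

 public class NoSpeculation {
     public static void main(String[] args) {
         JobConf conf = new JobConf();
         // mapred.map.tasks.speculative.execution = false
         conf.setMapSpeculativeExecution(false);
         // mapred.reduce.tasks.speculative.execution = false
         conf.setReduceSpeculativeExecution(false);
         System.out.println("speculative execution disabled in this conf");
     }
 }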



 Can you tell us your configuration with regards to those parameters?

 Regards

 Bertrand

 On Thu, Aug 16, 2012 at 8:36 PM, in.abdul in.ab...@gmail.com wrote:

 Hi Gaurav,
 The number of maps does not depend on the number of blocks; it depends
 on the number of input splits. If you had 100 GB of data in 10 splits,
 you would see only 10 maps.

 Please correct me if I am wrong.

 Thanks and regards,
 Syed abdul kather
 On Aug 16, 2012 7:44 PM, Gaurav Dasgupta [via Lucene] 
 ml-node+s472066n4001631...@n3.nabble.com wrote:

  Hi users,
 
  I am working on a CDH3 cluster of 12 nodes (Task Trackers running on all
  the 12 nodes and 1 node running the Job Tracker).
  In order to perform a WordCount benchmark test, I did the following:
 
  - Executed RandomTextWriter first to create 100 GB of data (note that
    I changed only the test.randomtextwrite.total_bytes parameter; the
    rest are kept at their defaults).
  - Next, executed the WordCount program on that 100 GB dataset.
 
  The Block Size in hdfs-site.xml is set to 128 MB. Now, according to my
  calculation, the total number of Maps to be executed by the wordcount
  job should be 100 GB / 128 MB, i.e., 102400 MB / 128 MB = 800.
  But when I execute the job, it runs a total of 900 Maps, i.e., 100
  extra. So, why this extra number of Maps? My job does complete
  successfully without any error.

  Again, if I don't execute the RandomTextWriter job to create data for
  my wordcount, but instead put my own 100 GB text file in HDFS and run
  WordCount, I can see that the number of Maps matches my calculation,
  i.e., 800.
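
  For reference, that 800 figure as a tiny sketch (assuming one map per
  split and a split size equal to the 128 MB block size; the class name
  is just for illustration):

  public class SplitMath {
      public static void main(String[] args) {
          long dataBytes  = 100L * 1024 * 1024 * 1024; // 100 GB of input
          long splitBytes = 128L * 1024 * 1024;        // 128 MB block/split
          // one map per split, rounding up for a partial final split
          long maps = (dataBytes + splitBytes - 1) / splitBytes;
          System.out.println(maps + " expected maps"); // prints 800
      }
  }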
 
  Can anyone explain this odd behaviour of Hadoop, where the number of
  Maps is off only when the dataset is generated by RandomTextWriter?
  And what is the purpose of these extra Maps?
 
  Regards,
  Gaurav Dasgupta
 
 




 --
 Bertrand Dechoux




-- 
Bertrand Dechoux


Re: Number of Maps running more than expected

2012-08-16 Thread Raj Vishwanathan
You probably have speculative execution on. Extra map and reduce tasks are
run in case some of them fail or run slowly.

Raj


Sent from my iPad
Please excuse the typos. 



Re: Number of Maps running more than expected

2012-08-16 Thread Mohit Anchlia
It would be helpful to see some statistics from both jobs, like bytes read
and written, number of errors, etc.

On Thu, Aug 16, 2012 at 8:02 PM, Raj Vishwanathan rajv...@yahoo.com wrote:

 You probably have speculative execution on. Extra map and reduce tasks
 are run in case some of them fail or run slowly.

 Raj


 Sent from my iPad
 Please excuse the typos.
