Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 What is the corresponding system property for setNumTasks? Can it be used
 explicitly as a system property like mapred.tasks.?


Re: hadoop permission guideline

2012-03-22 Thread Suresh Srinivas
Can you please take this discussion to the CDH mailing list?

On Mar 22, 2012, at 7:51 AM, Michael Wang michael.w...@meredith.com wrote:

 I have installed Cloudera Hadoop (CDH). I used its Cloudera Manager to 
 install all the needed packages. The root user was used for the installation. 
 I found that the installation created some users, such as hdfs, hive, 
 mapred, hue, hbase...
 After the installation, should we change the permissions or ownership of some 
 directories/files? For example, to use Hive: it works fine as the root user, 
 since the metastore directory belongs to root. But in order to let other 
 users use Hive, I have to change the metastore ownership to a specific 
 non-root user; then it works. Is that the best practice?
 Another example is start-all.sh and stop-all.sh; they both belong to root. 
 Should I change them to another user? I guess there are more cases...
 
 Thanks,
 
 
 


Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
Sorry, I meant *setNumMapTasks*. What is mapred.map.tasks for? It's
confusing what its purpose is. I tried setting it for my job, but I
still see more map tasks running than *mapred.map.tasks*.

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J ha...@cloudera.com wrote:

 There isn't such an API as setNumTasks. There is, however,
 setNumReduceTasks, which sets mapred.reduce.tasks.

 Does this answer your question?

 On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  Could someone please help me answer this question?
 
  On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  What is the corresponding system property for setNumTasks? Can it be used
  explicitly as a system property like mapred.tasks.?



 --
 Harsh J
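
For what it's worth, mapred.reduce.tasks can also be supplied on the
command line when the job driver goes through ToolRunner; a minimal
sketch (the jar and class names are placeholders):

hadoop jar myjob.jar MyJob -D mapred.reduce.tasks=4 /input /output

This is equivalent to calling conf.setNumReduceTasks(4) in the driver.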



Re: setNumTasks

2012-03-22 Thread Bejoy Ks
Hi Mohit
  The number of map tasks is determined by the number of input splits,
which depends on the InputFormat used by your MR job. Setting mapred.map.tasks
won't let you control it directly. AFAIK it takes effect only if its value
is greater than the number of tasks the job calculates based on the splits
and the InputFormat.

Regards
Bejoy KS
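
A minimal sketch of both calls in the old mapred API, reflecting the
behavior described above (the class name is a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSetup {
  public static void configure(JobConf conf) {
    // Per the explanation above, this is only a hint: the real map
    // count comes from the input splits and the InputFormat.
    conf.setNumMapTasks(10);      // sets mapred.map.tasks
    // This one is authoritative: the job runs exactly this many reducers.
    conf.setNumReduceTasks(4);    // sets mapred.reduce.tasks
  }
}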

On Thu, Mar 22, 2012 at 8:28 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 Sorry, I meant *setNumMapTasks*. What is mapred.map.tasks for? It's
 confusing what its purpose is. I tried setting it for my job, but I
 still see more map tasks running than *mapred.map.tasks*.

 On Thu, Mar 22, 2012 at 7:53 AM, Harsh J ha...@cloudera.com wrote:

  There isn't such an API as setNumTasks. There is, however,
  setNumReduceTasks, which sets mapred.reduce.tasks.
 
  Does this answer your question?
 
  On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
   Could someone please help me answer this question?
  
   On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
  
   What is the corresponding system property for setNumTasks? Can it be used
   explicitly as a system property like mapred.tasks.?
 
 
 
  --
  Harsh J
 



Re: hadoop permission guideline

2012-03-22 Thread Harsh J
Hi Michael,

Am moving your question to the scm-us...@cloudera.org group which is
home to the community of Cloudera Manager users. You will get better
responses here.

In case you wish to browse or subscribe to this group, visit
https://groups.google.com/a/cloudera.org/forum/#!forum/scm-users

(BCC'd common-user@)

On Thu, Mar 22, 2012 at 8:21 PM, Michael Wang michael.w...@meredith.com wrote:
 I have installed Cloudera Hadoop (CDH). I used its Cloudera Manager to
 install all the needed packages. The root user was used for the installation.
 I found that the installation created some users, such as hdfs, hive,
 mapred, hue, hbase...
 After the installation, should we change the permissions or ownership of
 some directories/files? For example, to use Hive: it works fine as the root
 user, since the metastore directory belongs to root. But in order to let
 other users use Hive, I have to change the metastore ownership to a specific
 non-root user; then it works. Is that the best practice?
 Another example is start-all.sh and stop-all.sh; they both belong to
 root. Should I change them to another user? I guess there are more cases...

 Thanks,






--
Harsh J


Re: setNumTasks

2012-03-22 Thread Shi Yu
If you want to control the number of input splits at fine granularity, 
you could use NLineInputFormat. You need to decide on the number of 
lines per split, so you first need to know the number of lines in your 
input data. For instance,

hadoop fs -text /input/dir/* | wc -l

will give you a number; let's assume it is N.

If you have K nodes and each node has C cores, you can basically run 
K*C map tasks at once. If you further assume that each mapper should 
process 2 splits (so that nodes which finish early can pick up more 
work), then the optimal number of lines per split in NLineInputFormat 
is around

N/(2*K*C)

That might give you an optimal job balance. Remember that 
NLineInputFormat usually takes longer to initialize than other input 
formats, and that the line-based split only counts lines; it is unaware 
of the content length of each line. So in sequence data analysis, if 
some lines are significantly longer than others, the mappers assigned 
the longer lines will be much slower than those assigned the shorter 
ones. Randomly mixing short and long lines before splitting is 
therefore preferable.



Shi
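
A minimal sketch of this setup in the old mapred API (the class name and
parameter are placeholders; linesPerSplit would be the N/(2*K*C) value
computed above):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  public static void configure(JobConf conf, int linesPerSplit) {
    // Each input split will contain linesPerSplit lines of input.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", linesPerSplit);
  }
}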


On 3/22/2012 10:01 AM, Bejoy Ks wrote:

Hi Mohit
   The number of map tasks is determined by the number of input splits,
which depends on the InputFormat used by your MR job. Setting mapred.map.tasks
won't let you control it directly. AFAIK it takes effect only if its value
is greater than the number of tasks the job calculates based on the splits
and the InputFormat.

Regards
Bejoy KS

On Thu, Mar 22, 2012 at 8:28 PM, Mohit Anchlia mohitanch...@gmail.com wrote:


Sorry, I meant *setNumMapTasks*. What is mapred.map.tasks for? It's
confusing what its purpose is. I tried setting it for my job, but I
still see more map tasks running than *mapred.map.tasks*.

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J ha...@cloudera.com wrote:


There isn't such an API as setNumTasks. There is, however,
setNumReduceTasks, which sets mapred.reduce.tasks.

Does this answer your question?

On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:

Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:


What is the corresponding system property for setNumTasks? Can it be used
explicitly as a system property like mapred.tasks.?



--
Harsh J





Re: rack awareness and safemode

2012-03-22 Thread Patai Sangbutsarakum
I restarted the cluster yesterday with rack-awareness enabled.
Things went well; I can confirm that there were no issues at all.

Thank you all again.


On Tue, Mar 20, 2012 at 4:19 PM, Patai Sangbutsarakum
silvianhad...@gmail.com wrote:
 Thank you all.


 On Tue, Mar 20, 2012 at 2:44 PM, Harsh J ha...@cloudera.com wrote:
 John has already addressed your concern. I'd only like to add that
 fixing of replication violations does not require your NN to be in
 safe mode and it won't be. Your worry can hence be voided :)

 On Wed, Mar 21, 2012 at 2:08 AM, Patai Sangbutsarakum
 patai.sangbutsara...@turn.com wrote:
 Thanks for your reply and script. Hopefully it still applies to 0.20.203.
 As far as I can tell from playing with a test cluster, the balancer takes
 care of replica placement.
 I just don't want to fall into the situation where HDFS sits in safemode
 for hours and users can't use Hadoop and start yelping.

 Let's hear from others.


 Thanks
 Patai


 On 3/20/12 1:27 PM, John Meagher john.meag...@gmail.com wrote:

Here's the script I used (all sorts of caveats about it assuming a
replication factor of 3 and no real error handling, etc)...

# Bump replication up to 4 (forcing a new, policy-compliant replica),
# then back down to 3, which drops a violating replica.
for f in `hadoop fsck / | grep "Replica placement policy is violated" |
    head -n8 | awk -F: '{print $1}'`; do
    hadoop fs -setrep -w 4 $f
    hadoop fs -setrep 3 $f
done






 --
 Harsh J


Re: rack awareness and safemode

2012-03-22 Thread John Meagher
Make sure you run hadoop fsck /.  It should report a lot of blocks
with the replica placement policy violated.  In the short term that isn't
anything to worry about, and everything will work fine even with those
errors.  Run the script I sent out earlier to fix those errors and
bring everything into compliance with the new rack-awareness setup.


On Thu, Mar 22, 2012 at 13:36, Patai Sangbutsarakum
silvianhad...@gmail.com wrote:
 I restarted the cluster yesterday with rack-awareness enabled.
 Things went well; I can confirm that there were no issues at all.

 Thank you all again.


 On Tue, Mar 20, 2012 at 4:19 PM, Patai Sangbutsarakum
 silvianhad...@gmail.com wrote:
  Thank you all.


 On Tue, Mar 20, 2012 at 2:44 PM, Harsh J ha...@cloudera.com wrote:
 John has already addressed your concern. I'd only like to add that
 fixing of replication violations does not require your NN to be in
 safe mode and it won't be. Your worry can hence be voided :)

 On Wed, Mar 21, 2012 at 2:08 AM, Patai Sangbutsarakum
 patai.sangbutsara...@turn.com wrote:
  Thanks for your reply and script. Hopefully it still applies to 0.20.203.
  As far as I can tell from playing with a test cluster, the balancer takes
  care of replica placement.
  I just don't want to fall into the situation where HDFS sits in safemode
  for hours and users can't use Hadoop and start yelping.

 Let's hear from others.


 Thanks
 Patai


 On 3/20/12 1:27 PM, John Meagher john.meag...@gmail.com wrote:

Here's the script I used (all sorts of caveats about it assuming a
replication factor of 3 and no real error handling, etc)...

# Bump replication up to 4 (forcing a new, policy-compliant replica),
# then back down to 3, which drops a violating replica.
for f in `hadoop fsck / | grep "Replica placement policy is violated" |
    head -n8 | awk -F: '{print $1}'`; do
    hadoop fs -setrep -w 4 $f
    hadoop fs -setrep 3 $f
done






 --
 Harsh J


Re: tasktracker/jobtracker.. expectation..

2012-03-22 Thread Bejoy Ks
Hi Patai
 The JobTracker automatically handles this situation by re-attempting the
task on different nodes. Could you verify the number of attempts that these
failed tasks made? Was it just one? If more, were all the task attempts
triggered on the same node, and did all of them fail with the same error?
You can get this information from the JobTracker web UI: drill down to the
task level and then further into a failed task.

Regards
Bejoy
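
For reference, a minimal sketch of the retry-related knobs in the old
mapred API (4 attempts per task is the framework default; the class name
is a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class RetrySettings {
  public static void configure(JobConf conf) {
    // A failed task attempt is retried, possibly on other nodes, up to
    // this many times before the whole job is failed.
    conf.setMaxMapAttempts(4);       // mapred.map.max.attempts
    conf.setMaxReduceAttempts(4);    // mapred.reduce.max.attempts
  }
}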

On Thu, Mar 22, 2012 at 11:25 PM, Patai Sangbutsarakum 
silvianhad...@gmail.com wrote:

 Hi all,

 I had a job fail this morning because 2 tasks were trying to write to a
 disk that had somehow turned read-only.
 Originally, I was thinking/dreaming that in this case those 2 tasks would
 automatically be moved to another dn/tt that also has the required data
 block, and the job wouldn't fail.

 I strongly believe that Hadoop can do that, but I just don't know it
 well enough to enable it.

 /dev/sdj1 /hadoop10 ext3 ro,noatime,data=ordered 0 0

 Error initializing attempt_201203211854_2633_m_17_0: EROFS:
 Read-only file system at
 org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at

 org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:496)
 at
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:319)
 at
 org.apache.hadoop.mapred.JobLocalizer.createLocalDirs(JobLocalizer.java:144)
 at
 org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:190)
 at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
 at java.security.AccessController.doPrivileged(Native Method) at
 javax.security.auth.Subject.doAs(Subject.java:396) at

 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
 at
 org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
 at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
 at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
 at
 org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)

 Hope this make sense.
 Patai



Re: Number of retries

2012-03-22 Thread Bejoy KS
Mohit
  If you are writing to a db from a job in an atomic way, this can come
up. You can avoid it only by disabling speculative execution.
Drilling down from the web UI to the task level will show you the tasks
that had multiple attempts.
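
A minimal sketch of turning speculative execution off in the old mapred
API (the class name is a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculation {
  public static void configure(JobConf conf) {
    // With speculation off, only one attempt of a task runs at a time,
    // so a side effect such as a db insert is not duplicated by a
    // parallel speculative attempt. Retries of *failed* attempts can
    // still repeat the side effect, though.
    conf.setMapSpeculativeExecution(false);     // mapred.map.tasks.speculative.execution
    conf.setReduceSpeculativeExecution(false);  // mapred.reduce.tasks.speculative.execution
  }
}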

--Original Message--
From: Mohit Anchlia
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Number of retries
Sent: Mar 23, 2012 01:21

I am seeing a weird problem where there are duplicate rows in the database.
I am wondering if this is because of some internal retries. Is there a way
to look at which tasks were retried? I am not sure what else might cause it,
because when I look at the output data I don't see any duplicates in the
file.



Regards
Bejoy KS

Sent from handheld, please excuse typos.


Re: Number of retries

2012-03-22 Thread Bejoy KS
Hi Mohit
 To add on, duplicates won't be there if your output is written to an HDFS
file, because once one attempt of a task completes, only that attempt's
output file is copied to the final output destination; the files generated
by the other task attempts that are killed are simply discarded.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: Bejoy KS bejoy.had...@gmail.com
Date: Thu, 22 Mar 2012 19:55:55 
To: common-user@hadoop.apache.org
Reply-To: bejoy.had...@gmail.com
Subject: Re: Number of retries

Mohit
  If you are writing to a db from a job in an atomic way, this can come
up. You can avoid it only by disabling speculative execution.
Drilling down from the web UI to the task level will show you the tasks
that had multiple attempts.

--Original Message--
From: Mohit Anchlia
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Number of retries
Sent: Mar 23, 2012 01:21

I am seeing a weird problem where there are duplicate rows in the database.
I am wondering if this is because of some internal retries. Is there a way
to look at which tasks were retried? I am not sure what else might cause it,
because when I look at the output data I don't see any duplicates in the
file.



Regards
Bejoy KS

Sent from handheld, please excuse typos.

number of partitions

2012-03-22 Thread Harun Raşit ER
I wrote a custom partitioner. But when I run in standalone or
pseudo-distributed mode, the number of partitions is always 1. I set the
number of reducers to 4, but the numOfPartitions parameter of my custom
partitioner is still 1 and all my four mappers' results are going to 1
reducer. The other reducers yield empty files.

How can I set the number of partitions in standalone or pseudo-distributed
mode?

Thanks for your help.
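
A minimal sketch of the setup being described, in the old mapred API
(the class name and key/value types are placeholders):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPartitioner implements Partitioner<Text, IntWritable> {
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // numPartitions is the number of reduce tasks configured for the
    // job; conf.setNumReduceTasks(4) should make it 4.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public void configure(JobConf job) { }
}

Worth noting: in Hadoop 1.x the local job runner used in standalone mode
supports at most one reducer, so a partitioner there always sees
numPartitions == 1 regardless of the configured value.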


hadoop on cygwin: tasktracker is throwing error: need help

2012-03-22 Thread Santosh Borse
I have installed Hadoop on Cygwin to help me write MR code in Windows
Eclipse.


2012-03-22 22:19:57,896 ERROR org.apache.hadoop.mapred.TaskTracker: Can not 
start task tracker because java.io.IOException: Failed to set permissions of 
path: \tmp\hadoop-uygwin\mapred\local\ttprivate to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:682)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:655)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:726)
at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1457)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3716)

2012-03-22 22:19:57,897 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:



Config details
--------------
OS: Win 7
Hadoop: hadoop-1.0.1


Please let me know if you can help.


-Santosh

