Re: How to verify all my master/slave name/data nodes have been configured correctly?

2012-03-08 Thread madhu phatak
Hi,
 Use the JobTracker web UI at master:50030 and the NameNode web UI at
master:50070.
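
If you just need a quick command-line check, bin/hadoop dfsadmin -report (run
on the master) prints the number of live and dead datanodes along with
per-node capacity, which confirms whether every slave has registered with the
namenode; the JobTracker page similarly lists the registered task trackers.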

On Fri, Feb 10, 2012 at 9:03 AM, Wq Az azq...@gmail.com wrote:

 Hi,
 Is there a quick way to check this?
 Thanks ahead,
 Will




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Can I start a Hadoop job from an EJB?

2012-03-08 Thread madhu phatak
Yes, you can. Please make sure all the Hadoop jars and the conf directory are
on the classpath.
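
For illustration only, here is a minimal sketch of submitting a job from
application code with the old mapred API; the class name, job name and paths
are hypothetical, and nothing below is specific to any particular EJB
container:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class EjbJobLauncher {
      // Called from an EJB/MDB; submits the job without blocking the caller.
      public void launch() throws Exception {
        JobConf conf = new JobConf(EjbJobLauncher.class);
        conf.setJobName("launched-from-ejb");           // hypothetical job name
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // conf.setMapperClass(...) / conf.setReducerClass(...) for the real job classes.
        FileInputFormat.setInputPaths(conf, new Path("/data/in"));   // hypothetical paths
        FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
        RunningJob job = new JobClient(conf).submitJob(conf);        // non-blocking submit
        System.out.println("Submitted job " + job.getID());
      }
    }

Submitting with JobClient.submitJob (rather than the blocking runJob) fits the
MDB scenario discussed below, since the container thread is released as soon
as the job is handed to the JobTracker.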

On Thu, Feb 9, 2012 at 7:02 AM, Sanjeev Verma sanjeev.x.ve...@gmail.comwrote:

 This is based on my understanding and no real-life experience, so I'm going to
 go out on a limb here :-)... Assuming that you are planning on kicking off
 this map-reduce job based on an event of sorts (a file arrived and is ready
 to be processed?), and no direct user wait is involved, then yes, I would
 imagine you should be able to do something like this from inside an MDB
 (asynchronous, so no one is held up in the queue). Some random thoughts:

 1. The user under which the app server is running will need to be set up
 as a hadoop client user - this is rather obvious, just wanted to list it
 for completeness.
 2. Hadoop, AFAIK, does not support transactions, and no XA. I assume you
 have no need for any of that stuff either.
 3. Your MDB could potentially log job start/end times, but that info is
 available from Hadoop's monitoring infrastructure also.

 I would be very interested in hearing what senior members on the list have
 to say...

 HTH

 Sanjeev

 On Wed, Feb 8, 2012 at 2:18 PM, Andy Doddington a...@doddington.net
 wrote:

  OK, I have a working Hadoop application that I would like to integrate
  into an application
  server environment. So, the question arises: can I do this? E.g. can I
  create a JobClient
  instance inside an EJB and run it in the normal way, or is something more
  complex
  required? In addition, are there any unpleasant interactions between the
  application
  server and the hadoop runtime?
 
  Thanks for any guidance.
 
 Andy D.




-- 
https://github.com/zinnia-phatak-dev/Nectar


Re: Standalone operation - file permission, Pseudo-Distributed operation - no output

2012-03-08 Thread Jagat
Hello

Can you please tell which version of Hadoop you are using and also

Does your error match the message below?

Failed to set permissions of path:
file:/tmp/hadoop-jj/mapred/staging/jj-1931875024/.staging to 0700

Thanks
Jagat


On Thu, Mar 8, 2012 at 5:10 PM, madhu phatak phatak@gmail.com wrote:

 Hi,
 Just make sure both the task tracker and the data node are up. Go to
 localhost:50030 and check whether it shows the number of nodes equal to 1.

 On Thu, Feb 9, 2012 at 9:18 AM, Kyong-Ho Min kyong-ho@sydney.edu.au
 wrote:

  Hello,
 
  I am a hadoop newbie and I have 2 questions.
 
  I followed Hadoop standalone mode testing.
   I got an error message in the Cygwin terminal, a file permission error.
   I checked the mailing list and changed the relevant part in
  RawLocalFileSystem.java
   but it is not working.
   I still get a file permission error in the directory:
   c:/tmp/hadoop../mapred/staging...
 
 
   I followed the instructions for Pseudo-Distributed operation.
   SSH is OK and namenode -format is OK.
   But it did not return any results and the processing just halted.
  The Cygwin console scripts are
 
  -
  $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
  12/02/09 14:25:44 INFO mapred.FileInputFormat: Total input paths to
  process : 17
  12/02/09 14:25:44 INFO mapred.JobClient: Running job:
 job_201202091423_0001
  12/02/09 14:25:45 INFO mapred.JobClient:  map 0% reduce 0%
  -
 
  Any help pls.
  Thanks.
 
  Kyongho Min
 



 --
 https://github.com/zinnia-phatak-dev/Nectar



Convergence on File Format?

2012-03-08 Thread Michal Klos
Hi,

It seems that Avro is poised to become the file format; is that still the
case?

We've looked at Text, RCFile and Avro. Text is nice, but we'd really need to
extend it. RCFile is great for Hive, but it has been a challenge using it
outside of Hive. Avro has a great feature set, but in our testing it is
significantly slower and larger on disk than RCFile; still, if it has the
highest rate of development, it may be the right choice.

If you were choosing a File Format today to build a general purpose cluster 
(general purpose in the sense of using all the Hadoop tools, not just Hive), 
what would you choose? (one of the choices being development of a Custom format)

Thanks,

Mike



Re: Convergence on File Format?

2012-03-08 Thread Serge Blazhievsky
We started using Avro a few months ago and the results are great!

Easy to use, reliable, feature rich, great integration with MapReduce
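
For anyone evaluating the format, here is a minimal sketch of writing an Avro
container file with the generic Java API; the schema and field names are
invented for the example. Files written this way carry their schema with them
and are splittable, which is what makes them convenient as MapReduce input:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical two-field record schema, declared inline for brevity.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"word\",\"type\":\"string\"},"
            + "{\"name\":\"count\",\"type\":\"int\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("word", "hadoop");
        rec.put("count", 1);

        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("events.avro"));  // container file with embedded schema
        writer.append(rec);
        writer.close();
      }
    }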

On 3/8/12 3:07 PM, Michal Klos mk...@compete.com wrote:

Hi,

It seems that  Avro is poised to become the file format, is that still
the case?

We've looked at Text, RCFile and Avro. Text is nice, but we'd really need
to extend it. RCFile is great for Hive, but it has been a challenge using
it outside of Hive. Avro has a great feature set, but is comparably (to
RCFile) significantly slower and larger on disk in our testing, but if it
has the highest rate of development, it may be the right choice.

If you were choosing a File Format today to build a general purpose
cluster (general purpose in the sense of using all the Hadoop tools, not
just Hive), what would you choose? (one of the choices being development
of a Custom format)

Thanks,

Mike




Re: Profiling Hadoop Job

2012-03-08 Thread Leonardo Urbina
Does anyone have any idea how to solve this problem? Regardless of whether
I'm using plain HPROF or profiling through Starfish, I am getting the same
error:

Exception in thread "main" java.io.FileNotFoundException:
attempt_201203071311_0004_m_00_0.profile (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
at
org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
at
org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
at
com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at
com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

But I can't find what permissions to change to fix this issue. Any ideas?
Thanks in advance,

Best,
-Leo
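
For what it's worth, the trace shows JobClient.downloadProfile failing inside
FileOutputStream.open; that step saves the attempt_*.profile files into the
client's current working directory, so the directory the job is launched from
likely needs to be writable by the submitting user. For reference, the
setProfile* calls quoted below can also be expressed as plain configuration
properties; a minimal sketch, assuming the 0.20-era mapred.task.profile*
property names:

    import org.apache.hadoop.mapred.JobConf;

    public class ProfilingConfigSketch {
      // Equivalent of the setProfile* calls, expressed as raw configuration properties.
      public static JobConf enableHprof(JobConf conf) {
        conf.setBoolean("mapred.task.profile", true);
        conf.set("mapred.task.profile.params",
            "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
        conf.set("mapred.task.profile.maps", "0-2");     // profile the first three map attempts
        conf.set("mapred.task.profile.reduces", "0-2");  // and the first three reduce attempts
        return conf;
      }
    }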


On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina lurb...@mit.edu wrote:

 Thanks,
 -Leo


 On Wed, Mar 7, 2012 at 3:47 PM, Jie Li ji...@cs.duke.edu wrote:

 Hi Leo,

 Thanks for pointing out the outdated README file.  Glad to tell you that
 we
 do support the old API in the latest version. See here:

 http://www.cs.duke.edu/starfish/previous.html

 Welcome to join our mailing list and your questions will reach more of our
 group members.

 Jie

 On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina lurb...@mit.edu wrote:

  Hi Jie,
 
  According to the Starfish README, the hadoop programs must be written
 using
  the new Hadoop API. This is not my case (I am using MultipleInputs among
  other non-new API supported features). Is there any way around this?
  Thanks,
 
  -Leo
 
  On Wed, Mar 7, 2012 at 3:19 PM, Jie Li ji...@cs.duke.edu wrote:
 
   Hi Leonardo,
  
   You might want to try Starfish which supports the memory profiling as
  well
   as cpu/disk/network profiling for the performance tuning.
  
   Jie
   --
   Starfish is an intelligent performance tuning tool for Hadoop.
   Homepage: www.cs.duke.edu/starfish/
   Mailing list: http://groups.google.com/group/hadoop-starfish
  
  
   On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina lurb...@mit.edu
 wrote:
  
Hello everyone,
   
I have a Hadoop job that I run on several GBs of data that I am
 trying
  to
optimize in order to reduce the memory consumption as well as
 improve
  the
speed. I am following the steps outlined in Tom White's Hadoop: The
Definitive Guide for profiling using HPROF (p161), by setting the
following properties in the JobConf:
   
   job.setProfileEnabled(true);
   job.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6," +
       "force=n,thread=y,verbose=n,file=%s");
   job.setProfileTaskRange(true, "0-2");
   job.setProfileTaskRange(false, "0-2");
   
I am trying to run this locally on a single pseudo-distributed
 install
  of
hadoop (0.20.2) and it gives the following error:
   
Exception in thread main java.io.FileNotFoundException:
attempt_201203071311_0004_m_00_0.profile (Permission denied)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.init(FileOutputStream.java:194)
   at java.io.FileOutputStream.init(FileOutputStream.java:84)
   at
   
 org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
   at
   
  
 
 org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
   at
  org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
   at
   
   
  
 
 com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at
   
   
  
 
 com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
 Method)
   at
   
   
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at
   
   
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
   
However, I can access these logs directly from the tasktracker's
 logs
(through the web UI). For the sakes of  running this locally, I
 could
   just
   

Re: Why is hadoop build I generated from a release branch different from release build?

2012-03-08 Thread Matt Foley
Hi Pawan,
The complete way releases are built (for v0.20/v1.0) is documented at
http://wiki.apache.org/hadoop/HowToRelease#Building
However, that does a bunch of stuff you don't need, like generate the
documentation and do a ton of cross-checks.

The full set of ant build targets are defined in build.xml in the top level
of the source code tree.
binary may be the target you want.

--Matt

On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal pawan.agar...@gmail.comwrote:

 Hi,

 I am trying to generate hadoop binaries from source and execute hadoop from
 the build I generate. I am able to build; however, I am seeing that the *bin*
 folder which comes with the hadoop installation is not generated as part of
 my build. Can someone tell me how to do a build that is equivalent to the
 hadoop release build and which can be used directly to run hadoop.

 Here's the details.
 Desktop: Ubuntu Server 11.10
 Hadoop version for installation: 0.20.203.0  (link:
 http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/)
 Hadoop Branch used build: branch-0.20-security-203
 Build Command used: Ant maven-install

 Here's the directory structures from build I generated vs hadoop official
 release build.

 *Hadoop directory which I generated:*
 pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1
 ant
 c++
 classes
 contrib
 examples
 hadoop-0.20-security-203-pawan
 hadoop-ant-0.20-security-203-pawan.jar
 hadoop-core-0.20-security-203-pawan.jar
 hadoop-examples-0.20-security-203-pawan.jar
 hadoop-test-0.20-security-203-pawan.jar
 hadoop-tools-0.20-security-203-pawan.jar
 ivy
 jsvc
 src
 test
 tools
 webapps

 *Official Hadoop build installation*
 pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1
 bin
 build.xml
 c++
 CHANGES.txt
 conf
 contrib
 docs
 hadoop-ant-0.20.203.0.jar
 hadoop-core-0.20.203.0.jar
 hadoop-examples-0.20.203.0.jar
 hadoop-test-0.20.203.0.jar
 hadoop-tools-0.20.203.0.jar
 input
 ivy
 ivy.xml
 lib
 librecordio
 LICENSE.txt
 logs
 NOTICE.txt
 README.txt
 src
 webapps



 Any pointers for help are greatly appreciated.

 Also, if there are any other resources for understanding hadoop build
 system, pointers to that would be also helpful.

 Thanks
 Pawan



Re: Convergence on File Format?

2012-03-08 Thread Russell Jurney
Avro support in Pig will be fairly mature in 0.10.

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On Mar 8, 2012, at 3:10 PM, Serge Blazhievsky
serge.blazhiyevs...@nice.com wrote:

 We started using Avro few month ago and results are great!

 Easy to use, reliable, feature rich, great integration with MapReduce

 On 3/8/12 3:07 PM, Michal Klos mk...@compete.com wrote:

 Hi,

 It seems that  Avro is poised to become the file format, is that still
 the case?

 We've looked at Text, RCFile and Avro. Text is nice, but we'd really need
 to extend it. RCFile is great for Hive, but it has been a challenge using
 it outside of Hive. Avro has a great feature set, but is comparably (to
 RCFile) significantly slower and larger on disk in our testing, but if it
 has the highest rate of development, it may be the right choice.

 If you were choosing a File Format today to build a general purpose
 cluster (general purpose in the sense of using all the Hadoop tools, not
 just Hive), what would you choose? (one of the choices being development
 of a Custom format)

 Thanks,

 Mike




Re: Profiling Hadoop Job

2012-03-08 Thread Mohit Anchlia
Can you check which user you are running this process as and compare it
with the ownership on the directory?

On Thu, Mar 8, 2012 at 3:13 PM, Leonardo Urbina lurb...@mit.edu wrote:

 Does anyone have any idea how to solve this problem? Regardless of whether
 I'm using plain HPROF or profiling through Starfish, I am getting the same
 error:

 Exception in thread "main" java.io.FileNotFoundException:
 attempt_201203071311_0004_m_00_0.profile (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
at
 org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
at
 org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
at

 com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at

 com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 But I can't find what permissions to change to fix this issue. Any ideas?
 Thanks in advance,

 Best,
 -Leo


  On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina lurb...@mit.edu wrote:

  Thanks,
  -Leo
 
 
  On Wed, Mar 7, 2012 at 3:47 PM, Jie Li ji...@cs.duke.edu wrote:
 
  Hi Leo,
 
  Thanks for pointing out the outdated README file.  Glad to tell you that
  we
  do support the old API in the latest version. See here:
 
  http://www.cs.duke.edu/starfish/previous.html
 
  Welcome to join our mailing list and your questions will reach more of
 our
  group members.
 
  Jie
 
  On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina lurb...@mit.edu
 wrote:
 
   Hi Jie,
  
   According to the Starfish README, the hadoop programs must be written
  using
   the new Hadoop API. This is not my case (I am using MultipleInputs
 among
   other non-new API supported features). Is there any way around this?
   Thanks,
  
   -Leo
  
   On Wed, Mar 7, 2012 at 3:19 PM, Jie Li ji...@cs.duke.edu wrote:
  
Hi Leonardo,
   
You might want to try Starfish which supports the memory profiling
 as
   well
as cpu/disk/network profiling for the performance tuning.
   
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish
   
   
On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina lurb...@mit.edu
  wrote:
   
 Hello everyone,

 I have a Hadoop job that I run on several GBs of data that I am
  trying
   to
 optimize in order to reduce the memory consumption as well as
  improve
   the
 speed. I am following the steps outlined in Tom White's Hadoop:
 The
 Definitive Guide for profiling using HPROF (p161), by setting the
 following properties in the JobConf:

job.setProfileEnabled(true);


  job.setProfileParams(-agentlib:hprof=cpu=samples,heap=sites,depth=6,
   +
force=n,thread=y,verbose=n,file=%s);
job.setProfileTaskRange(true, 0-2);
job.setProfileTaskRange(false, 0-2);

 I am trying to run this locally on a single pseudo-distributed
  install
   of
 hadoop (0.20.2) and it gives the following error:

 Exception in thread main java.io.FileNotFoundException:
 attempt_201203071311_0004_m_00_0.profile (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at
 java.io.FileOutputStream.init(FileOutputStream.java:194)
at
 java.io.FileOutputStream.init(FileOutputStream.java:84)
at

  org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
at

   
  
 
 org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
at
   org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
at


   
  
 
 com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
at
 org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at


   
  
 
 com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
  Method)
at


   
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at


   
  
 
 

RE: Why is hadoop build I generated from a release branch different from release build?

2012-03-08 Thread Leo Leung
Hi Pawan,

  ant -p (not for 0.23+) will tell you the available build targets.

  Use mvn (maven) for 0.23 or newer



-Original Message-
From: Matt Foley [mailto:mfo...@hortonworks.com] 
Sent: Thursday, March 08, 2012 3:52 PM
To: common-user@hadoop.apache.org
Subject: Re: Why is hadoop build I generated from a release branch different 
from release build?

Hi Pawan,
The complete way releases are built (for v0.20/v1.0) is documented at
http://wiki.apache.org/hadoop/HowToRelease#Building
However, that does a bunch of stuff you don't need, like generate the 
documentation and do a ton of cross-checks.

The full set of ant build targets are defined in build.xml in the top level of 
the source code tree.
binary may be the target you want.

--Matt

On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal pawan.agar...@gmail.comwrote:

 Hi,

 I am trying to generate hadoop binaries from source and execute hadoop 
 from the build I generate. I am able to build, however I am seeing 
 that as part of build *bin* folder which comes with hadoop 
 installation is not generated in my build. Can someone tell me how to 
 do a build so that I can generate build equivalent to hadoop release 
 build and which can be used directly to run hadoop.

 Here's the details.
 Desktop: Ubuntu Server 11.10
 Hadoop version for installation: 0.20.203.0  (link:
 http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/)
 Hadoop Branch used build: branch-0.20-security-203 Build Command used: 
 Ant maven-install

 Here's the directory structures from build I generated vs hadoop 
 official release build.

 *Hadoop directory which I generated:*
 pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1 ant
 c++
 classes
 contrib
 examples
 hadoop-0.20-security-203-pawan
 hadoop-ant-0.20-security-203-pawan.jar
 hadoop-core-0.20-security-203-pawan.jar
 hadoop-examples-0.20-security-203-pawan.jar
 hadoop-test-0.20-security-203-pawan.jar
 hadoop-tools-0.20-security-203-pawan.jar
 ivy
 jsvc
 src
 test
 tools
 webapps

 *Official Hadoop build installation*
 pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1 
 bin build.xml
 c++
 CHANGES.txt
 conf
 contrib
 docs
 hadoop-ant-0.20.203.0.jar
 hadoop-core-0.20.203.0.jar
 hadoop-examples-0.20.203.0.jar
 hadoop-test-0.20.203.0.jar
 hadoop-tools-0.20.203.0.jar
 input
 ivy
 ivy.xml
 lib
 librecordio
 LICENSE.txt
 logs
 NOTICE.txt
 README.txt
 src
 webapps



 Any pointers for help are greatly appreciated?

 Also, if there are any other resources for understanding hadoop build 
 system, pointers to that would be also helpful.

 Thanks
 Pawan



does hadoop always respect setNumReduceTasks?

2012-03-08 Thread Jane Wayne
i am wondering if hadoop always respects Job.setNumReduceTasks(int)?

as i am emitting items from the mapper, i expect/desire only 1 reducer to
get these items because i want to assign each key of the key-value input
pair a unique integer id. if i had 1 reducer, i can just keep a local
counter (with respect to the reducer instance) and increment it.

on my local hadoop cluster, i noticed that most, if not all, my jobs have
only 1 reducer, regardless of whether or not i set
Job.setNumReduceTasks(int).

however, as soon as i moved the code onto amazon's elastic mapreduce (emr),
i notice that there are multiple reducers. if i set the number of reduce
tasks to 1, is this always guaranteed? i ask because i don't know if there
is a gotcha like the combiner (where it may or may not run at all).

also, it looks like this might not be a good idea just having 1 reducer (it
won't scale). it is most likely better if there are +1 reducers, but in
that case, i lose the ability to assign unique numbers to the key-value
pairs coming in. is there a design pattern out there that addresses this
issue?

my mapper/reducer key-value pair signatures looks something like the
following.

mapper(Text, Text, Text, IntWritable)
reducer(Text, IntWritable, IntWritable, Text)

the mapper reads a sequence file whose key-value pairs are of type Text and
Text. i then emit Text (let's say a word) and IntWritable (let's say
frequency of the word).

the reducer gets the word and its frequencies, and then assigns the word an
integer id. it emits IntWritable (the id) and Text (the word).

i remember seeing code from mahout's API where they assign integer ids to
items. the items were already given an id of type long. the conversion they
make is as follows.

public static int idToIndex(long id) {
  return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
}

is there something equivalent for Text or a word? i was thinking about
simply taking the hash value of the string/word, but of course, different
strings can map to the same hash value.


Re: does hadoop always respect setNumReduceTasks?

2012-03-08 Thread Lance Norskog
Instead of String.hashCode() you can use an MD5-based hash. No collision has
been found in the wild. (MD5 has been broken cryptographically, but that's not
relevant here.)

http://snippets.dzone.com/posts/show/3686

I think the Partitioner class guarantees that you will have multiple reducers.
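
A minimal sketch of that MD5 approach for a word/Text key; the helper name is
made up, and truncating the 128-bit digest to a non-negative 31-bit int means
collisions remain possible in principle, just far less likely than with
String.hashCode():

    import java.nio.charset.Charset;
    import java.security.MessageDigest;

    public final class WordIds {
      // Derive a stable, non-negative int id from a word by truncating its MD5 digest.
      public static int wordToIndex(String word) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
            .digest(word.getBytes(Charset.forName("UTF-8")));
        int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
              | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        return h & 0x7FFFFFFF;  // clear the sign bit, mirroring Mahout's idToIndex
      }
    }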

On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne jane.wayne2...@gmail.com wrote:
 i am wondering if hadoop always respect Job.setNumReduceTasks(int)?

 as i am emitting items from the mapper, i expect/desire only 1 reducer to
 get these items because i want to assign each key of the key-value input
 pair a unique integer id. if i had 1 reducer, i can just keep a local
 counter (with respect to the reducer instance) and increment it.

 on my local hadoop cluster, i noticed that most, if not all, my jobs have
 only 1 reducer, regardless of whether or not i set
 Job.setNumReduceTasks(int).

 however, as soon as i moved the code unto amazon's elastic mapreduce (emr),
 i notice that there are multiple reducers. if i set the number of reduce
 tasks to 1, is this always guaranteed? i ask because i don't know if there
 is a gotcha like the combiner (where it may or may not run at all).

 also, it looks like this might not be a good idea just having 1 reducer (it
 won't scale). it is most likely better if there are +1 reducers, but in
 that case, i lose the ability to assign unique numbers to the key-value
 pairs coming in. is there a design pattern out there that addresses this
 issue?

 my mapper/reducer key-value pair signatures looks something like the
 following.

 mapper(Text, Text, Text, IntWritable)
 reducer(Text, IntWritable, IntWritable, Text)

 the mapper reads a sequence file whose key-value pairs are of type Text and
 Text. i then emit Text (let's say a word) and IntWritable (let's say
 frequency of the word).

 the reducer gets the word and its frequencies, and then assigns the word an
 integer id. it emits IntWritable (the id) and Text (the word).

 i remember seeing code from mahout's API where they assign integer ids to
 items. the items were already given an id of type long. the conversion they
 make is as follows.

 public static int idToIndex(long id) {
   return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
 }

 is there something equivalent for Text or a word? i was thinking about
 simply taking the hash value of the string/word, but of course, different
 strings can map to the same hash value.



-- 
Lance Norskog
goks...@gmail.com


Best way for setting up a large cluster

2012-03-08 Thread Masoud

Hi all,

I installed hadoop on a pilot cluster with 3 machines and am now going to
build our actual cluster with 32 nodes.
As you know, setting up hadoop separately on every node is time
consuming and not an ideal approach.

What's the best way or tool to set up a hadoop cluster (except cloudera)?

Thanks,
B.S


Re: Best way for setting up a large cluster

2012-03-08 Thread Joey Echeverria
Something like puppet is a good choice. There are example puppet
manifests available for most Hadoop-related projects in Apache BigTop,
for example:

https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/

-Joey

On Thu, Mar 8, 2012 at 9:42 PM, Masoud mas...@agape.hanyang.ac.kr wrote:
 Hi all,

 I installed hadoop in a pilot cluster with 3 machines and now going to make
 our actual cluster with 32 nodes.
 as you know setting up hadoop separately in every nodes is time consuming
 and not perfect way.
 whats the best way or tool to setup hadoop cluster (expect cloudera)?

 Thanks,
 B.S



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Getting different results every time I run the same job on the cluster

2012-03-08 Thread Mark Kerzner
Hi,

I have to admit, I am lost. My code http://frd.org/ is stable on a
pseudo-distributed cluster, but every time I run it on a 4-slave
cluster, I get different results, ranging from 100 output lines to 4,000
output lines, whereas the real answer on my standalone is about 2000.

I look at the logs and see no exceptions, so I am totally lost. Where
should I look?

Thank you,
Mark


Re: Why is hadoop build I generated from a release branch different from release build?

2012-03-08 Thread Pawan Agarwal
Thanks for all the replies. It turns out that the build generated by ant has
the bin and conf (etc.) folders one level above. And I looked at the hadoop
scripts, and apparently they look for the right jars both in the root directory
and in the root/build/ directory as well, so I think I am covered for now.

Thanks again!

On Thu, Mar 8, 2012 at 4:15 PM, Leo Leung lle...@ddn.com wrote:

 Hi Pawan,

  ant -p (not for 0.23+) will tell you the available build targets.

  Use mvn (maven) for 0.23 or newer



 -Original Message-
 From: Matt Foley [mailto:mfo...@hortonworks.com]
 Sent: Thursday, March 08, 2012 3:52 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Why is hadoop build I generated from a release branch
 different from release build?

 Hi Pawan,
 The complete way releases are built (for v0.20/v1.0) is documented at
http://wiki.apache.org/hadoop/HowToRelease#Building
 However, that does a bunch of stuff you don't need, like generate the
 documentation and do a ton of cross-checks.

 The full set of ant build targets are defined in build.xml in the top
 level of the source code tree.
 binary may be the target you want.

 --Matt

 On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal pawan.agar...@gmail.com
 wrote:

  Hi,
 
  I am trying to generate hadoop binaries from source and execute hadoop
  from the build I generate. I am able to build, however I am seeing
  that as part of build *bin* folder which comes with hadoop
  installation is not generated in my build. Can someone tell me how to
  do a build so that I can generate build equivalent to hadoop release
  build and which can be used directly to run hadoop.
 
  Here's the details.
  Desktop: Ubuntu Server 11.10
  Hadoop version for installation: 0.20.203.0  (link:
  http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/)
  Hadoop Branch used build: branch-0.20-security-203 Build Command used:
  Ant maven-install
 
  Here's the directory structures from build I generated vs hadoop
  official release build.
 
  *Hadoop directory which I generated:*
  pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1 ant
  c++
  classes
  contrib
  examples
  hadoop-0.20-security-203-pawan
  hadoop-ant-0.20-security-203-pawan.jar
  hadoop-core-0.20-security-203-pawan.jar
  hadoop-examples-0.20-security-203-pawan.jar
  hadoop-test-0.20-security-203-pawan.jar
  hadoop-tools-0.20-security-203-pawan.jar
  ivy
  jsvc
  src
  test
  tools
  webapps
 
  *Official Hadoop build installation*
  pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1
  bin build.xml
  c++
  CHANGES.txt
  conf
  contrib
  docs
  hadoop-ant-0.20.203.0.jar
  hadoop-core-0.20.203.0.jar
  hadoop-examples-0.20.203.0.jar
  hadoop-test-0.20.203.0.jar
  hadoop-tools-0.20.203.0.jar
  input
  ivy
  ivy.xml
  lib
  librecordio
  LICENSE.txt
  logs
  NOTICE.txt
  README.txt
  src
  webapps
 
 
 
  Any pointers for help are greatly appreciated?
 
  Also, if there are any other resources for understanding hadoop build
  system, pointers to that would be also helpful.
 
  Thanks
  Pawan
 



Hadoop-Pig setup question

2012-03-08 Thread Atul Thapliyal
Hi Hadoop users,

I am a new member; please let me know if this is not the correct format to
ask questions.

I am trying to set up a small Hadoop cluster where I will run Pig queries.
The Hadoop cluster is running fine, but when I run a pig query it just hangs.

Note - Pig runs fine in local mode

So I narrowed down the errors to the following -

I have a secondary name node on a different machine (.e.g node 2).

Point 1.
When I execute start-mapred.sh on node 2, I get an
ssh_exchange_identification: closed by remote host message, BUT the
secondary name node starts with no error messages in the log;
I can even access it through port 50030.

So far no errors

Point 2.
When I try to run a map-reduce job, I get a java.net.ConnectException:
Connection refused error in the secondary name node log files.

Are point 1 and point 2 related?

Any hints / pointers on how to solve this? Also, the ssh timeout is set to 20,
so I am assuming that error is not because of this.

Thanks for reading

-- 
Warm Regards
Atul


Hadoop node name problem

2012-03-08 Thread 韶隆吴
Hi All:
   I'm trying to use hadoop, zookeeper and hbase to build a NoSQL
database, but after I got hadoop and zookeeper working well and went to
install hbase, it reported an exception:
BindException: Problem binding to /202.106.199.37:60020: Cannot assign
requested address
My PC's IP/host is 192.168.1.91 slave1.
Then I looked at http://192.168.1.90:50070/dfsnodelist.jsp?whatNodes=LIVE
(the master) and saw the node list like this:
Node
web30 - http://web30.bbn.com.cn:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F
bt-199-036 - http://bt-199-036.bta.net.cn:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F
202.106.199.37 - http://202.106.199.37:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F
I want to know why the node names look like this and how to fix it.


Re: Java Heap space error

2012-03-08 Thread hadoopman
I'm curious whether you have been able to track down the cause of the error.
We've seen similar problems with loading data, and I've discovered that if I
presort my data before the load, things go a LOT smoother.


When running queries against our data, sometimes we've seen the jobtracker
just freeze. I've seen heap out-of-memory errors when I cranked up jobtracker
logging to debug. Still working on figuring this one out; should be an
interesting ride :D
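
On the heap-dump question quoted below: a minimal sketch, assuming the
0.20-era mapred.child.java.opts property, which supplies the JVM flags for
each task attempt. The -Xmx value and dump path are placeholders, and note
that setting this property replaces the default child options rather than
appending to them:

    import org.apache.hadoop.mapred.JobConf;

    public class TaskHeapDumpConfig {
      // Ask every task JVM to write a heap dump on OutOfMemoryError.
      public static JobConf withHeapDumps(JobConf conf) {
        conf.set("mapred.child.java.opts",
            "-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/task-dumps");
        return conf;
      }
    }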




On 03/06/2012 11:10 AM, Mohit Anchlia wrote:

I am still trying to see how to narrow this down. Is it possible to set
heapdumponoutofmemoryerror option on these individual tasks?

On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchliamohitanch...@gmail.comwrote:







state of HOD

2012-03-08 Thread Stijn De Weirdt
(my apologies to those who have received this already. i posted this 
mail a few days back on the common-dev list, as this is more a 
development related mail; but one of the original authors/maintainers 
suggested to also post this here)


hi all,

i am a system administrator/user support person/... for the HPC team at 
Ghent University (Ghent, Flanders, Belgium).


recently we have been asked to look into support for hadoop. for the 
moment we are holding off on a dedicated cluster (esp dedicated hdfs setup).


but as all our systems are torque/pbs based, we looked into HOD to help 
out our users.
we have started from the HOD code that was part of the hadoop 1.0.0 
release (in the contrib part).
at first it was not working, but we have been patching and cleaning up 
the code for a few weeks and now have a version that works for us (we 
had to add some features besides fixing a few things).
it looks sufficient for now, although we will add some more features 
soon to get the users started.



my question is the following: what is the state of HOD atm? is it still 
maintained/supported? are there forks somewhere that have more 
up-to-date code?
what we are now missing most is the documentation (eg 
http://hadoop.apache.org/common/docs/r0.16.4/hod.html) so we can update 
this with our extra features. is the source available somewhere?


i could contribute back all patches, but a few of them are indentation 
fixes (to use 4 space indentation throughout the code) and other 
cosmetic changes, so this messes up patches a lot.
i have also shuffled a bit with the options (rename and/or move to other 
sections) so no 100% backwards compatibility with the current HOD code.


current main improvements:
- works with python 2.5 and up (we have been testing with 2.7.2)
- set options through environment variables
- better default values (we can now run with empty hodrc file)
- support for mail and nodes:ppn for pbs
- no deprecation warnings from hadoop (nearly finished)
- host-mask to bind xrs addr on non-default ip (in case you have 
non-standard network on the compute nodes)

- more debug statements
- gradual code cleanup (using pylint)

on the todo list:
- further tuning of hadoop parameters (i'm not a hadoop user myself, so 
this will take some time)

- 0.23.X support



many thanks,

stijn