What change is needed in OutputCollector to print a custom Writable object?

2009-04-01 Thread Deepak Diwakar
Hi,

I am learning how to get a custom Writable working, so I have implemented a
simple MyWritable class.

I can use the MyWritable object within the map-reduce job, but suppose the
values in the reduce are of type MyWritable and I put them into the
OutputCollector to get the final output. Since the value is a custom object,
I don't get the values themselves in the output file, only an object reference.

What changes/additions do I have to make, and where, so that the output
written to the file handles the custom Writable object?

Thanks & regards,
-- 
- Deepak Diwakar,


datanode but not tasktracker

2009-04-01 Thread Sandhya E
Hi

When a host is listed in the slaves file, both the DataNode and the TaskTracker
are started on that host. Is there a way we can configure a node to be a
datanode but not a tasktracker?

Thanks
Sandhya


Re: Please help!

2009-04-01 Thread Hadooper
Thanks, Ricky.
I am reading your site.

Richard


On Tue, Mar 31, 2009 at 4:59 PM, Ricky Ho r...@adobe.com wrote:

 I have written a blog about Hadoop's implementation couple months back here
 at ...
 http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html

 Note that Hadoop is not about reducing latency.  It is about increasing
 throughput (not throughput per resource) by adding more machines in case
 your problem is data parallel.

 Time-wise:
 If it takes T seconds to process B amount of data, then by using Hadoop
 with N machines, you can process it within cT/N seconds, where the constant c > 1
 accounts for the overhead.

 Space-wise:
 If it takes M amount of memory during the processing, then by using Hadoop
 with N machines, you need M/N + c

 Bandwidth-wise:
 You definitely need more bandwidth because a distributed file system is
 used.  And it also depends on your read / write ratio and how many ways of
 replication.  ... Need more time to think of the formula...

 Rgds,
 Ricky

 -Original Message-
 From: Hadooper [mailto:kusanagiyang.had...@gmail.com]
 Sent: Tuesday, March 31, 2009 3:35 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Please help!

 Thanks, Jim.
 I am very familiar with Google's original publication.

 On Tue, Mar 31, 2009 at 4:31 PM, Jim Twensky jim.twen...@gmail.com
 wrote:

  See the original Map Reduce paper by Google at
  http://labs.google.com/papers/mapreduce.html and please don't spam the
  list.
 
  -jim
 
  On Tue, Mar 31, 2009 at 6:15 PM, Hadooper kusanagiyang.had...@gmail.com
  wrote:
 
   Dear developers,
  
   Is there any detailed example of how Hadoop processes input?
   Article
   http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
   a good idea, but I want to see input data being passed from class to
   class, and how each class manipulates data. The purpose is to analyze
 the
   time and space complexity of Hadoop as a generalized computational
   model/algorithm. I tried to search the web and could not find more
  detail.
   Any pointer/hint?
   Thanks a million.
  
   --
   Cheers! Hadoop core
  
 



 --
 Cheers! Hadoop core




-- 
Cheers! Hadoop core


Re: Please help

2009-04-01 Thread Hadooper
Thanks for the reminder.


On Tue, Mar 31, 2009 at 5:37 PM, Amandeep Khurana ama...@gmail.com wrote:

 Have you read the Map Reduce paper? You might be able to find some pointers
 there for your analysis.



 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Tue, Mar 31, 2009 at 4:28 PM, Hadooper kusanagiyang.had...@gmail.com
 wrote:

  Dear developers,
 
  Is there any detailed example of how Hadoop processes input?
  Article
  http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
  a good idea, but I want to see input data being passed from class to
  class, and how each class manipulates data. The purpose is to analyze the
  time and space complexity of Hadoop as a generalized computational
  model/algorithm. I tried to search the web and could not find more
 detail.
  Any pointer/hint?
  Thanks a million.
 
  --
  Cheers! Hadoop core
 




-- 
Cheers! Hadoop core


Re: What change is needed in OutputCollector to print a custom Writable object?

2009-04-01 Thread Enis Soztutar

Deepak Diwakar wrote:

Hi,

I am learning how to get a custom Writable working, so I have implemented a
simple MyWritable class.

I can use the MyWritable object within the map-reduce job, but suppose the
values in the reduce are of type MyWritable and I put them into the
OutputCollector to get the final output. Since the value is a custom object,
I don't get the values themselves in the output file, only an object reference.

What changes/additions do I have to make, and where, so that the output
written to the file handles the custom Writable object?

Thanks & regards,
  

just implement toString() in your MyWritable class.
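
A minimal sketch of such a class (the field names here are illustrative, not
from the thread): TextOutputFormat turns non-Text values into text by calling
toString() on them, so overriding it is what replaces the default object
reference in the part files.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
    private int count;
    private String label = "";

    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeUTF(label);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
        label = in.readUTF();
    }

    // TextOutputFormat converts values to text by calling toString(), so this
    // is what ends up in the output file instead of the default reference.
    @Override
    public String toString() {
        return label + "\t" + count;
    }
}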


Re: datanode but not tasktracker

2009-04-01 Thread Jean-Daniel Cryans
Sandhya,

You can specify which file to use as the slaves list, so instead of start-all
you can run start-dfs with the normal slaves file and start-mapred with a
different slaves file specified on the command line.

J-D

On Wed, Apr 1, 2009 at 3:58 AM, Sandhya E sandhyabhas...@gmail.com wrote:
 Hi

 When the host is listed in slaves file both DataNode and TaskTracker
 are started on that host. Is there a way in which we can configure a
 node to be datanode and not tasktracker.

 Thanks
 Sandhya



RE: Eclipse version for Hadoop-0.19.1

2009-04-01 Thread Puri, Aseem
Hi
Please tell me which Eclipse version I should use that supports the
hadoop-0.19.0-eclipse-plugin, and where I can download it.


-Original Message-
From: Rasit OZDAS [mailto:rasitoz...@gmail.com] 
Sent: Friday, March 20, 2009 9:19 PM
To: core-user@hadoop.apache.org
Subject: Re: Eclipse version for Hadoop-0.19.1

I also couldn't succeed in running it in Ganymede.
I use Eclipse Europa with 0.19.0. I would give it a try for 0.19.1, though.

2009/3/18 Puri, Aseem aseem.p...@honeywell.com

 I am using Hadoop - HBase 0.18 and my eclipse supports
 hadoop-0.18.0-eclipse-plugin.



   When I switch to Hadoop 0.19.1 and use
 hadoop-0.19.0-eclipse-plugin then my eclipse doesn't show mapreduce
 perspective. I am using Eclipse Platform (GANYMEDE), Version: 3.4.1.



 Can anyone pls tell which version of eclipse supports Hadoop 0.19.1?





 Thanks  Regards

 Aseem Puri








-- 
M. Raşit ÖZDAŞ


Re: datanode but not tasktracker

2009-04-01 Thread Bill Au
Not sure why you would want a node to be a datanode but not a tasktracker,
because you normally want the map/reduce tasks to run where the data is
stored.

Bill

On Wed, Apr 1, 2009 at 3:58 AM, Sandhya E sandhyabhas...@gmail.com wrote:

 Hi

 When the host is listed in slaves file both DataNode and TaskTracker
 are started on that host. Is there a way in which we can configure a
 node to be datanode and not tasktracker.

 Thanks
 Sandhya



RE: Eclipse version for Hadoop-0.19.1

2009-04-01 Thread Puri, Aseem
Rasit,
 Thanks, I got Eclipse Europa and it also supports the map/reduce
perspective.

Sim

-Original Message-
From: Rasit OZDAS [mailto:rasitoz...@gmail.com] 
Sent: Wednesday, April 01, 2009 8:07 PM
To: core-user@hadoop.apache.org
Subject: Re: Eclipse version for Hadoop-0.19.1

Try this page for eclipse europa:
http://rm.mirror.garr.it/mirrors/eclipse/technology/epp/downloads/release/europa/winter/

This is the fastest for me,
if it's too slow, you can download from here:
http://archive.eclipse.org/eclipse/downloads/

Rasit

2009/4/1 Puri, Aseem aseem.p...@honeywell.com

 Hi
Please tell which eclipse version should I use which support
 hadoop-0.19.0-eclipse-plugin and from where I can download it?


 -Original Message-
 From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
 Sent: Friday, March 20, 2009 9:19 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Eclipse version for Hadoop-0.19.1

 I also couldn't succeed in running it in ganymede.
 I use eclipse europa with v. 0.19.0. I would give it a try for 19.1,
 though.

 2009/3/18 Puri, Aseem aseem.p...@honeywell.com

  I am using Hadoop - HBase 0.18 and my eclipse supports
  hadoop-0.18.0-eclipse-plugin.
 
 
 
When I switch to Hadoop 0.19.1 and use
  hadoop-0.19.0-eclipse-plugin then my eclipse doesn't show mapreduce
  perspective. I am using Eclipse Platform (GANYMEDE), Version: 3.4.1.
 
 
 
  Can anyone pls tell which version of eclipse supports Hadoop 0.19.1?
 
 
 
 
 
  Thanks  Regards
 
  Aseem Puri
 
 
 
 
 
 


 --
 M. Raşit ÖZDAŞ




-- 
M. Raşit ÖZDAŞ


Cannot resolve Datanode address in slaves file

2009-04-01 Thread Puri, Aseem
Hi

I have a small Hadoop cluster with 3 machines. One is my
NameNode/JobTracker + DataNode/TaskTracker and the other 2 are
DataNode/TaskTracker only, so I have made all 3 slaves.

 

In the slaves file I have put the names of all three machines:

master
slave
slave1

When I start the Hadoop cluster it always starts the DataNode/TaskTracker on
the last slave in the list and does not start the DataNode/TaskTracker on the
other two machines. I also get this message:

 

slave1:

: no address associated with name

: no address associated with name

slave1: starting datanode, logging to
/home/HadoopAdmin/hadoop/bin/../logs/hadoop-HadoopAdmin-datanode-ie11dtxpficbfise.out

 

If I change the order in the slaves file to:

slave
slave1
master

then the DataNode/TaskTracker starts on the master machine and not on the
other two.

 

Please tell me how I should solve this problem.

 

Sim



Running MapReduce without setJar

2009-04-01 Thread Farhan Husain
Hello,

Can anyone tell me if there is any way of running a map-reduce job from a Java
program without specifying the jar file via the JobConf.setJar() method?

Thanks,

-- 
Mohammad Farhan Husain
Research Assistant
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas


Re: Running MapReduce without setJar

2009-04-01 Thread javateck javateck
I think you need to set a property (mapred.jar) inside hadoop-site.xml; then
you don't need to hardcode it in your Java code.
But I don't know if there is any way to set multiple jars, since a lot of the
time our own map/reduce class needs to reference other jars.
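
A minimal sketch of the jar-related JobConf options discussed here, assuming
the 0.19 JobConf API; the class name is illustrative:

import org.apache.hadoop.mapred.JobConf;

public class JarConfigSketch {
    public static JobConf configure() {
        // The JobConf(Class) constructor locates the jar that contains the
        // given class, so no literal jar path is hardcoded.
        JobConf conf = new JobConf(JarConfigSketch.class);
        // Equivalent alternative:
        // conf.setJarByClass(JarConfigSketch.class);
        // Or rely on the mapred.jar property, e.g. set in hadoop-site.xml as
        // described above.
        return conf;
    }
}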

On Wed, Apr 1, 2009 at 10:57 AM, Farhan Husain russ...@gmail.com wrote:

 Hello,

 Can anyone tell me if there is any way running a map-reduce job from a java
 program without specifying the jar file by JobConf.setJar() method?

 Thanks,

 --
 Mohammad Farhan Husain
 Research Assistant
 Department of Computer Science
 Erik Jonsson School of Engineering and Computer Science
 University of Texas at Dallas



Building LZO on hadoop

2009-04-01 Thread Saptarshi Guha
I checked out hadoop-core-0.19
export CFLAGS=$CUSTROOT/include
export LDFLAGS=$CUSTROOT/lib

(they contain lzo which was built with --shared)
ls $CUSTROOT/include/lzo/
lzo1a.h  lzo1b.h  lzo1c.h  lzo1f.h  lzo1.h  lzo1x.h  lzo1y.h  lzo1z.h
lzo2a.h  lzo_asm.h  lzoconf.h  lzodefs.h  lzoutil.h

ls $CUSTROOT/lib/
liblzo2.so  liblzo.a  liblzo.la  liblzo.so  liblzo.so.1  liblzo.so.2
liblzo.so.2.0.0

I then run (from hadoop-core-0.19.1/)
ant -Dcompile.native=true

I get messages like this (and many others like it):
 [exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler,
rejected by the preprocessor!
 [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the
compiler's result
 [exec] checking for lzo/lzo1x.h... yes
 [exec] checking Checking for the 'actual' dynamic-library for
'-llzo2'... (cached)
 [exec] checking lzo/lzo1y.h usability... yes
 [exec] checking lzo/lzo1y.h presence... no
 [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler,
rejected by the preprocessor!
 [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the
compiler's result
 [exec] checking for lzo/lzo1y.h... yes
 [exec] checking Checking for the 'actual' dynamic-library for
'-llzo2'... (cached)

and finally,
ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c  -fPIC -DPIC
-o .libs/LzoCompressor.o
 [exec] 
/ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:
In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
 [exec] 
/ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137:
error: expected expression before ',' token


Any ideas?
Saptarshi Guha


Re: Running MapReduce without setJar

2009-04-01 Thread Farhan Husain
Can I get rid of the whole jar thing? Is there any way to run map/reduce
programs without using a jar? I do not want to use "hadoop jar ..." either.

On Wed, Apr 1, 2009 at 1:10 PM, javateck javateck javat...@gmail.com wrote:

 I think you need to set a property (mapred.jar) inside hadoop-site.xml,
 then
 you don't need to hardcode in your java code, and it will be fine.
 But I don't know if there is any way that we can set multiple jars, since a
 lot of times our own mapreduce class needs to reference other jars.

 On Wed, Apr 1, 2009 at 10:57 AM, Farhan Husain russ...@gmail.com wrote:

  Hello,
 
  Can anyone tell me if there is any way running a map-reduce job from a
 java
  program without specifying the jar file by JobConf.setJar() method?
 
  Thanks,
 
  --
  Mohammad Farhan Husain
  Research Assistant
  Department of Computer Science
  Erik Jonsson School of Engineering and Computer Science
  University of Texas at Dallas
 




-- 
Mohammad Farhan Husain
Research Assistant
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas


Re: Running MapReduce without setJar

2009-04-01 Thread javateck javateck
you can run from java program:

JobConf conf = new JobConf(MapReduceWork.class);

// setting your params

JobClient.runJob(conf);


On Wed, Apr 1, 2009 at 11:42 AM, Farhan Husain russ...@gmail.com wrote:

 Can I get rid of the whole jar thing? Is there any way to run map reduce
 programs without using a jar? I do not want to use hadoop jar ... either.

 On Wed, Apr 1, 2009 at 1:10 PM, javateck javateck javat...@gmail.com
 wrote:

  I think you need to set a property (mapred.jar) inside hadoop-site.xml,
  then
  you don't need to hardcode in your java code, and it will be fine.
  But I don't know if there is any way that we can set multiple jars, since
 a
  lot of times our own mapreduce class needs to reference other jars.
 
  On Wed, Apr 1, 2009 at 10:57 AM, Farhan Husain russ...@gmail.com
 wrote:
 
   Hello,
  
   Can anyone tell me if there is any way running a map-reduce job from a
  java
   program without specifying the jar file by JobConf.setJar() method?
  
   Thanks,
  
   --
   Mohammad Farhan Husain
   Research Assistant
   Department of Computer Science
   Erik Jonsson School of Engineering and Computer Science
   University of Texas at Dallas
  
 



 --
 Mohammad Farhan Husain
 Research Assistant
 Department of Computer Science
 Erik Jonsson School of Engineering and Computer Science
 University of Texas at Dallas



Re: Building LZO on hadoop

2009-04-01 Thread Saptarshi Guha
Fixed. In the configure script under src/native/, in this block:
  echo 'int main(int argc, char **argv){return 0;}' > conftest.c
  if test -z "`${CC} ${LDFLAGS} -o conftest conftest.c -llzo2 2>&1`"; then
    if test ! -z "`which objdump | grep -v 'no objdump'`"; then
      ac_cv_libname_lzo2="`objdump -p conftest | grep NEEDED | grep lzo2 | sed 's/\W*NEEDED\W*\(.*\)\W*$/\"\1\"/'`"
    elif test ! -z "`which ldd | grep -v 'no ldd'`"; then
      ac_cv_libname_lzo2="`ldd conftest | grep lzo2 | sed 's/^[^A-Za-z0-9]*\([A-Za-z0-9\.]*\)[^A-Za-z0-9]*=>.*$/\"\1\"/'`"
    else
      { { echo "$as_me:$LINENO: error: Can't find either 'objdump' or 'ldd' to compute the dynamic library for '-llzo2'" >&5
echo "$as_me: error: Can't find either 'objdump' or 'ldd' to compute the dynamic library for '-llzo2'" >&2;}
   { (exit 1); exit 1; }; }
    fi
  else
    ac_cv_libname_lzo2=libnotfound.so
  fi
  rm -f conftest*

change lzo2 to lzo.so.2 (again, this depends on what the user has); also set
CFLAGS and LDFLAGS to include your lzo libs/includes.



Saptarshi Guha



On Wed, Apr 1, 2009 at 2:29 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote:
 I checked out hadoop-core-0.19
 export CFLAGS=$CUSTROOT/include
 export LDFLAGS=$CUSTROOT/lib

 (they contain lzo which was built with --shared)
ls $CUSTROOT/include/lzo/
 lzo1a.h  lzo1b.h  lzo1c.h  lzo1f.h  lzo1.h  lzo1x.h  lzo1y.h  lzo1z.h
 lzo2a.h  lzo_asm.h  lzoconf.h  lzodefs.h  lzoutil.h

ls $CUSTROOT/lib/
 liblzo2.so  liblzo.a  liblzo.la  liblzo.so  liblzo.so.1  liblzo.so.2
 liblzo.so.2.0.0

 I then run (from hadoop-core-0.19.1/)
 ant -Dcompile.native=true

 I get messages like : (many others like this)
 exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler,
 rejected by the preprocessor!
     [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the
 compiler's result
     [exec] checking for lzo/lzo1x.h... yes
     [exec] checking Checking for the 'actual' dynamic-library for
 '-llzo2'... (cached)
     [exec] checking lzo/lzo1y.h usability... yes
     [exec] checking lzo/lzo1y.h presence... no
     [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler,
 rejected by the preprocessor!
     [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the
 compiler's result
     [exec] checking for lzo/lzo1y.h... yes
     [exec] checking Checking for the 'actual' dynamic-library for
 '-llzo2'... (cached)

 and finally,
 ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c  -fPIC -DPIC
 -o .libs/LzoCompressor.o
     [exec] 
 /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:
 In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
     [exec] 
 /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137:
 error: expected expression before ',' token


 Any ideas?
 Saptarshi Guha



Reducer side output

2009-04-01 Thread Nagaraj K
Hi,

I am trying to do a side-effect output along with the usual output from the
reducer.
But when attempting the side-effect output, I get the following error.

org.apache.hadoop.fs.permission.AccessControlException: 
org.apache.hadoop.fs.permission.AccessControlException: Permission denied: 
user=nagarajk, access=WRITE, inode=:hdfs:hdfs:rwxr-xr-x
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:52)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2311)
at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:477)
at 
org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:178)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:503)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:391)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:383)
at 
org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1310)
at 
org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1275)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:319)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)

My reducer code:
================
conf.set("group_stat", some_path); // set while configuring the JobConf object

public static class ReducerClass extends MapReduceBase
        implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    FSDataOutputStream part = null;
    JobConf conf;

    public void reduce(Text key, Iterator<DoubleWritable> values,
                       OutputCollector<Text, DoubleWritable> output,
                       Reporter reporter) throws IOException {
        double i_sum = 0.0;
        while (values.hasNext()) {
            i_sum += values.next().get();
        }
        String[] fields = key.toString().split(SEP);
        if (fields.length == 1) {
            if (part == null) {
                FileSystem fs = FileSystem.get(conf);
                String jobpart = conf.get("mapred.task.partition");
                part = fs.create(new Path(conf.get("group_stat"), "/part-000" + jobpart)); // failing here
            }
            part.writeBytes(fields[0] + "\t" + i_sum + "\n");
        } else {
            output.collect(key, new DoubleWritable(i_sum));
        }
    }
}

Can you guys let me know what I am doing wrong here?

Thanks
Nagaraj K
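
One hypothetical direction, not from the thread: the trace shows the task user
has no write permission on the target directory, and a common pattern for side
files is to create them under the task's work output path, which the framework
creates under the job's own output directory. A sketch, assuming the 0.19
mapred API (names illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideFileSketch {
    // Opens a side file inside the task's work output directory instead of an
    // arbitrary HDFS path, so the task only writes where the job already has
    // write access.
    public static FSDataOutputStream openSideFile(JobConf conf) throws IOException {
        Path workDir = FileOutputFormat.getWorkOutputPath(conf);
        Path side = new Path(workDir, "group-stat-" + conf.get("mapred.task.partition"));
        return FileSystem.get(conf).create(side);
    }
}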


Re: datanode but not tasktracker

2009-04-01 Thread Owen O'Malley


On Apr 1, 2009, at 12:58 AM, Sandhya E wrote:


Hi

When the host is listed in slaves file both DataNode and TaskTracker
are started on that host. Is there a way in which we can configure a
node to be datanode and not tasktracker.


If you use hadoop-daemons.sh, you can pass a host list. So do:

ssh namenode hadoop-daemon.sh start namenode
ssh jobtracker hadoop-daemon.sh start jobtracker
hadoop-daemons.sh -hosts dfs.slaves start datanode
hadoop-daemons.sh -hosts mapred.slaves start tasktracker

-- Owen


Re: Building LZO on hadoop

2009-04-01 Thread Saptarshi Guha
Actually, if one installs the latest liblzo and sets CFLAGS, LDFLAGS
and LFLAGS correctly, things work fine.
Saptarshi Guha



On Wed, Apr 1, 2009 at 3:55 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote:
 Fixed. In the configure script src/native/
 change
  echo 'int main(int argc, char **argv){return 0;}'  conftest.c
  if test -z `${CC} ${LDFLAGS} -o conftest conftest.c -llzo2 21`; then
        if test ! -z `which objdump | grep -v 'no objdump'`; then
      ac_cv_libname_lzo2=`objdump -p conftest | grep NEEDED | grep
 lzo2 | sed 's/\W*NEEDED\W*\(.*\)\W*$/\\1\/'`
    elif test ! -z `which ldd | grep -v 'no ldd'`; then
      ac_cv_libname_lzo2=`ldd conftest | grep lzo2 | sed
 's/^[^A-Za-z0-9]*\([A-Za-z0-9\.]*\)[^A-Za-z0-9]*=.*$/\\1\/'`
    else
      { { echo $as_me:$LINENO: error: Can't find either 'objdump' or
 'ldd' to compute the dynamic library for '-llzo2' 5
 echo $as_me: error: Can't find either 'objdump' or 'ldd' to compute
 the dynamic library for '-llzo2' 2;}
   { (exit 1); exit 1; }; }
    fi
  else
    ac_cv_libname_lzo2=libnotfound.so
  fi
  rm -f conftest*

 lzo2 to lzo.so.2 (again this depends on what the user has), also set
 CFLAGS and LDFLAGS to include your lzo libs/incs



 Saptarshi Guha



 On Wed, Apr 1, 2009 at 2:29 PM, Saptarshi Guha saptarshi.g...@gmail.com 
 wrote:
 I checked out hadoop-core-0.19
 export CFLAGS=$CUSTROOT/include
 export LDFLAGS=$CUSTROOT/lib

 (they contain lzo which was built with --shared)
ls $CUSTROOT/include/lzo/
 lzo1a.h  lzo1b.h  lzo1c.h  lzo1f.h  lzo1.h  lzo1x.h  lzo1y.h  lzo1z.h
 lzo2a.h  lzo_asm.h  lzoconf.h  lzodefs.h  lzoutil.h

ls $CUSTROOT/lib/
 liblzo2.so  liblzo.a  liblzo.la  liblzo.so  liblzo.so.1  liblzo.so.2
 liblzo.so.2.0.0

 I then run (from hadoop-core-0.19.1/)
 ant -Dcompile.native=true

 I get messages like : (many others like this)
 exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler,
 rejected by the preprocessor!
     [exec] configure: WARNING: lzo/lzo1x.h: proceeding with the
 compiler's result
     [exec] checking for lzo/lzo1x.h... yes
     [exec] checking Checking for the 'actual' dynamic-library for
 '-llzo2'... (cached)
     [exec] checking lzo/lzo1y.h usability... yes
     [exec] checking lzo/lzo1y.h presence... no
     [exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler,
 rejected by the preprocessor!
     [exec] configure: WARNING: lzo/lzo1y.h: proceeding with the
 compiler's result
     [exec] checking for lzo/lzo1y.h... yes
     [exec] checking Checking for the 'actual' dynamic-library for
 '-llzo2'... (cached)

 and finally,
 ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c  -fPIC -DPIC
 -o .libs/LzoCompressor.o
     [exec] 
 /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:
 In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
     [exec] 
 /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137:
 error: expected expression before ',' token


 Any ideas?
 Saptarshi Guha




hadoop job controller

2009-04-01 Thread Elia Mazzawi


I'm writing a perl program to submit jobs to the cluster,
then wait for the jobs to finish, and check that they have completed 
successfully.


I have some questions,

this shows what is running
./hadoop job  -list

and this shows the completion
./hadoop job -status  job_200903061521_0045


but I want something that just says pass/fail, because with these I have to
check that the job is done and then check that it is 100% complete.


Something like that must exist, since the jobtracker.jsp webapp knows which
jobs passed and which failed.

Also, a controller like this must have been written many times already;
are there any around?


Regards,
Elia
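
For what it's worth, the same pass/fail information is available
programmatically; a minimal sketch, assuming the 0.19 JobClient/RunningJob API
(it takes a job id such as job_200903061521_0045 as its argument):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobPassFail {
    public static void main(String[] args) throws Exception {
        // Connects to the JobTracker named in the local Hadoop configuration.
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName(args[0]));
        if (job == null) {
            System.out.println("UNKNOWN");
        } else if (!job.isComplete()) {
            System.out.println("RUNNING");
        } else {
            System.out.println(job.isSuccessful() ? "PASS" : "FAIL");
        }
    }
}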


Re: Socket closed Exception

2009-04-01 Thread lohit

Thanks Koji, Raghu.
This seemed to solve our problem; we haven't seen it happen in the past 2 days.
What is the typical value of ipc.client.idlethreshold on big clusters?
Does the default value of 4000 suffice?

Lohit



- Original Message 
From: Koji Noguchi knogu...@yahoo-inc.com
To: core-user@hadoop.apache.org
Sent: Monday, March 30, 2009 9:30:04 AM
Subject: RE: Socket closed Exception

Lohit, 

You're right. We saw "java.net.SocketTimeoutException: timed out
waiting for rpc response" and not a "Socket closed" exception.

If you're getting closed exception, then I don't remember seeing that
problem on our clusters.

Our users often report the Socket closed exception as a problem, but in
most cases those failures are due to jobs failing for completely
different reasons, and a race condition between 1) the JobTracker removing
the directory/killing tasks and 2) tasks failing with the closed exception
before they get killed.

Koji



-Original Message-
From: lohit [mailto:lohit...@yahoo.com] 
Sent: Monday, March 30, 2009 8:51 AM
To: core-user@hadoop.apache.org
Subject: Re: Socket closed Exception


Thanks Koji. 
If I look at the code, the NameNode (RPC server) seems to tear down idle
connections. Did you see a 'Socket closed' exception instead of 'timed out
waiting for socket'?
We seem to hit the 'Socket closed' exception where clients do not
time out, but get back a socket closed exception when they make RPCs for
create/open/getFileInfo.

I will give this a try. Thanks again,
Lohit



- Original Message 
From: Koji Noguchi knogu...@yahoo-inc.com
To: core-user@hadoop.apache.org
Sent: Sunday, March 29, 2009 11:44:29 PM
Subject: RE: Socket closed Exception

Hi Lohit,

My initial guess would be
https://issues.apache.org/jira/browse/HADOOP-4040

When this happened on our 0.17 cluster, all of our (task) clients were
using the max idle time of 1 hour due to this bug instead of the
configured value of a few seconds.
Thus each client kept the connection up much longer than we expected.
(Not sure if this applies to your 0.15 cluster, but it sounds similar to
what we observed.)

This worked until the namenode started hitting the max limit of
ipc.client.idlethreshold:

  <name>ipc.client.idlethreshold</name>
  <value>4000</value>
  <description>Defines the threshold number of connections after which
   connections will be inspected for idleness.
  </description>

When inspecting for idleness, namenode uses

  <name>ipc.client.maxidletime</name>
  <value>12</value>
  <description>Defines the maximum idle time for a connected client
   after which it may be disconnected.
  </description>

As a result, many connections got disconnected at once.
Clients only see the timeouts when they try to re-use those sockets the
next time and wait for 1 minute.  That's why they are not exactly at the
same time, but *almost* the same time.


# If this solves your problem, Raghu should get the credit. 
  He spent so many hours to solve this mystery for us. :)


Koji


-Original Message-
From: lohit [mailto:lohit...@yahoo.com] 
Sent: Sunday, March 29, 2009 11:56 AM
To: core-user@hadoop.apache.org
Subject: Socket closed Exception


Recently we are seeing a lot of Socket closed exceptions in our cluster.
Many tasks' open/create/getFileInfo calls get back a 'SocketException'
with the message 'Socket closed'. We see many tasks fail with the same
error around the same time. There are no warning or info messages in the
NameNode/TaskTracker/Task logs. (This is on HDFS 0.15.) Are there cases
where the NameNode closes sockets due to heavy load or contention for
resources of any kind?

Thanks,
Lohit



Profiling with Hadoop 0.17.2.1

2009-04-01 Thread Jimmy Wan
I'm trying to profile my map/reduce processes under Hadoop 0.17.2.
From looking at the hadoop-default.xml, the property
mapred.task.profile.params did not yet exist back then, so I'm
trying to add the following to the mapred.child.java.opts property:
-Xmx512m -verbose:gc
-Xrunhprof:cpu=samples,depth=6,thread=y,file=/tmp/@tas...@.txt
-Xloggc:/tmp/@tas...@.gc

The resulting JVMs don't have the hprof parameters when I look at them
via ps, the files are never created, and there is no mention of dumping
stats in the logs.

Am I missing something?

I'd switch to 0.19.1, but I haven't had time to setup a migration plan
for my data yet.

Jimmy Wan
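
For completeness, a sketch of setting the same options per job from code
rather than in the XML config; the property name mapred.child.java.opts is
real, while the class name and output file are illustrative:

import org.apache.hadoop.mapred.JobConf;

public class ProfiledJobConf {
    public static JobConf create() {
        JobConf conf = new JobConf(ProfiledJobConf.class);
        // Options passed to each child task JVM; job-level settings override
        // the site-wide defaults unless those are marked final.
        conf.set("mapred.child.java.opts",
                 "-Xmx512m -verbose:gc "
                 + "-Xrunhprof:cpu=samples,depth=6,thread=y,file=/tmp/profile.txt");
        return conf;
    }
}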


Re: Join Variation

2009-04-01 Thread jason hadoop
Just for fun, chapter 9 in my book is a worked-through solution to this class
of problem.


On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop jason.had...@gmail.com wrote:

 For the classic map/reduce job, you have 3 requirements.

 1) a comparator that provide the keys in ip address order, such that all
 keys in one of your ranges, would be contiguous, when sorted with the
 comparator
 2) a partitioner that ensures that all keys that should be together end up
 in the same partition
 3) an output value grouping comparator that considers all keys in a
 specified range equal.

 The comparator only sorts by the first part of the key; the search file has
 a 2-part key (begin/end), while the input data has just a 1-part key.

 A partitioner that knew ahead of time the group sets in your search set, in
 the way that the terasort example works, would be ideal:
 i.e., it builds an index of ranges from your search set so that the ranges get
 roughly evenly split between your reduces.
 This requires a pass over the search file to write out a summary file,
 which is then loaded by the partitioner.

 The output value grouping comparator will get the keys in order of the
 first token, will define the start of a group by the presence of a 2-part
 key, and will consider the group ended when either another 2-part key
 appears or the key value is larger than the second part of the
 starting key. This does require that the grouping comparator maintain
 state.

 At this point, your reduce will be called with the first key in the key
 equivalence group of (3), with the values of all of the keys

 In your map, any address that is not in a range of interest is not passed
 to output.collect.

 For the map side join code, you have to define a comparator on the key type
 that defines your definition of equivalence and ordering, and call
 WritableComparator.define( Key.class, comparator.class ), to force the join
 code to use your comparator.

 For tables with duplicates (per the key comparator) in a map-side join, your
 map function will receive a row for every permutation of the duplicate keys:
 if you have one table with a, 1; a, 2; and another table with a, 3; a, 4; your
 map will receive 4 rows: a, 1, 3; a, 1, 4; a, 2, 3; a, 2, 4.
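
A minimal, hypothetical sketch of requirement (1) together with the
WritableComparator.define() registration mentioned above; the class and field
names are illustrative, not from the thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Two-part key: 'start' is the IP (or range start), 'end' is the range end
// (-1 for the plain one-part lookup keys from the main file).
public class IpRangeKey implements WritableComparable<IpRangeKey> {
    long start;
    long end = -1;

    public void write(DataOutput out) throws IOException {
        out.writeLong(start);
        out.writeLong(end);
    }

    public void readFields(DataInput in) throws IOException {
        start = in.readLong();
        end = in.readLong();
    }

    // Ordering (and hence sorting for the join) looks only at the first part.
    public int compareTo(IpRangeKey o) {
        return start < o.start ? -1 : (start == o.start ? 0 : 1);
    }

    public static class FirstPartComparator extends WritableComparator {
        public FirstPartComparator() {
            super(IpRangeKey.class, true); // deserialize keys before comparing
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return ((IpRangeKey) a).compareTo((IpRangeKey) b);
        }
    }

    static {
        // Force the join code to use this comparator, as described above.
        WritableComparator.define(IpRangeKey.class, new FirstPartComparator());
    }
}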



 On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara tamirkam...@gmail.com wrote:

 Thanks to all who replied.

 Stefan -
 I'm unable to see how converting IP ranges to network masks would help
 because different ranges can have the same network mask and with that I
 still have to do a comparison of two fields: the searched IP with
 from-IPmask.

 Pig - I'm familiar with Pig and use it often, but I can't think of a way to
 write a Pig script that will do this type of join. I'll ask the Pig users
 group.

 The search file is indeed large in terms of the number of records. However, I
 don't see this as an issue yet, because I'm still puzzled about how to write
 the job in plain MR. The join code looks for an exact match on the keys,
 and that is not what I need. Would a custom comparator, which looks for a
 match within the ranges, be the right choice for this?

 Thanks,
 Tamir

 On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop jason.had...@gmail.com
 wrote:

  If the search file data set is large, the issue becomes ensuring that
 only
  the required portion of search file is actually read, and that those
 reads
  are ordered, in search file's key order.
 
  If the data set is small, most any of the common patterns will work.
 
  I haven't looked at pig for a while, does pig now use indexes in map
 files,
  and take into account that a data set is sorted?
  Out of the box, the map side join code, org.apache.hadoop.mapred.join
 will
  do a decent job of this, but the entire search file set will be read.
  To stop reading the entire search file, a record reader or join type,
 would
  need to be put together to:
  a) skip to the first key of interest, using the index if available
  b) finish when the last possible key of interest has been delivered.
 
  On Wed, Mar 25, 2009 at 6:05 AM, John Lee j.benlin@gmail.com
 wrote:
 
   In addition to other suggestions, you could also take a look at
   building a Cascading job with a custom Joiner class.
  
   - John
  
   On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara tamirkam...@gmail.com
   wrote:
Hi,
   
We need to implement a Join with a between operator instead of an
  equal.
What we are trying to do is search a file for a key where the key
 falls
between two fields in the search file like this:
   
main file (ip, a, b):
(80, zz, yy)
(125, vv, bb)
   
search file (from-ip, to-ip, d, e):
(52, 75, xxx, yyy)
(78, 98, aaa, bbb)
(99, 115, xxx, ddd)
(125, 130, hhh, aaa)
(150, 162, qqq, sss)
   
the outcome should be in the form (ip, a, b, d, e):
(80, zz, yy, aaa, bbb)
     (125, vv, bb, hhh, aaa)
   
We could convert the ip ranges in the search file to single record
 ips
   and
then do a regular join, 

Strange Reduce Behavior

2009-04-01 Thread Sriram Krishnan

Hi all,

I am new to this list, and relatively new to Hadoop itself. So if this  
question has been answered before, please point me to the right thread.


We are investigating the use of Hadoop for processing geo-spatial
data. In its most basic form, our data is laid out in files where
every row has the format:

{index, x, y, z, }

I am writing some basic Hadoop programs for selecting data based on x
and y values, and everything appears to work correctly. I have Hadoop
0.19.1 running in pseudo-distributed mode on a Linux box. However, as an
academic exercise, I began writing some code that simply reads every
single line of my input file and does nothing else; I hoped to gain
an understanding of how long it would take Hadoop/HDFS to read the
entire data set. My map and reduce functions are as follows:


public void map(LongWritable key, Text value,
                OutputCollector<Text, NullWritable> output,
                Reporter reporter) throws IOException {
    // do nothing
    return;
}

public void reduce(Text key, Iterator<NullWritable> values,
                   OutputCollector<Text, NullWritable> output,
                   Reporter reporter) throws IOException {
    // do nothing
    return;
}

My understanding is that the above map function will produce no  
intermediate key/value pairs - and hence, the reduce function should  
take no time at all. However, when I run this code, Hadoop seems to  
spend an inordinate amount of time in the reduce phase. Here is the  
Hadoop output -


09/04/01 20:11:12 INFO mapred.JobClient: Running job:  
job_200904011958_0005

09/04/01 20:11:13 INFO mapred.JobClient:  map 0% reduce 0%
09/04/01 20:11:21 INFO mapred.JobClient:  map 3% reduce 0%
09/04/01 20:11:25 INFO mapred.JobClient:  map 7% reduce 0%

09/04/01 20:13:17 INFO mapred.JobClient:  map 96% reduce 0%
09/04/01 20:13:20 INFO mapred.JobClient:  map 100% reduce 0%
09/04/01 20:13:30 INFO mapred.JobClient:  map 100% reduce 4%
09/04/01 20:13:35 INFO mapred.JobClient:  map 100% reduce 7%
...
09/04/01 20:14:05 INFO mapred.JobClient:  map 100% reduce 25%
09/04/01 20:14:10 INFO mapred.JobClient:  map 100% reduce 29%
09/04/01 20:14:15 INFO mapred.JobClient: Job complete:  
job_200904011958_0005

09/04/01 20:14:15 INFO mapred.JobClient: Counters: 15
09/04/01 20:14:15 INFO mapred.JobClient:   File Systems
09/04/01 20:14:15 INFO mapred.JobClient: HDFS bytes read=1787707732
09/04/01 20:14:15 INFO mapred.JobClient: Local bytes read=10
09/04/01 20:14:15 INFO mapred.JobClient: Local bytes written=932
09/04/01 20:14:15 INFO mapred.JobClient:   Job Counters
09/04/01 20:14:15 INFO mapred.JobClient: Launched reduce tasks=1
09/04/01 20:14:15 INFO mapred.JobClient: Launched map tasks=27
09/04/01 20:14:15 INFO mapred.JobClient: Data-local map tasks=27
09/04/01 20:14:15 INFO mapred.JobClient:   Map-Reduce Framework
09/04/01 20:14:15 INFO mapred.JobClient: Reduce input groups=1
09/04/01 20:14:15 INFO mapred.JobClient: Combine output records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map input records=44967808
09/04/01 20:14:15 INFO mapred.JobClient: Reduce output records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map output bytes=2
09/04/01 20:14:15 INFO mapred.JobClient: Map input bytes=1787601210
09/04/01 20:14:15 INFO mapred.JobClient: Combine input records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map output records=1
09/04/01 20:14:15 INFO mapred.JobClient: Reduce input records=0

As you can see, the reduce phase takes a little more than a minute -  
which is about a third of the execution time. However, the number of  
reduce tasks spawned is 1, and reduce input records is 0. Why does it  
spend so long on the reduce phase if there are 0 input records to be  
read? Furthermore, if the number of reduce jobs is 1, how is Hadoop  
able to report back the percentage completion of the reduce phase?  
Updating the number of reduce tasks using the  
JobConf.setNumReduceTasks() has no effect on the parallelism of map  
and reduce tasks.


Another interesting aspect is that my Hadoop code to do a select on  
the input files based on x and y values runs faster than my above  
Hadoop code - the select code contains a map function that emits the  
selected rows as intermediate keys, while the reduce code is pretty  
much an identity function. In fact, in this case, I see parallel  
execution of map and reduce tasks. I had thought that my Select code  
should be slower - because not only is it reading every single line of  
input (similar to my above experiment), but it is also doing some  
writes based on the selection criteria.


Thanks in advance for any pointers!
Sriram

--
Sriram Krishnan, Ph.D.
San Diego Supercomputer Center
http://www.sdsc.edu/~sriram
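
An aside, not from the thread: a map-only variant (JobConf.setNumReduceTasks(0))
skips the shuffle/sort/reduce machinery entirely and writes map output straight
through the output format, so it measures pure read time more directly. A
sketch with illustrative class names, assuming the 0.19 mapred API:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ReadOnlyJob {
    public static class NoOpMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            // read and discard
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(ReadOnlyJob.class);
        conf.setJobName("read-only-timing");
        conf.setMapperClass(NoOpMapper.class);
        conf.setNumReduceTasks(0);                 // map-only: no shuffle, no reduce phase
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}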