what change to be done in OutputCollector to print custom writable object
Hi, I am learning how to make a custom Writable work. I have implemented a simple MyWritable class, and I can use MyWritable objects within the map-reduce job. But suppose the values in the reduce are MyWritable objects and I pass them to the OutputCollector to get the final output. Since the value is a custom object, what ends up in the output file is not the value but an object reference. What changes or additions do I have to make, and where, so that the output-to-file step handles the custom Writable object? Thanks and regards, -- Deepak Diwakar
datanode but not tasktracker
Hi When the host is listed in slaves file both DataNode and TaskTracker are started on that host. Is there a way in which we can configure a node to be datanode and not tasktracker. Thanks Sandhya
Re: Please help!
Thanks, Ricky. I am reading your site. Richard

On Tue, Mar 31, 2009 at 4:59 PM, Ricky Ho r...@adobe.com wrote: I have written a blog about Hadoop's implementation a couple of months back, here: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html Note that Hadoop is not about reducing latency. It is about increasing throughput (not throughput per resource) by adding more machines, in case your problem is data parallel.
Time-wise: If it takes T seconds to process B amount of data, then by using Hadoop with N machines you can process it within cT/N seconds, where the constant c > 1 accounts for the overhead.
Space-wise: If it takes M amount of memory during the processing, then by using Hadoop with N machines you need M/N + c
Bandwidth-wise: You definitely need more bandwidth because a distributed file system is used. It also depends on your read/write ratio and how many ways of replication. ... Need more time to think of the formula... Rgds, Ricky

-Original Message- From: Hadooper [mailto:kusanagiyang.had...@gmail.com] Sent: Tuesday, March 31, 2009 3:35 PM To: core-user@hadoop.apache.org Subject: Re: Please help! Thanks, Jim. I am very familiar with Google's original publication.

On Tue, Mar 31, 2009 at 4:31 PM, Jim Twensky jim.twen...@gmail.com wrote: See the original Map Reduce paper by Google at http://labs.google.com/papers/mapreduce.html and please don't spam the list. -jim

On Tue, Mar 31, 2009 at 6:15 PM, Hadooper kusanagiyang.had...@gmail.com wrote: Dear developers, Is there any detailed example of how Hadoop processes input? Article http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives a good idea, but I want to see input data being passed from class to class, and how each class manipulates data. The purpose is to analyze the time and space complexity of Hadoop as a generalized computational model/algorithm. I tried to search the web and could not find more detail. Any pointer/hint? Thanks a million. -- Cheers! Hadoop core -- Cheers! Hadoop core -- Cheers! Hadoop core
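For illustration only (hypothetical numbers, not from the thread): with T = 1000 seconds on a single machine, N = 10 machines, and an overhead constant of c = 1.2, the time-wise estimate above works out to cT/N = 1.2 * 1000 / 10 = 120 seconds.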
Re: Please help
Thanks for the reminder.

On Tue, Mar 31, 2009 at 5:37 PM, Amandeep Khurana ama...@gmail.com wrote: Have you read the Map Reduce paper? You might be able to find some pointers there for your analysis. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz

On Tue, Mar 31, 2009 at 4:28 PM, Hadooper kusanagiyang.had...@gmail.com wrote: Dear developers, Is there any detailed example of how Hadoop processes input? Article http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives a good idea, but I want to see input data being passed from class to class, and how each class manipulates data. The purpose is to analyze the time and space complexity of Hadoop as a generalized computational model/algorithm. I tried to search the web and could not find more detail. Any pointer/hint? Thanks a million. -- Cheers! Hadoop core -- Cheers! Hadoop core
Re: what change to be done in OutputCollector to print custom writable object
Deepak Diwakar wrote: Hi, I am learning how to make a custom Writable work. I have implemented a simple MyWritable class, and I can use MyWritable objects within the map-reduce job. But suppose the values in the reduce are MyWritable objects and I pass them to the OutputCollector to get the final output. Since the value is a custom object, what ends up in the output file is not the value but an object reference. What changes or additions do I have to make, and where, so that the output-to-file step handles the custom Writable object? Thanks and regards,

Just implement toString() in your MyWritable class.
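To illustrate the suggestion above: TextOutputFormat writes keys and values by calling toString() on them, so overriding toString() in the custom Writable controls what lands in the output file. A minimal sketch follows, using the old (0.19-era) org.apache.hadoop.io API; the two fields are made up for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Minimal custom Writable sketch. The fields are hypothetical, chosen only
// to illustrate serialization plus a readable toString().
public class MyWritable implements Writable {
  private long count;
  private String label = "";

  public void write(DataOutput out) throws IOException {
    out.writeLong(count);
    out.writeUTF(label);
  }

  public void readFields(DataInput in) throws IOException {
    count = in.readLong();
    label = in.readUTF();
  }

  // TextOutputFormat uses this string form when writing the final output,
  // instead of the default Object reference representation.
  @Override
  public String toString() {
    return label + "\t" + count;
  }
}

With that in place, output.collect(key, myWritableValue) produces the tab-separated text above rather than a reference string.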
Re: datanode but not tasktracker
Sandhya, You can specify the file to use for slaves so instead of start-all you can start-dfs with the normal slave file and start-mapred with a specified file on the command line. J-D On Wed, Apr 1, 2009 at 3:58 AM, Sandhya E sandhyabhas...@gmail.com wrote: Hi When the host is listed in slaves file both DataNode and TaskTracker are started on that host. Is there a way in which we can configure a node to be datanode and not tasktracker. Thanks Sandhya
RE: Eclipse version for Hadoop-0.19.1
Hi Please tell which eclipse version should I use which support hadoop-0.19.0-eclipse-plugin and from where I can download it? -Original Message- From: Rasit OZDAS [mailto:rasitoz...@gmail.com] Sent: Friday, March 20, 2009 9:19 PM To: core-user@hadoop.apache.org Subject: Re: Eclipse version for Hadoop-0.19.1 I also couldn't succeed in running it in ganymede. I use eclipse europa with v. 0.19.0. I would give it a try for 19.1, though. 2009/3/18 Puri, Aseem aseem.p...@honeywell.com I am using Hadoop - HBase 0.18 and my eclipse supports hadoop-0.18.0-eclipse-plugin. When I switch to Hadoop 0.19.1 and use hadoop-0.19.0-eclipse-plugin then my eclipse doesn't show mapreduce perspective. I am using Eclipse Platform (GANYMEDE), Version: 3.4.1. Can anyone pls tell which version of eclipse supports Hadoop 0.19.1? Thanks Regards Aseem Puri -- M. Raşit ÖZDAŞ
Re: datanode but not tasktracker
Not sure why you would want a node be a datanode but not a tasktracker because you normally would want the map/reduce task to run where the data is stored. Bill On Wed, Apr 1, 2009 at 3:58 AM, Sandhya E sandhyabhas...@gmail.com wrote: Hi When the host is listed in slaves file both DataNode and TaskTracker are started on that host. Is there a way in which we can configure a node to be datanode and not tasktracker. Thanks Sandhya
RE: Eclipse version for Hadoop-0.19.1
Rasit, Thanks, I got eclipse europa and it is also supporting map reduce perspective. Sim -Original Message- From: Rasit OZDAS [mailto:rasitoz...@gmail.com] Sent: Wednesday, April 01, 2009 8:07 PM To: core-user@hadoop.apache.org Subject: Re: Eclipse version for Hadoop-0.19.1 Try this page for eclipse europa: http://rm.mirror.garr.it/mirrors/eclipse/technology/epp/downloads/release/europa/winter/ This is the fastest for me, if it's too slow, you can download from here: http://archive.eclipse.org/eclipse/downloads/ Rasit 2009/4/1 Puri, Aseem aseem.p...@honeywell.com Hi Please tell which eclipse version should I use which support hadoop-0.19.0-eclipse-plugin and from where I can download it? -Original Message- From: Rasit OZDAS [mailto:rasitoz...@gmail.com] Sent: Friday, March 20, 2009 9:19 PM To: core-user@hadoop.apache.org Subject: Re: Eclipse version for Hadoop-0.19.1 I also couldn't succeed in running it in ganymede. I use eclipse europa with v. 0.19.0. I would give it a try for 19.1, though. 2009/3/18 Puri, Aseem aseem.p...@honeywell.com I am using Hadoop - HBase 0.18 and my eclipse supports hadoop-0.18.0-eclipse-plugin. When I switch to Hadoop 0.19.1 and use hadoop-0.19.0-eclipse-plugin then my eclipse doesn't show mapreduce perspective. I am using Eclipse Platform (GANYMEDE), Version: 3.4.1. Can anyone pls tell which version of eclipse supports Hadoop 0.19.1? Thanks Regards Aseem Puri -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Cannot resolve Datonode address in slave file
Hi, I have a small Hadoop cluster with 3 machines. One is my NameNode/JobTracker + DataNode/TaskTracker and the other 2 are DataNode/TaskTracker only, so I have listed all 3 as slaves. In the slaves file I have put the names of all three machines as:

master
slave
slave1

When I start the Hadoop cluster it always starts the DataNode/TaskTracker on the last slave in the list and does not start the DataNode/TaskTracker on the other two machines. I also get the message:

slave1: : no address associated with name
: no address associated with name
slave1: starting datanode, logging to /home/HadoopAdmin/hadoop/bin/../logs/hadoop-HadoopAdmin-datanode-ie11dtxpficbfise.out

If I change the order in the slaves file like this:

slave
slave1
master

then the DataNode/TaskTracker on the master machine starts and not on the other two. Please tell me how I should solve this problem. Sim
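The "no address associated with name" line suggests the slave hostnames are not resolving from the machine running the start scripts. Purely for illustration (hypothetical addresses, and assuming static host entries rather than DNS), resolvable entries in /etc/hosts on each node would look something like:

192.168.0.10  master
192.168.0.11  slave
192.168.0.12  slave1

Whether this applies depends on how name resolution is set up on the cluster.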
Running MapReduce without setJar
Hello, Can anyone tell me if there is any way running a map-reduce job from a java program without specifying the jar file by JobConf.setJar() method? Thanks, -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas
Re: Running MapReduce without setJar
I think you need to set a property (mapred.jar) inside hadoop-site.xml, then you don't need to hardcode in your java code, and it will be fine. But I don't know if there is any way that we can set multiple jars, since a lot of times our own mapreduce class needs to reference other jars. On Wed, Apr 1, 2009 at 10:57 AM, Farhan Husain russ...@gmail.com wrote: Hello, Can anyone tell me if there is any way running a map-reduce job from a java program without specifying the jar file by JobConf.setJar() method? Thanks, -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas
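For reference, the hadoop-site.xml override being suggested would look roughly like this (a sketch only; the jar path is hypothetical, and mapred.jar is the same property that JobConf.setJar() sets programmatically):

<property>
  <name>mapred.jar</name>
  <!-- hypothetical path, for illustration only -->
  <value>/path/to/my-job.jar</value>
</property>

Note that a site-wide setting like this applies to every job sharing that configuration, so whether it is appropriate depends on the deployment.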
Building LZO on hadoop
I checked out hadoop-core-0.19

export CFLAGS=$CUSTROOT/include
export LDFLAGS=$CUSTROOT/lib

(they contain lzo which was built with --shared)

ls $CUSTROOT/include/lzo/
lzo1a.h lzo1b.h lzo1c.h lzo1f.h lzo1.h lzo1x.h lzo1y.h lzo1z.h lzo2a.h lzo_asm.h lzoconf.h lzodefs.h lzoutil.h

ls $CUSTROOT/lib/
liblzo2.so liblzo.a liblzo.la liblzo.so liblzo.so.1 liblzo.so.2 liblzo.so.2.0.0

I then run (from hadoop-core-0.19.1/)

ant -Dcompile.native=true

I get messages like (many others like this):

[exec] configure: WARNING: lzo/lzo1x.h: accepted by the compiler, rejected by the preprocessor!
[exec] configure: WARNING: lzo/lzo1x.h: proceeding with the compiler's result
[exec] checking for lzo/lzo1x.h... yes
[exec] checking Checking for the 'actual' dynamic-library for '-llzo2'... (cached)
[exec] checking lzo/lzo1y.h usability... yes
[exec] checking lzo/lzo1y.h presence... no
[exec] configure: WARNING: lzo/lzo1y.h: accepted by the compiler, rejected by the preprocessor!
[exec] configure: WARNING: lzo/lzo1y.h: proceeding with the compiler's result
[exec] checking for lzo/lzo1y.h... yes
[exec] checking Checking for the 'actual' dynamic-library for '-llzo2'... (cached)

and finally,

ive/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c -fPIC -DPIC -o .libs/LzoCompressor.o
[exec] /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: In function 'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
[exec] /ln/meraki/custom/hadoop-core-0.19.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:137: error: expected expression before ',' token

Any ideas? Saptarshi Guha
Re: Running MapReduce without setJar
Can I get rid of the whole jar thing? Is there any way to run map reduce programs without using a jar? I do not want to use hadoop jar ... either. On Wed, Apr 1, 2009 at 1:10 PM, javateck javateck javat...@gmail.comwrote: I think you need to set a property (mapred.jar) inside hadoop-site.xml, then you don't need to hardcode in your java code, and it will be fine. But I don't know if there is any way that we can set multiple jars, since a lot of times our own mapreduce class needs to reference other jars. On Wed, Apr 1, 2009 at 10:57 AM, Farhan Husain russ...@gmail.com wrote: Hello, Can anyone tell me if there is any way running a map-reduce job from a java program without specifying the jar file by JobConf.setJar() method? Thanks, -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas
Re: Running MapReduce without setJar
you can run from java program: JobConf conf = new JobConf(MapReduceWork.class); // setting your params JobClient.runJob(conf); On Wed, Apr 1, 2009 at 11:42 AM, Farhan Husain russ...@gmail.com wrote: Can I get rid of the whole jar thing? Is there any way to run map reduce programs without using a jar? I do not want to use hadoop jar ... either. On Wed, Apr 1, 2009 at 1:10 PM, javateck javateck javat...@gmail.com wrote: I think you need to set a property (mapred.jar) inside hadoop-site.xml, then you don't need to hardcode in your java code, and it will be fine. But I don't know if there is any way that we can set multiple jars, since a lot of times our own mapreduce class needs to reference other jars. On Wed, Apr 1, 2009 at 10:57 AM, Farhan Husain russ...@gmail.com wrote: Hello, Can anyone tell me if there is any way running a map-reduce job from a java program without specifying the jar file by JobConf.setJar() method? Thanks, -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas
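Expanding slightly on the snippet at the top of this reply (not code from the thread): a driver along these lines avoids calling setJar() explicitly, because the JobConf(Class) constructor locates the jar that contains the given class. This is only a sketch against the 0.19-era API, using the identity mapper and reducer and hypothetical input/output paths:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MapReduceWork {
  public static void main(String[] args) throws Exception {
    // Passing the class tells Hadoop which jar to ship to the cluster,
    // so JobConf.setJar() does not have to be called with a path.
    JobConf conf = new JobConf(MapReduceWork.class);
    conf.setJobName("example");
    conf.setOutputKeyClass(LongWritable.class);   // matches the default TextInputFormat keys
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(conf, new Path("input"));   // hypothetical paths
    FileOutputFormat.setOutputPath(conf, new Path("output"));
    JobClient.runJob(conf);
  }
}

This still assumes the class is packaged in some jar on the client classpath; what it avoids is hard-coding the jar's location in the code.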
Re: Building LZO on hadoop
Fixed. In the configure script under src/native/, change:

echo 'int main(int argc, char **argv){return 0;}' > conftest.c
if test -z "`${CC} ${LDFLAGS} -o conftest conftest.c -llzo2 2>&1`"; then
  if test ! -z "`which objdump | grep -v 'no objdump'`"; then
    ac_cv_libname_lzo2=`objdump -p conftest | grep NEEDED | grep lzo2 | sed 's/\W*NEEDED\W*\(.*\)\W*$/\"\1\"/'`
  elif test ! -z "`which ldd | grep -v 'no ldd'`"; then
    ac_cv_libname_lzo2=`ldd conftest | grep lzo2 | sed 's/^[^A-Za-z0-9]*\([A-Za-z0-9\.]*\)[^A-Za-z0-9]*=>.*$/\"\1\"/'`
  else
    { { echo "$as_me:$LINENO: error: Can't find either 'objdump' or 'ldd' to compute the dynamic library for '-llzo2'" >&5
    echo "$as_me: error: Can't find either 'objdump' or 'ldd' to compute the dynamic library for '-llzo2'" >&2;}
    { (exit 1); exit 1; }; }
  fi
else
  ac_cv_libname_lzo2=libnotfound.so
fi
rm -f conftest*

lzo2 to lzo.so.2 (again this depends on what the user has); also set CFLAGS and LDFLAGS to include your lzo libs/incs. Saptarshi Guha
Reducer side output
Hi, I am trying to do a side-effect output along with the usual output from the reducer. But for the side-effect output attempt, I get the following error:

org.apache.hadoop.fs.permission.AccessControlException: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=nagarajk, access=WRITE, inode=:hdfs:hdfs:rwxr-xr-x
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:52)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2311)
at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:477)
at org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:178)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:503)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:391)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:383)
at org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1310)
at org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1275)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:319)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)

My reducer code:

conf.set("group_stat", some_path); // Set during the configuration of the JobConf object

public static class ReducerClass extends MapReduceBase
    implements Reducer<Text,DoubleWritable,Text,DoubleWritable> {

  FSDataOutputStream part = null;
  JobConf conf;

  public void reduce(Text key, Iterator<DoubleWritable> values,
                     OutputCollector<Text,DoubleWritable> output, Reporter reporter)
      throws IOException {
    double i_sum = 0.0;
    while (values.hasNext()) {
      i_sum += values.next().get();
    }
    String[] fields = key.toString().split(SEP);
    if (fields.length == 1) {
      if (part == null) {
        FileSystem fs = FileSystem.get(conf);
        String jobpart = conf.get("mapred.task.partition");
        part = fs.create(new Path(conf.get("group_stat"), "/part-000" + jobpart)); // Failing here
      }
      part.writeBytes(fields[0] + "\t" + i_sum + "\n");
    } else {
      output.collect(key, new DoubleWritable(i_sum));
    }
  }
}

Can you guys let me know what I am doing wrong here? Thanks, Nagaraj K
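The stack trace shows create() failing because the configured side-effect path points somewhere the task's user cannot write. Not an answer from the thread, but one common pattern for reducer side-effect files is to write them under the task's work output directory, which the task may write to and which is promoted into the job output directory on commit. A rough sketch of how the failing block might look under that approach, assuming a 0.18/0.19-era FileOutputFormat and that the reducer's configure() saved the JobConf into the conf field (names otherwise kept from the post):

// Inside the reducer, instead of creating the file at an arbitrary HDFS path:
if (part == null) {
  // getWorkOutputPath() returns a per-task directory the task may write to;
  // files placed there are moved into the job's output directory if the task succeeds.
  Path side = new Path(org.apache.hadoop.mapred.FileOutputFormat.getWorkOutputPath(conf),
                       "group_stat-" + conf.get("mapred.task.partition"));
  FileSystem fs = side.getFileSystem(conf);
  part = fs.create(side);
}

This is only one option; writing to a user-owned directory with appropriate permissions would also avoid the AccessControlException.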
Re: datanode but not tasktracker
On Apr 1, 2009, at 12:58 AM, Sandhya E wrote: Hi When the host is listed in slaves file both DataNode and TaskTracker are started on that host. Is there a way in which we can configure a node to be datanode and not tasktracker.

If you use hadoop-daemons.sh, you can pass a host list. So do:

ssh namenode hadoop-daemon.sh start namenode
ssh jobtracker hadoop-daemon.sh start jobtracker
hadoop-daemons.sh -hosts dfs.slaves start datanode
hadoop-daemons.sh -hosts mapred.slaves start tasktracker

-- Owen
Re: Building LZO on hadoop
Actually, if one installs the latest liblzo and sets CFLAGS, LDFLAGS and LFLAGS correctly, things work fine. Saptarshi Guha
hadoop job controller
I'm writing a perl program to submit jobs to the cluster, then wait for the jobs to finish and check that they have completed successfully. I have some questions. This shows what is running:

./hadoop job -list

and this shows the completion:

./hadoop job -status job_200903061521_0045

but I want something that just says pass/fail, because with these I have to check that the job is done and then check that it is 100% completed. Something like that must exist, since the webapp jobtracker.jsp knows what is what. Also, a controller like that must have been written many times already; are there any around? Regards, Elia
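Not from the thread, but one way to get a single pass/fail answer is to ask the JobTracker through the Java client API and turn the result into an exit code that a perl wrapper can test. A rough sketch against the 0.19-era org.apache.hadoop.mapred API (the class name is a placeholder; the job id comes in as an argument):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobStatusCheck {
  public static void main(String[] args) throws Exception {
    // args[0] is a job id such as job_200903061521_0045
    JobClient client = new JobClient(new JobConf()); // picks up the configured JobTracker
    RunningJob job = client.getJob(JobID.forName(args[0]));
    if (job == null || !job.isComplete()) {
      System.exit(2);                          // unknown job, or still running
    }
    System.exit(job.isSuccessful() ? 0 : 1);   // 0 = pass, 1 = fail
  }
}

The perl side can then inspect the exit status instead of parsing the -status output.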
Re: Socket closed Exception
Thanks Koji, Raghu. This seemed to solve our problem; we haven't seen this happen in the past 2 days. What is the typical value of ipc.client.idlethreshold on big clusters? Does the default value of 4000 suffice? Lohit

- Original Message - From: Koji Noguchi knogu...@yahoo-inc.com To: core-user@hadoop.apache.org Sent: Monday, March 30, 2009 9:30:04 AM Subject: RE: Socket closed Exception

Lohit, You're right. We saw "java.net.SocketTimeoutException: timed out waiting for rpc response" and not a "Socket closed" exception. If you're getting the closed exception, then I don't remember seeing that problem on our clusters. Our users often report the "Socket closed" exception as a problem, but in most cases those failures are due to jobs failing for completely different reasons and a race condition between 1) the JobTracker removing the directory/killing tasks and 2) tasks failing with the closed exception before they get killed. Koji

-Original Message- From: lohit [mailto:lohit...@yahoo.com] Sent: Monday, March 30, 2009 8:51 AM To: core-user@hadoop.apache.org Subject: Re: Socket closed Exception

Thanks Koji. If I look at the code, the NameNode (RPC server) seems to tear down idle connections. Did you see a 'Socket closed' exception instead of 'timed out waiting for socket'? We seem to hit the 'Socket closed' exception where clients do not time out, but get back a socket closed exception when they do an RPC for create/open/getFileInfo. I will give this a try. Thanks again, Lohit

- Original Message - From: Koji Noguchi knogu...@yahoo-inc.com To: core-user@hadoop.apache.org Sent: Sunday, March 29, 2009 11:44:29 PM Subject: RE: Socket closed Exception

Hi Lohit, My initial guess would be https://issues.apache.org/jira/browse/HADOOP-4040 When this happened on our 0.17 cluster, all of our (task) clients were using the max idle time of 1 hour due to this bug instead of the configured value of a few seconds. Thus each client kept the connection up much longer than we expected. (Not sure if this applies to your 0.15 cluster, but it sounds similar to what we observed.) This worked until the namenode started hitting the max limit of 'ipc.client.idlethreshold':

<name>ipc.client.idlethreshold</name>
<value>4000</value>
<description>Defines the threshold number of connections after which connections will be inspected for idleness.</description>

When inspecting for idleness, the namenode uses:

<name>ipc.client.maxidletime</name>
<value>12</value>
<description>Defines the maximum idle time for a connected client after which it may be disconnected.</description>

As a result, many connections got disconnected at once. Clients only see the timeouts when they try to re-use those sockets the next time and wait for 1 minute. That's why they are not exactly at the same time, but *almost* the same time. # If this solves your problem, Raghu should get the credit. He spent so many hours to solve this mystery for us. :) Koji

-Original Message- From: lohit [mailto:lohit...@yahoo.com] Sent: Sunday, March 29, 2009 11:56 AM To: core-user@hadoop.apache.org Subject: Socket closed Exception

Recently we are seeing a lot of "Socket closed" exceptions in our cluster. Many tasks' open/create/getFileInfo calls get back a 'SocketException' with the message 'Socket closed'. We seem to see many tasks fail with the same error around the same time. There are no warning or info messages in the NameNode/TaskTracker/Task logs. (This is on HDFS 0.15.) Are there cases where the NameNode closes sockets due to heavy load or during contention of resources of any kind? Thanks, Lohit
Profiling with Hadoop 0.17.2.1
I'm trying to profile my map/reduce processes under Hadoop 0.17.2. From looking at the hadoop-default.xml, the property mapred.task.profile.params did not yet exist back then, so I'm trying to add to the property mapred.child.java.opts with -Xmx512m -verbose:gc -Xrunhprof:cpu=samples,depth=6,thread=y,file=/tmp/@tas...@.txt -Xloggc:/tmp/@tas...@.gc The resulting JVMs won't have the hprof parameters when I look at them via PS, the files are never created and there is no mention of dumping stats in the logs. Am I missing something? I'd switch to 0.19.1, but I haven't had time to setup a migration plan for my data yet. Jimmy Wan
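For reference, the kind of hadoop-site.xml override being described would sit roughly as follows; the flags are the ones quoted above, except that the output file names here are simplified placeholders (the post uses Hadoop's per-task substitution token, which the list archive has partly obscured):

<property>
  <name>mapred.child.java.opts</name>
  <!-- file paths are simplified placeholders for this sketch -->
  <value>-Xmx512m -verbose:gc -Xrunhprof:cpu=samples,depth=6,thread=y,file=/tmp/profile.txt -Xloggc:/tmp/profile.gc</value>
</property>

Whether the child JVMs pick this up also depends on the value not being overridden elsewhere and on the tasktrackers seeing the updated configuration.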
Re: Join Variation
Just for fun, chapter 9 in my book is a work-through of solving this class of problem.

On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop jason.had...@gmail.com wrote: For the classic map/reduce job, you have 3 requirements:
1) a comparator that provides the keys in IP address order, such that all keys in one of your ranges would be contiguous when sorted with the comparator;
2) a partitioner that ensures that all keys that should be together end up in the same partition;
3) an output value grouping comparator that considers all keys in a specified range equal.

The comparator only sorts by the first part of the key; the search file has a 2-part key (begin/end), the input data has just a 1-part key. A partitioner that knew ahead of time the group sets in your search set, in the way the terasort example works, would be ideal: i.e., it builds an index of ranges from your search set so that the ranges get roughly evenly split between your reduces. This requires a pass over the search file to write out a summary file, which is then loaded by the partitioner. The output value grouping comparator will get the keys in order of the first token, will define the start of a group by the presence of a 2-part key, and will consider the group ended when either another 2-part key appears or the key value is larger than the second part of the starting key. This does require that the grouping comparator maintain state. At this point, your reduce will be called with the first key in the key equivalence group of (3), with the values of all of the keys. In your map, any address that is not in a range of interest is not passed to output.collect. (A skeleton of this wiring is sketched after this thread.)

For the map-side join code, you have to define a comparator on the key type that defines your definition of equivalence and ordering, and call WritableComparator.define(Key.class, comparator.class) to force the join code to use your comparator. For tables with duplicates, per the key comparator, in a map-side join your map function will receive a row for every permutation of the duplicate keys: if you have one table a,1; a,2; and another table with a,3; a,4; your map will receive 4 rows: a,1,3; a,1,4; a,2,3; a,2,4.

On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara tamirkam...@gmail.com wrote: Thanks to all who replied. Stefan - I'm unable to see how converting IP ranges to network masks would help, because different ranges can have the same network mask, and with that I still have to do a comparison of two fields: the searched IP with from-IPmask. Pig - I'm familiar with pig and use it many times, but I can't think of a way to write a pig script that will do this type of join. I'll ask the pig users group. The search file is indeed large in terms of the number of records. However, I don't see this as an issue yet, because I'm still puzzled over how to write the job in plain MR. The join code is looking for an exact match in the keys and that is not what I need. Would a custom comparator which will look for a match in between the ranges be the right choice to do this? Thanks, Tamir

On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop jason.had...@gmail.com wrote: If the search file data set is large, the issue becomes ensuring that only the required portion of the search file is actually read, and that those reads are ordered in the search file's key order. If the data set is small, most any of the common patterns will work. I haven't looked at pig for a while; does pig now use indexes in map files, and take into account that a data set is sorted?
Out of the box, the map side join code, org.apache.hadoop.mapred.join will do a decent job of this, but the entire search file set will be read. To stop reading the entire search file, a record reader or join type, would need to be put together to: a) skip to the first key of interest, using the index if available b) finish when the last possible key of interest has been delivered. On Wed, Mar 25, 2009 at 6:05 AM, John Lee j.benlin@gmail.com wrote: In addition to other suggestions, you could also take a look at building a Cascading job with a custom Joiner class. - John On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara tamirkam...@gmail.com wrote: Hi, We need to implement a Join with a between operator instead of an equal. What we are trying to do is search a file for a key where the key falls between two fields in the search file like this: main file (ip, a, b): (80, zz, yy) (125, vv, bb) search file (from-ip, to-ip, d, e): (52, 75, xxx, yyy) (78, 98, aaa, bbb) (99, 115, xxx, ddd) (125, 130, hhh, aaa) (150, 162, qqq, sss) the outcome should be in the form (ip, a, b, d, e): (80, zz, yy, aaa, bbb) (125, vv, bb, eee, hhh) We could convert the ip ranges in the search file to single record ips and then do a regular join,
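As a concrete illustration of requirements (1)-(3) from earlier in this thread (not code from the thread itself), the job wiring on the old JobConf API would look roughly like the fragment below. The three comparator/partitioner class names are hypothetical placeholders; the real work of ordering 1-part IP keys, routing them with their containing range, and grouping an IP key with its range key would live inside those classes, which are assumed to exist.

// Hypothetical wiring for the range-join pattern described above.
JobConf conf = new JobConf(IpRangeJoin.class);
// (1) sort order: orders keys by the IP (the first part of the key)
conf.setOutputKeyComparatorClass(IpOrderComparator.class);
// (2) routing: keeps an IP and the range that contains it in the same reduce
conf.setPartitionerClass(IpRangePartitioner.class);
// (3) grouping: treats a range key and every IP key falling inside it as one group
conf.setOutputValueGroupingComparator(IpRangeGroupingComparator.class);

The three setter methods are the standard 0.19-era JobConf calls for key ordering, partitioning, and value grouping; everything else here is a sketch under the stated assumptions.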
Strange Reduce Behavior
Hi all, I am new to this list and relatively new to Hadoop itself, so if this question has been answered before, please point me to the right thread. We are investigating the use of Hadoop for processing of geo-spatial data. In its most basic form, our data is laid out in files, where every row has the format - {index, x, y, z, } I am writing some basic Hadoop programs for selecting data based on x and y values, and everything appears to work correctly. I have Hadoop 0.19.1 running in pseudo-distributed mode on a Linux box.

However, as an academic exercise, I began writing some code that simply reads every single line of my input file and does nothing else - I hoped to gain an understanding of how long it would take Hadoop/HDFS to read the entire data set. My map and reduce functions are as follows:

public void map(LongWritable key, Text value, OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
  // do nothing
  return;
}

public void reduce(Text key, Iterator<NullWritable> values, OutputCollector<Text, NullWritable> output, Reporter reporter) throws IOException {
  // do nothing
  return;
}

My understanding is that the above map function will produce no intermediate key/value pairs - and hence, the reduce function should take no time at all. However, when I run this code, Hadoop seems to spend an inordinate amount of time in the reduce phase. Here is the Hadoop output:

09/04/01 20:11:12 INFO mapred.JobClient: Running job: job_200904011958_0005
09/04/01 20:11:13 INFO mapred.JobClient: map 0% reduce 0%
09/04/01 20:11:21 INFO mapred.JobClient: map 3% reduce 0%
09/04/01 20:11:25 INFO mapred.JobClient: map 7% reduce 0%
09/04/01 20:13:17 INFO mapred.JobClient: map 96% reduce 0%
09/04/01 20:13:20 INFO mapred.JobClient: map 100% reduce 0%
09/04/01 20:13:30 INFO mapred.JobClient: map 100% reduce 4%
09/04/01 20:13:35 INFO mapred.JobClient: map 100% reduce 7%
...
09/04/01 20:14:05 INFO mapred.JobClient: map 100% reduce 25%
09/04/01 20:14:10 INFO mapred.JobClient: map 100% reduce 29%
09/04/01 20:14:15 INFO mapred.JobClient: Job complete: job_200904011958_0005
09/04/01 20:14:15 INFO mapred.JobClient: Counters: 15
09/04/01 20:14:15 INFO mapred.JobClient: File Systems
09/04/01 20:14:15 INFO mapred.JobClient: HDFS bytes read=1787707732
09/04/01 20:14:15 INFO mapred.JobClient: Local bytes read=10
09/04/01 20:14:15 INFO mapred.JobClient: Local bytes written=932
09/04/01 20:14:15 INFO mapred.JobClient: Job Counters
09/04/01 20:14:15 INFO mapred.JobClient: Launched reduce tasks=1
09/04/01 20:14:15 INFO mapred.JobClient: Launched map tasks=27
09/04/01 20:14:15 INFO mapred.JobClient: Data-local map tasks=27
09/04/01 20:14:15 INFO mapred.JobClient: Map-Reduce Framework
09/04/01 20:14:15 INFO mapred.JobClient: Reduce input groups=1
09/04/01 20:14:15 INFO mapred.JobClient: Combine output records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map input records=44967808
09/04/01 20:14:15 INFO mapred.JobClient: Reduce output records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map output bytes=2
09/04/01 20:14:15 INFO mapred.JobClient: Map input bytes=1787601210
09/04/01 20:14:15 INFO mapred.JobClient: Combine input records=0
09/04/01 20:14:15 INFO mapred.JobClient: Map output records=1
09/04/01 20:14:15 INFO mapred.JobClient: Reduce input records=0

As you can see, the reduce phase takes a little more than a minute - which is about a third of the execution time. However, the number of reduce tasks spawned is 1, and reduce input records is 0.
Why does it spend so long on the reduce phase if there are 0 input records to be read? Furthermore, if the number of reduce jobs is 1, how is Hadoop able to report back the percentage completion of the reduce phase? Updating the number of reduce tasks using the JobConf.setNumReduceTasks() has no effect on the parallelism of map and reduce tasks. Another interesting aspect is that my Hadoop code to do a select on the input files based on x and y values runs faster than my above Hadoop code - the select code contains a map function that emits the selected rows as intermediate keys, while the reduce code is pretty much an identity function. In fact, in this case, I see parallel execution of map and reduce tasks. I had thought that my Select code should be slower - because not only is it reading every single line of input (similar to my above experiment), but it is also doing some writes based on the selection criteria. Thanks in advance for any pointers! Sriram -- Sriram Krishnan, Ph.D. San Diego Supercomputer Center http://www.sdsc.edu/~sriram
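One detail worth noting (not from the thread): even with essentially empty map output, a job configured with one reduce still runs the shuffle, sort, and commit machinery across all map outputs, which is where the reported reduce-phase time can go. If the goal is purely to measure read throughput, the reduce phase can be skipped by asking for zero reduces. A hedged sketch on the 0.19-era API follows; the class and path names are hypothetical, and DoNothingMapper stands in for the do-nothing mapper from the post:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver for a map-only scan: with zero reduces, map output
// (here: nothing) is written directly and no shuffle/sort/reduce phase runs.
public class ScanOnly {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ScanOnly.class);
    conf.setJobName("scan-only");
    conf.setMapperClass(DoNothingMapper.class);   // the do-nothing mapper from the post
    conf.setNumReduceTasks(0);                    // skip the reduce phase entirely
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(NullWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}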