Inputs of MapReduce
Hello to all, I'm a novice at working with MapReduce and I'm developing a MapReduce job that takes XML documents as input. How can I prepare the input files and specify them to the map function? Thanks for your help. Best regards, Khaled
access to HDFS via web browser
Hi to all, I have configured Hadoop and all the daemons are running, but when I try to access HDFS through a web browser it fails. Thanks for your help. Best regards, Khaled
Re: access to HDFS via web browser
What messages are shown? jzhang

On Tue, Jul 13, 2010 at 5:15 PM, Khaled BEN BAHRI khaled.ben_ba...@it-sudparis.eu wrote: Hi to all, I have configured Hadoop and all the daemons are running, but when I try to access HDFS through a web browser it fails. Thanks for your help. Best regards, Khaled
Re: Inputs of MapReduce
Khaled, Hadoop MapReduce natively reads input files line by line, and XML files are not made up of single-line records. So you will either have to pack each XML document into a single line, or write your own input format, for which you will need to refer to a guide book.

2010/7/13 Khaled BEN BAHRI khaled.ben_ba...@it-sudparis.eu: Hello to all, I'm a novice at working with MapReduce and I'm developing a MapReduce job that takes XML documents as input. How can I prepare the input files and specify them to the map function? Thanks for your help. Best regards, Khaled
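To make the first suggestion concrete, here is a minimal sketch (not from the original thread) of a mapper against the 0.20 org.apache.hadoop.mapreduce API. It assumes every input line holds one complete XML document; the <title> element it extracts is purely hypothetical.

import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlLineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final IntWritable one = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      // Each line is expected to be one self-contained XML document.
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(value.toString())));
      // Emit the text of a (hypothetical) <title> element with a count of 1.
      String title = doc.getElementsByTagName("title").item(0).getTextContent();
      context.write(new Text(title), one);
    } catch (Exception e) {
      // Skip malformed lines instead of failing the whole task.
    }
  }
}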
Please help...
Please help me, I can't figure out how to fix this problem. I have a cluster of virtual machines under VMware (Windows XP is the host OS): Ubuntu 8.10, Intel Pentium Dual CPU E2180 @ 2 GHz, 1024 MB of memory. I have a namenode and 8 datanodes. I want to run the teragen and terasort programs and do a benchmark analysis of the cluster running 1, 3 and all 8 datanodes. The datanodes have only 20 GB of configured HDFS capacity each, so roughly 150 GB in total. I have no problem generating the input data with 2 or 8 maps, but the problem shows up with terasort. When it reaches the reduce phase, it generates the following error:

10/07/13 10:59:40 INFO mapred.JobClient: Task Id : attempt_201007131052_0002_r_00_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

As I understand it, I have to set these parameters in mapred-site.xml to override the default values:

<property>
  <name>mapred.map.tasks</name>
  <value>?</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>?</value>
</property>

Does anyone know how to set the number of reducers so that it works :). Thank you...
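For reference, a filled-in mapred-site.xml entry would look like the sketch below; the value of 8 is only a hypothetical placeholder (one reducer per datanode), not a recommendation from this thread.

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
  <!-- hypothetical value: usually chosen relative to the number of reduce slots in the cluster -->
</property>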
Re: Inputs of MapReduce
We tried using the Hadoop streaming XML format a while ago and it didn't quite go as expected. I don't remember why, but it gave some weird results: missing some records, getting to 98% complete and then stopping, etc. The Mahout project also has an XmlInputFormat [1] that we ended up using. I also posted something on my blog about it all [2], and a little about my understanding (so far) of input formats, record readers, etc. Hope that helps, Paul

1. http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
2. http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

On 13 Jul 2010, at 12:26, Shuja Rehman wrote: Hi Khaled, XML files can be processed using Hadoop streaming. Check out the following link: http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F Regards, Shuja
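For anyone wanting to try the Mahout XmlInputFormat [1], here is a rough sketch of a driver that wires it into a job. The xmlinput.start/xmlinput.end configuration keys are what I recall the class using, and the <page> tag and mapper are hypothetical, so check the linked source for the exact constants.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlJobDriver {

  // Hypothetical mapper: each value is one complete <page>...</page> fragment.
  public static class PageMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text("page"), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the record reader which tags delimit one record (assumed key names).
    conf.set("xmlinput.start", "<page>");
    conf.set("xmlinput.end", "</page>");

    Job job = new Job(conf, "xml input example");
    job.setJarByClass(XmlJobDriver.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(PageMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}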
using 'fs -put' from datanode: all data written to that node's hdfs and not distributed
We are trying to load data into HDFS from one of the slaves, and when the put command is run from a slave (datanode), all of the blocks are written to that datanode's HDFS storage and not distributed to all of the nodes in the cluster. It does not seem to matter what destination format we use (/filename vs hdfs://master:9000/filename); it always behaves the same. Conversely, running the same command from the namenode distributes the files across the datanodes. Is there something I am missing? -Nathan
Setting different hadoop-env.sh for DataNode, TaskTracker
Can anyone suggest a way to set different hadoop-env.sh values for DataNode and TaskTracker without having to duplicate the whole Hadoop conf directory? For example, to set a different HADOOP_NICENESS for the DataNode and the TaskTracker. TIA, Matt Pouttu-Clarke, iCrossing
Re: Setting different hadoop-env.sh for DataNode, TaskTracker
On Tue, Jul 13, 2010 at 10:46 AM, Matt Pouttu-Clarke matt.pouttu-cla...@icrossing.com wrote: Can anyone suggest a way to set different hadoop-env.sh values for DataNode and TaskTracker without having to duplicate the whole Hadoop conf directory? For example, to set a different HADOOP_NICENESS for the DataNode and the TaskTracker.

hadoop-env.sh is a script file, so you are free to write arbitrary shell code:

if [ "$HOSTNAME" = "server1" ]; then
  dothis
else
  dothat
fi

This allows you to push one file to all systems, but now you manage scripts, not files.
Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed
Hi, I am a newbie. I am curious to know how you discovered that all the blocks are written to the datanode's HDFS storage. I thought replication by the namenode was transparent. Am I missing something? Thanks, Krishna
Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed
To test the block distribution, run the same put command from the NameNode and then again from a DataNode, and check the HDFS filesystem after both commands. In my case, a 2 GB file was distributed mostly evenly across the datanodes when the put was run on the NameNode, but was placed only on the DataNode where I ran the put command otherwise.
Re: WARN util.NativeCodeLoader: Unable to load native-hadoop library
On Jul 13, 2010, at 7:17 AM, Some Body wrote: I followed the steps from the native library guide

We need to rewrite that guide. It is pretty clear that we have overloaded the term "native libraries" enough that no one understands what anyone else is talking about.

1. put the OS's libz libs in
[r...@namenode]# pwd
/opt/hadoop/lib/native
[r...@namenode]# find . -name '*libz*'
./Linux-amd64-64/libz.so.1
./Linux-amd64-64/libz.so.1.2.1.2
./Linux-amd64-64/libz.so
./Linux-i386-32/libz.so.1
./Linux-i386-32/libz.so.1.2.1.2
./Linux-i386-32/libz.so

You should have libhadoop there, and it should be linked to libz. Run ldd against libhadoop and see what comes out.
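As a sanity check from the Java side, a small sketch like the one below (my own, not from this thread) reports whether the native hadoop library and native zlib were actually picked up; if I remember the API correctly, NativeCodeLoader.isNativeCodeLoaded() and ZlibFactory.isNativeZlibLoaded() are the relevant calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeLibCheck {
  public static void main(String[] args) {
    // True only if libhadoop was found on java.library.path and loaded.
    System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
    // True only if libhadoop itself could be linked against libz.
    System.out.println("native zlib loaded:   " + ZlibFactory.isNativeZlibLoaded(new Configuration()));
  }
}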
Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed
When you write from a machine that is running a datanode process, the data is *always* written locally first. This is an optimization for the MapReduce framework. The lesson here is that you should *never* use a datanode machine to load your data; always do it from outside the grid. Additionally, you can use fsck (filename) -files -locations -blocks to see where those blocks have been written.
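If you prefer to check programmatically rather than with fsck, a small sketch along these lines (my own, with a placeholder path) asks the NameNode where each block of a file ended up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder path; pass the file you loaded with fs -put.
    Path file = new Path(args[0]);
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each block reports the datanodes holding its replicas.
      System.out.println("offset " + block.getOffset() + " length " + block.getLength()
          + " hosts " + java.util.Arrays.toString(block.getHosts()));
    }
  }
}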
Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed
Oh. Thanks for the reply. Regards, Krishna
Why has Hadoop's release process slowed down?
Hello. As mentioned at http://wiki.apache.org/hadoop/Hbase/HBaseVersions, Hadoop has slowed down its development. I am interested in why this is happening, if it's true.
Debugging hadoop core
Hi, I am trying to debug the newly built hadoop-core-dev.jar in Eclipse. To simplify the debug process, I first set up Hadoop in single-node mode on my localhost.

a) Configure the debug launch in Eclipse.
Under tab main: project: hadoop-all; main class: org.apache.hadoop.util.RunJar.
Under tab arguments: program arguments: absolute path for wordcount jar file/wordcount.jar org.wordcount.WordCount input-text-file-already-in-hdfs (text) desired-output-file (output); VM arguments: -Xmx256M.
Under tab classpath: user entries: add external jar (hadoop-0.20.3-core-dev.jar) so that I can debug my newly built hadoop core jar.
Under tab source: I add the source folder for the wordcount example (so sources can be looked up during the debug process).
I apply this configuration and start the debug process.

b) The debugging works fine, and I can perform all debug operations. However, I get the following problem:

2010-07-14 00:02:15,816 WARN conf.Configuration (Configuration.java:clinit(176)) - DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
2010-07-14 00:02:16,535 INFO jvm.JvmMetrics (JvmMetrics.java:init(71)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread main org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/home/hadoop/code/hadoop-0.20.2/text
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.selfadjust.wordcount.WordCount.run(WordCount.java:32)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.selfadjust.wordcount.WordCount.main(WordCount.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

However, the file named text is already stored in HDFS. Could you please help me with the debugging process here? Any pointers to the debugging environment would be very helpful. Thanks, --PB
Re: Debugging hadoop core
Hello, when I copy the input file text into /home/hadoop/code/hadoop-0.20.2/text, the debugging works fine, except that Hadoop then reads and writes on the local file system. This is because the parameters I specify don't say that the file is in HDFS; they simply give the input and output filenames, so while debugging, Hadoop reads and writes from the local file system. How can I specify the input and output filenames as absolute HDFS paths for debugging purposes? Thanks, --PB
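One way to do this (a sketch of my own, not from this thread; the namenode host/port and paths are placeholders that would have to match fs.default.name) is to pass fully qualified hdfs:// URIs instead of bare filenames:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HdfsPathExample {
  public static void configurePaths(Job job) throws Exception {
    // Fully qualified URIs force HDFS regardless of the default filesystem in the classpath config.
    // Host, port and directories below are placeholders.
    FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hadoop/text"));
    FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/user/hadoop/output"));
  }
}

Passing the same hdfs:// URIs as the program arguments in the Eclipse launch configuration should achieve the same thing, provided the WordCount driver simply turns its arguments into Paths.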
Thoughts about Hadoop cluster hardware
So we're talking to Dell about their new PowerEdge C2100 servers for a Hadoop cluster, but I'm wondering: isn't this still a little overboard for nodes in a cluster? I'm wondering if we should buy, say, 100 PowerEdge 2750s instead of just 50 C2100s. The price would be about the same for the configuration we're talking about, and we would get twice as many nodes. I'm curious if any others are running Dell PowerEdge servers with Hadoop. We've also been kicking around the idea of going with blade servers (Dell and/or HP). Just curious. Thanks!!
Re: Debugging hadoop core
Find the hadoop-site.xml that Eclipse claimed was in your classpath. In the same directory, look for core-site.xml and add the following:

<property>
  <name>fs.default.name</name>
  <value>hdfs://sjc9-flash-grid04.ciq.com:9000</value>
</property>
Re: Thoughts about Hadoop cluster hardware
On Jul 13, 2010, at 5:00 PM, u235sentinel wrote: So we're talking to Dell about their new PowerEdge C2100 servers for a Hadoop cluster, but I'm wondering: isn't this still a little overboard for nodes in a cluster? I'm wondering if we should buy, say, 100 PowerEdge 2750s instead of just 50 C2100s. The price would be about the same for the configuration we're talking about, and we would get twice as many nodes.

Ultimately, it depends upon your job flow and how much data you have. FWIW, we're currently using a Sun equivalent of the C2100s with 8 of the 12 drive slots filled. You need a *LOT* of iops to make it worthwhile. [From what I've seen, even people who think they have a lot of iops generally have other problems with their code/tuning that are causing the iops. So even if you think you have a lot, you may not.]

I'm curious if any others are running Dell PowerEdge servers with Hadoop. We've also been kicking around the idea of going with blade servers (Dell and/or HP).

If you are thinking of a traditional blade setup where storage comes mainly from NAS or SAN, you are going to be very, very unhappy unless your data set is very, very tiny. Check out the PoweredBy page on the wiki; quite a few folks list their gear. FWIW, we're currently evaluating HP SLs and should be getting some Dell C6100s in soon, assuming Dell can deliver the eval unit on time.