hadoop on fedora 15
Hi,

Has anybody been able to run Hadoop in standalone mode on Fedora 15? I have installed it correctly. The job runs through the map phase but gets stuck in reduce and fails with the error:

    mapred.JobClient Status : FAILED Too many fetch-failures.

I read several articles on the net about this problem; all of them talk about /etc/hosts, and some say it is a firewall issue. I enabled the firewall for the port range and also checked my /etc/hosts file: its content is localhost, and that is the only line in it. Is Sun Java absolutely necessary, or will OpenJDK work? Can someone give me some suggestions for getting past this problem?

Thanks and regards,
Manish
Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
On Fri, 5 Aug 2011 08:50:02 +0800 (CST), Daniel,Wu hadoop...@163.com wrote:

The book also mentioned the value is mutable; I think the key might also be mutable, meaning that as we loop over each value in the Iterable<NullWritable>, the content of the key object is reset.

The mutability of the value is one of the weirdnesses of Hadoop you have to get used to, and one of the few times it becomes important that Java object semantics are pointer semantics. Anyway, it wouldn't surprise me if the key were also replaced on iteration, but I'd have to dig into the Hadoop code to check on that. Or you could!
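A minimal sketch of what that reuse means in practice, using the new Java API (hypothetical class name; Text and IntWritable chosen only for illustration): the framework hands the reduce loop the same Writable instances on every iteration, so anything you want to keep beyond the loop has to be copied rather than referenced.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CollectingReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    List<IntWritable> kept = new ArrayList<IntWritable>();
    for (IntWritable value : values) {
      // kept.add(value) would leave N references to one reused object;
      // copy the current contents instead.
      kept.add(new IntWritable(value.get()));
    }
    // Emit how many values were seen for this key.
    context.write(key, new IntWritable(kept.size()));
  }
}

The same copy-on-keep habit applies to the key if it also turns out to be reused across iterations.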
Re: hadoop on fedora 15
Disable iptables and try again.

On Fri, Aug 5, 2011 at 2:20 PM, Manish manish.iitg...@gmail.com wrote:
[...]

--
Join me at http://hadoopworkshop.eventbrite.com/
Re: hadoop on fedora 15
The Sun JDK is what Hadoop has been thoroughly tested on. You can perhaps run on OpenJDK, but YMMV.

Hadoop has a strict requirement of having a proper network setup before use. What port range did you open? The TaskTracker would use 50060 for intercommunication (over lo, if it's bound to that). Check whether your daemons are binding to the right interfaces and have proper name-to-IP resolution, and then check whether communication is allowed on that port.

On Fri, Aug 5, 2011 at 2:20 PM, Manish manish.iitg...@gmail.com wrote:
[...]

--
Harsh J
Re: Upload, then decompress archive on HDFS?
I suppose we could do with a simple identity-mapping / identity-reducing example or tool that can easily be reused for purposes such as these. Could you file a JIRA on this?

The -text is like -cat but has codec and some file-format detection. Hopefully it should work for your case.

On Fri, Aug 5, 2011 at 8:44 PM, Keith Wiley kwi...@keithwiley.com wrote:

I can envision an M/R job for the purpose of manipulating HDFS, such as (de)compressing files and resaving them back to HDFS. I just didn't think it should be necessary to *write a program* to do something so seemingly minimal. This (tarring/compressing/etc.) seems like an obvious method for moving data back and forth; I would expect the tools to support it.

I'll read up on -text. Maybe that really is what I wanted, although I'm dubious since this has nothing to do with textual data at all. Anyway, I'll see what I can find on that. Thanks.

On Aug 4, 2011, at 9:04 PM, Harsh J wrote:

Keith, the 'hadoop fs -text' tool does decompress a file given to it if needed/able, but what you could also do is run a distributed mapreduce job that converts from compressed to decompressed; that'd be much faster.

On Fri, Aug 5, 2011 at 4:58 AM, Keith Wiley kwi...@keithwiley.com wrote:

Instead of hd fs -put'ing hundreds of files of X megs, I want to do it once on a gzipped (or zipped) archive: one file, much smaller total megs. Then I want to decompress the archive on HDFS. I can't figure out what hd fs type of command would do such a thing. Thanks.

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com
It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow.

--
Keith Wiley

--
Harsh J
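For the distributed route suggested above, here is a rough sketch of an identity-style, map-only job in the new Java API (class names hypothetical). It assumes the compressed inputs are line-oriented text that TextInputFormat can decompress through its codec support; binary payloads or tar archives would need a different InputFormat, so treat this as an outline rather than a drop-in tool.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DecompressToHdfs {

  // Map-only "cat": drop the byte-offset key so each output file holds
  // only the decompressed lines of its input file.
  public static class CatMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), line);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "decompress-to-hdfs");
    job.setJarByClass(DecompressToHdfs.class);

    job.setInputFormatClass(TextInputFormat.class);   // decompresses .gz input via codecs
    job.setOutputFormatClass(TextOutputFormat.class); // writes plain, uncompressed text
    job.setMapperClass(CatMapper.class);
    job.setNumReduceTasks(0);                         // map output goes straight to HDFS

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a directory of .gz files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // decompressed copies land here

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}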
streaming cacheArchive shared libraries
I can use cacheFile to load .so files into the distributed cache and it works fine (the streaming executable links against the .so and runs), but I can't get it to work with -cacheArchive. It always says it can't find the .so file.

I realize that if you jar a directory, the directory will be recreated when you unjar, but I've tried jarring a file directly. It is easily verified that unjarring such a file reproduces the original file as a sibling of the jar file itself. So it seems to me that cacheArchive should have transferred the jar file to the cwd of my task, unjarred it, and produced a .so file right there, but it doesn't link up with the executable. Like I said, I know this basic approach works just fine with cacheFile. What could be the problem here?

I can't easily see the files on the cluster since it is a remote cluster with limited access. I don't believe I can ssh to any individual machine to investigate the files that are created for a task...but I think I have worked through the process logically and I'm not sure what I'm doing wrong. Thoughts?

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com
Luminous beings are we, not this crude matter. -- Yoda
Re: streaming cacheArchive shared libraries
Quick followup. I replaced the true mapper with a little Python script that just lists the cwd's contents and dumps them to the streaming output (stderr). Oddly, it doesn't look like the .jar file was unpacked. I can see it there, but not the unpacked version, so it looks like -cacheArchive transferred the file but didn't unjar it. Anyone ever seen something like this before?

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com
I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
Re: streaming cacheArchive shared libraries
Hi Keith,

I have tried the exact use case you have mentioned and it works fine for me. Below is the command line for the same:

[ramya]$ jar vxf samplelib.jar
  created: META-INF/
 inflated: META-INF/MANIFEST.MF
 inflated: libhdfs.so

[ramya]$ hadoop dfs -put samplelib.jar samplelib.jar

[ramya]$ hadoop jar hadoop-streaming.jar -input InputDir -mapper "ls testlink/libhdfs.so" -reducer NONE -output out -cacheArchive hdfs://namenode:port/user/ramya/samplelib.jar#testlink

[ramya]$ hadoop dfs -cat out/*
testlink/libhdfs.so
testlink/libhdfs.so
testlink/libhdfs.so

Hope it helps.

Thanks,
Ramya

On 8/5/11 10:10 AM, Keith Wiley kwi...@keithwiley.com wrote:
[...]
Re: streaming cacheArchive shared libraries
Okay, I think I understand. The symlink name that follows the pound sign in the -cacheArchive directive isn't the name given to the transferred jar file; it is the name of a directory into which the .jar file is placed and then unjarred. So it doesn't act like jar would on a local machine, where files are recreated at the current directory level; rather, everything is pushed down by one level. With a corresponding cmdenv flag to point LD_LIBRARY_PATH at the correct location, I think I can get it to find the shared libraries now.

On Aug 5, 2011, at 10:27 , Keith Wiley wrote:
[...]

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com
And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson
Re: streaming cacheArchive shared libraries
Right, so it was pushed down a level into the testlink directory. That's why my shared libraries were not linking properly to my mapper executable. I can fix that by using cmdenv to redirect LD_LIBRARY_PATH. I think that'll work.

On Aug 5, 2011, at 10:44 , Ramya Sunil wrote:
[...]

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com
I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
Order of Operations
Hi,

According to the attached image found on Yahoo's Hadoop tutorial (http://developer.yahoo.com/hadoop/tutorial/module4.html), the order of operations is map, combine, partition, which should be followed by reduce.

Here is an example key emitted by the map operation:

LongValueSum:geo_US|1311722400|E 1

This should get combined with other keys as geo_US|1311722400|E 100 (assuming there are 100 keys of the same type).

Then I'd like to partition the keys by the value before the first pipe (|), i.e. geo_US (see
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29).

So here's my streaming command:

hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-0.20.203.0.jar \
  -D mapred.reduce.tasks=8 \
  -D stream.num.map.output.key.fields=1 \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -D stream.map.output.field.separator=\| \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer.py \
  -combiner org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input input_file \
  -output output_path

This is the error I get:

java.lang.NumberFormatException: For input string: "1311722400|E1"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:419)
    at java.lang.Long.parseLong(Long.java:468)
    at org.apache.hadoop.mapred.lib.aggregate.LongValueSum.addNextValue(LongValueSum.java:48)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer.reduce(ValueAggregatorReducer.java:59)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer.reduce(ValueAggregatorReducer.java:35)
    at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1349)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1435)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1297)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)

I think it's because the partitioner is running before the combiner. Any thoughts?

--
Regards,
Premal Shah.
cmdenv LD_LIBRARY_PATH
I know you can do something like this:

-cmdenv LD_LIBRARY_PATH=./my_libs

if you have shared libraries in a subdirectory under the cwd (such as occurs when using -cacheArchive to load and unpack a jar full of .so files into the distributed cache)...but this destroys the existing path. I think I want something more like this:

-cmdenv LD_LIBRARY_PATH=./my_libs:$LD_LIBRARY_PATH

but the shell interprets the environment variable as it constructs the command. It uses the local version of the variable and expands it as it builds the hadoop command; it doesn't send the $ version to Hadoop to be expanded at that later time. Is this something that can be fixed by some combination of single, double, and back quotes and backslashes? I'm uncertain of the proper sequence.

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com
The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also. -- Mark Twain
Hadoop order of operations
According to the attached image found on Yahoo's Hadoop tutorial, the order of operations is map, combine, partition, which should be followed by reduce.

Here is an example key emitted by the map operation:

LongValueSum:geo_US|1311722400|E1

Assuming there are 100 keys of the same type, this should get combined as geo_US|1311722400|E 100.

Then I'd like to partition the keys by the value before the first pipe (|), i.e. geo_US (see
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29).

Here's the streaming command:

hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-0.20.203.0.jar \
  -D mapred.reduce.tasks=8 \
  -D stream.num.map.output.key.fields=1 \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -D stream.map.output.field.separator=\| \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer.py \
  -combiner org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -input input_file \
  -output output_path

This is the error I get:

java.lang.NumberFormatException: For input string: "1311722400|E1"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:419)
    at java.lang.Long.parseLong(Long.java:468)
    at org.apache.hadoop.mapred.lib.aggregate.LongValueSum.addNextValue(LongValueSum.java:48)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer.reduce(ValueAggregatorReducer.java:59)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer.reduce(ValueAggregatorReducer.java:35)
    at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1349)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1435)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1297)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)

It looks like the partitioner is running before the combiner. Any thoughts?

--
View this message in context: http://old.nabble.com/Hadoop-order-of-operations-tp32205781p32205781.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: maprd vs mapreduce api
The Mapper and Reducer classes in org.apache.hadoop.mapreduce implement the identity function. So you should be able to just do

job.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class);
job.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class);

without having to implement your own no-op classes. I recommend reading the javadoc for differences between the old API and the new API; for example, http://hadoop.apache.org/common/docs/r0.20.2/api/index.html describes the different functionality of Mapper in the new API and its dual use as the identity mapper.

Cheers,
--Keith

On Aug 5, 2011, at 1:15 PM, garpinc wrote:

I was following this tutorial on version 0.19.1: http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html

However, I wanted to use the latest version of the API, 0.20.2. The original code in the tutorial had the following lines:

conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

Both Identity classes are deprecated, so it seemed the solution was to create a mapper and reducer as follows:

public static class NOOPMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
  public void map(Text key, IntWritable value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);
  }
}

public static class NOOPReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(key, result);
  }
}

And then with code:

Configuration conf = new Configuration();
Job job = new Job(conf, "testdriver");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path("In"));
FileOutputFormat.setOutputPath(job, new Path("Out"));
job.setMapperClass(NOOPMapper.class);
job.setReducerClass(NOOPReducer.class);
job.waitForCompletion(true);

However I get this message:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text
    at TestDriver$NOOPMapper.map(TestDriver.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
11/08/01 16:41:01 INFO mapred.JobClient: map 0% reduce 0%
11/08/01 16:41:01 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 16:41:01 INFO mapred.JobClient: Counters: 0

Can anyone tell me what I need for this to work? Attached is the full code:
http://old.nabble.com/file/p32174859/TestDriver.java TestDriver.java

--
View this message in context: http://old.nabble.com/maprd-vs-mapreduce-api-tp32174859p32174859.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
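On the ClassCastException in the quoted code: with TextInputFormat the mapper receives a LongWritable byte offset and a Text line, so a hand-written pass-through mapper has to declare those input types. A minimal sketch (hypothetical class name) that stays compatible with the Text/IntWritable output classes set in the driver above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input types match what TextInputFormat delivers; output types match
// setOutputKeyClass(Text.class) / setOutputValueClass(IntWritable.class).
public class LineMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Emit each line as the key with a constant count.
    context.write(line, ONE);
  }
}

If the identity Mapper suggested above is used with TextInputFormat instead, the map output types are LongWritable/Text, so the job's output key and value classes have to be changed to match.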
Re: maprd vs mapreduce api
On Fri, Aug 5, 2011 at 3:42 PM, Stevens, Keith D. steven...@llnl.gov wrote:
[...]

Sorry for asking on this thread :) Does the Definitive Guide, 2nd edition, cover the new API?
java.io.IOException: config()
Hi,

I have been stuck with this exception:

java.io.IOException: config()
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:211)
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:198)
    at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:99)
    at test.TestApp.main(TestApp.java:19)

05Aug2011 20:08:53,303 DEBUG [LeaseChecker@DFSClient[clientName=DFSClient_-1591195062, ugi=jagarandas,staff,com.apple.sharepoint.group.1,_developer,_lpoperator,_lpadmin,_appserveradm,admin,_appserverusr,localaccounts,everyone,fmsadmin,com.apple.access_screensharing,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3]: java.lang.Throwable: for testing
05Aug2011 20:08:53,315 DEBUG [listenerContainer-1] (DFSClient.java:3012) - DFSClient writeChunk allocating new packet seqno=0, src=/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_222812011-08-05-20-08-52, packetSize=65557, chunksPerPacket=127, bytesCurBlock=0

I looked at the source code:

public Configuration(boolean loadDefaults) {
  this.loadDefaults = loadDefaults;
  if (LOG.isDebugEnabled()) {
    LOG.debug(StringUtils.stringifyException(new IOException("config()")));
  }
  synchronized (Configuration.class) {
    REGISTRY.put(this, null);
  }
}

The log is in debug mode. Can anyone please help me with this?

Regards,
JD
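Judging from the constructor quoted above, the config() IOException is created only so that Configuration can log where it was constructed when DEBUG logging is enabled; by itself it does not appear to indicate a failure. Assuming log4j is the logging backend (the usual setup for Hadoop of that vintage), a small sketch of quieting that particular logger programmatically; the equivalent log4j.properties entry is shown as a comment.

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class QuietConfigurationLogging {
  public static void main(String[] args) {
    // Equivalent to: log4j.logger.org.apache.hadoop.conf.Configuration=INFO
    Logger.getLogger("org.apache.hadoop.conf.Configuration").setLevel(Level.INFO);

    // Constructing a Configuration after this no longer prints the
    // "java.io.IOException: config()" stack trace, because LOG.isDebugEnabled()
    // is now false for that class.
    new org.apache.hadoop.conf.Configuration();
  }
}

Setting the level in log4j.properties rather than in code is the more common approach; the snippet above is only meant to show which logger is responsible.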
Re: Hadoop order of operations
Premal,

I didn't go through your entire thread, but the right order is: map (N) -> partition (N) -> combine (0…N).

On Sat, Aug 6, 2011 at 4:04 AM, Premal premal.j.s...@gmail.com wrote:
[...]

--
Harsh J
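Not the streaming aggregate combiner used in the commands quoted in this thread, but a minimal new-API Java sketch (hypothetical class name) of the contract that ordering implies: since the combine step may run zero or more times on already-partitioned map output, it has to be an associative aggregation whose input and output types both match the map output types.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Safe to run zero, one, or many times before the reduce phase:
// it only folds the values for a key into a partial sum of the same type.
public class SumCombiner
    extends Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable count : counts) {
      sum += count.get();
    }
    context.write(key, new LongWritable(sum));
  }
}

The stack trace in the quoted message shows the combiner being invoked from sortAndSpill, i.e. during a map-side spill, which is consistent with this ordering.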
Re: java.io.IOException: config() IMP
Hi,

I am using CDH3. I need to stream a huge amount of data from our application to Hadoop. I am opening a connection like this:

config.set("fs.default.name", hdfsURI);
FileSystem dfs = FileSystem.get(config);
String path = hdfsURI + connectionKey;
Path destPath = new Path(path);
logger.debug("Path -- " + destPath.getName());
outStream = dfs.create(destPath);

and keeping the outStream open for some time, writing continuously through it, and then closing it. But it is throwing:

05Aug2011 21:36:48,550 DEBUG [LeaseChecker@DFSClient[clientName=DFSClient_218151655, ugi=jagarandas]: java.lang.Throwable: for testing
    at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.toString(DFSClient.java:1181)
    at org.apache.hadoop.util.Daemon.<init>(Daemon.java:38)
    at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.put(DFSClient.java:1094)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:547)
    at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:219)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:584)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:565)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:472)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:464)
    at com.abc.ireporter.common.persistence.ConnectionManager.createConnection(ConnectionManager.java:66)
    at com.abc.ireporter.common.persistence.HDPPersistor.writeToHDP(HDPPersistor.java:93)
    at com.abc.ireporter.datatransformer.translator.HDFSTranslator.persistData(HDFSTranslator.java:41)
    at com.abc.ireporter.datatransformer.adapter.TranslatorAdapter.processData(TranslatorAdapter.java:61)
    at com.abc.ireporter.datatransformer.DefaultMessageListener.persistValidatedData(DefaultMessageListener.java:276)
    at com.abc.ireporter.datatransformer.DefaultMessageListener.onMessage(DefaultMessageListener.java:93)
    at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:506)
    at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:463)
    at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:435)
    at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:322)
    at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:260)
    at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:944)
    at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:868)
    at java.lang.Thread.run(Thread.java:680)
] (RPC.java:230) - Call: renewLease 4
05Aug2011 21:36:48,550 DEBUG [listenerContainer-1] (DFSClient.java:3274) - DFSClient writeChunk allocating new packet seqno=0, src=/home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819, packetSize=65557, chunksPerPacket=127, bytesCurBlock=0
05Aug2011 21:36:48,551 DEBUG [Thread-11] (DFSClient.java:2499) - Allocating new block
05Aug2011 21:36:48,552 DEBUG [sendParams-0] (Client.java:761) - IPC Client (47) connection to localhost/127.0.0.1:8020 from jagarandas sending #3
05Aug2011 21:36:48,553 DEBUG [IPC Client (47) connection to localhost/127.0.0.1:8020 from jagarandas] (Client.java:815) - IPC Client (47) connection to localhost/127.0.0.1:8020 from jagarandas got value #3
05Aug2011 21:36:48,556 DEBUG [Thread-11] (RPC.java:230) - Call: addBlock 4
05Aug2011 21:36:48,557 DEBUG [Thread-11] (DFSClient.java:3094) - pipeline = 127.0.0.1:50010
05Aug2011 21:36:48,557 DEBUG [Thread-11] (DFSClient.java:3102) - Connecting to 127.0.0.1:50010
05Aug2011 21:36:48,559 DEBUG [Thread-11] (DFSClient.java:3109) - Send buf size 131072
05Aug2011 21:36:48,635 DEBUG [DataStreamer for file /home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819 block blk_-5183404460805094255_1042] (DFSClient.java:2533) - DataStreamer block blk_-5183404460805094255_1042 wrote packet seqno:0 size:1522 offsetInBlock:0 lastPacketInBlock:true
05Aug2011 21:36:48,638 DEBUG [ResponseProcessor for block blk_-5183404460805094255_1042] (DFSClient.java:2640) - DFSClient Replies for seqno 0 are SUCCESS
05Aug2011 21:36:48,639 DEBUG [DataStreamer for file /home/hadoop/listenerContainer-1jagaran-dass-macbook-pro.local_247811312605307819 block blk_-5183404460805094255_1042] (DFSClient.java:2563) - Closing old block blk_-5183404460805094255_1042
05Aug2011 21:36:48,645 DEBUG [sendParams-0] (Client.java:761) - IPC Client (47) connection to localhost/127.0.0.1:8020 from jagarandas sending #4
05Aug2011 21:36:48,647 DEBUG [IPC Client (47) connection to localhost/127.0.0.1:8020 from jagarandas] (Client.java:815) - IPC Client (47) connection to
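For reference, a minimal sketch of the long-lived write pattern described above, with a hypothetical namenode URI and destination path: create the stream once, write and sync() periodically, and close at the end. As far as the quoted trace shows, the renewLease and DataStreamer lines in the DEBUG output are the client keeping that open stream alive rather than errors.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamWriter {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:8020"); // assumed namenode URI
    FileSystem fs = FileSystem.get(conf);

    Path dest = new Path("/home/hadoop/streamed-data.txt"); // hypothetical destination
    FSDataOutputStream out = fs.create(dest);
    try {
      for (int i = 0; i < 1000; i++) {
        out.write(("record " + i + "\n").getBytes("UTF-8"));
        if (i % 100 == 0) {
          // Push buffered bytes to the datanodes while keeping the stream open;
          // the client renews its lease in the background for as long as it is open.
          out.sync();
        }
      }
    } finally {
      out.close(); // completes the file and releases the lease
    }
  }
}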