reducers and data locality
Hello folks, I have a lot of input splits (10k-50k, 128 MB blocks) which contain text files. I need to process those line by line, then copy the result into shards of roughly equal size. So I generate a random key (from the range [0:numberOfShards]) which is used to route the map output to different reducers, and the shard sizes come out more or less equal. I know that this is not really efficient, and I was wondering if I could somehow control how keys are routed. For example, could I generate the randomKeys with hostname prefixes and control which keys are sent to each reducer? What do you think? Kind regards, Mete
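For reference, a minimal sketch of the random-key approach described above; the class name and the shards.count property are illustrative, not from the original post:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each processed line under a random shard key so the shuffle
// spreads the output roughly evenly across numberOfShards reducers.
public class RandomShardMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private final Random random = new Random();
    private final IntWritable shardKey = new IntWritable();
    private int numberOfShards;

    @Override
    protected void setup(Context context) {
        // Hypothetical job property; set it to the desired number of shards.
        numberOfShards = context.getConfiguration().getInt("shards.count", 10);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Text processed = process(line);                // the real line-by-line logic goes here
        shardKey.set(random.nextInt(numberOfShards));  // random key in [0, numberOfShards)
        context.write(shardKey, processed);
    }

    private Text process(Text line) {
        return line;                                   // placeholder
    }
}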
Re: Hbql with Hbase-0.90.4
Hi, I am trying to install Hbql on a pseudo-distributed node. I am not sure how to build *hbase-trx-0.90.0-DEV-2.jar* from the hbase-transactional package, which was downloaded from *https://github.com/hbase-trx/hbase-transactional-tableindexed*. I would appreciate your help on this. -- Thanks & Regards *Manu S* SI Engineer - OpenSource HPC Wipro Infotech Mob: +91 8861302855 Skype: manuspkd www.opensourcetalk.co.in
Re: reducers and data locality
Hi Mete, A custom Partitioner class can control the flow of keys to the desired reducer; it gives you finer control over which key goes to which reducer. Regards Bejoy KS Sent from handheld, please excuse typos. -----Original Message----- From: mete efk...@gmail.com Date: Fri, 27 Apr 2012 09:19:21 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: reducers and data locality [quoted message trimmed; identical to the original post above]
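For illustration, a minimal sketch of such a custom Partitioner under the shard-key scheme from the original post; the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each map-output key to a specific reducer. With the shard number
// used as the key, reducer i receives exactly the records for shard i.
public class ShardPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        return key.get() % numPartitions;  // any deterministic key-to-reducer mapping works here
    }
}

// Registered on the job with:
//   job.setPartitionerClass(ShardPartitioner.class);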
Re: Namenode not formatted after format
Unfortunately, in 1.x the format command's prompt is case-sensitive (fixed in 2.x). You had: Re-format filesystem in /app/hadoop/name ? (Y or N) y Format aborted in /app/hadoop/name Answer with a capital Y instead and it won't abort. On Fri, Apr 27, 2012 at 3:07 PM, Mathias Schnydrig smath...@ee.ethz.ch wrote: Hi, I am setting up a Hadoop cluster with 5 slaves and a master. After the single-node installation it worked fine, but since I went to a multi-node cluster the namenode prints the message below even after I format it (I have also added the config files). The folders I am using exist on all nodes, the hadoop folder is placed in the same location on all nodes, and the config files are all the same. I suppose it is some stupid mistake I made, as I am quite new to Hadoop. Regards Mathias hduser@POISN-server:/usr/local/hadoop$ bin/hadoop namenode -format Warning: $HADOOP_HOME is deprecated. 12/04/27 10:39:52 INFO namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = POISN-server/127.0.1.1 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 1.0.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2 -r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012 / Re-format filesystem in /app/hadoop/name ? (Y or N) y Format aborted in /app/hadoop/name 12/04/27 10:39:55 INFO namenode.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at POISN-server/127.0.1.1 / hduser@POISN-server:/usr/local/hadoop$ bin/hadoop namenode Warning: $HADOOP_HOME is deprecated. 12/04/27 10:40:04 INFO namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = POISN-server/127.0.1.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 1.0.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2 -r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012 / 12/04/27 10:40:04 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 12/04/27 10:40:04 INFO impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered. 12/04/27 10:40:04 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 12/04/27 10:40:04 INFO impl.MetricsSystemImpl: NameNode metrics system started 12/04/27 10:40:05 INFO impl.MetricsSourceAdapter: MBean for source ugi registered. 12/04/27 10:40:05 WARN impl.MetricsSystemImpl: Source name ugi already exists! 12/04/27 10:40:05 INFO impl.MetricsSourceAdapter: MBean for source jvm registered. 12/04/27 10:40:05 INFO impl.MetricsSourceAdapter: MBean for source NameNode registered. 12/04/27 10:40:05 INFO util.GSet: VM type = 64-bit 12/04/27 10:40:05 INFO util.GSet: 2% max memory = 17.77875 MB 12/04/27 10:40:05 INFO util.GSet: capacity = 2^21 = 2097152 entries 12/04/27 10:40:05 INFO util.GSet: recommended=2097152, actual=2097152 12/04/27 10:40:05 INFO namenode.FSNamesystem: fsOwner=hduser 12/04/27 10:40:05 INFO namenode.FSNamesystem: supergroup=supergroup 12/04/27 10:40:05 INFO namenode.FSNamesystem: isPermissionEnabled=true 12/04/27 10:40:05 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100 12/04/27 10:40:05 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s) 12/04/27 10:40:05 INFO namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean 12/04/27 10:40:05 INFO namenode.NameNode: Caching file names occuring more than 10 times 12/04/27 10:40:05 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: NameNode is not formatted. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:325) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100) ... 12/04/27 10:40:05 INFO namenode.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at POISN-server/127.0.1.1 / hduser@POISN-server:/usr/local/hadoop/conf$ cat *-site.xml (core-site.xml hdfs-site.xml mapred-site.xml) <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.default.name</name> <value>hdfs://ClusterMaster:9000</value> </property> </configuration> <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in
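To illustrate the fix from the reply above: the same session simply needs the prompt answered with a capital Y (output abbreviated; only the answer changes), after which the format is not aborted:

hduser@POISN-server:/usr/local/hadoop$ bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.
...
Re-format filesystem in /app/hadoop/name ? (Y or N) Y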
cygwin single node setup
Hi, I am pretty much a newbie and I am following the quick start guide for a single-node setup on Windows using Cygwin. In this step, $ bin/hadoop fs -put conf input, I am getting the following errors. There is no file at /user/EXT0125622/input/conf/capacity-scheduler.xml. That might be the reason for the errors I get, but why does Hadoop look for such a directory when I have not configured anything like that? So it seems Hadoop makes up this file and directory and then looks for it? Any idea and help is welcome. Cheers Onder 12/04/27 13:44:37 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/EXT0125622/input/conf/capacity-scheduler.xml could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at org.apache.hadoop.ipc.Client.call(Client.java:1066) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at $Proxy1.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3507) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3370) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2586) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2826) 12/04/27 13:44:37 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null 12/04/27 13:44:37 WARN hdfs.DFSClient: Could not get block locations. Source file /user/EXT0125622/input/conf/capacity-scheduler.xml - Aborting...
put: java.io.IOException: File /user/EXT0125622/input/conf/capacity-scheduler.xml could only be replicated to 0 nodes, instead of 1 12/04/27 13:44:37 ERROR hdfs.DFSClient: Exception closing file /user/EXT0125622/input/conf/capacity-scheduler.xml : org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/EXT0125622/input/conf/capacity-scheduler.xml could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/EXT0125622/input/conf/capacity-scheduler.xml could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
Re: DFSClient error
Can you run a regular 'hadoop fs' (put or ls or get) command? If yes, how about a wordcount example? '<path>/hadoop jar <path>/hadoop-*examples*.jar wordcount input output' -----Original Message----- From: Mohit Anchlia mohitanch...@gmail.com Reply-To: common-user@hadoop.apache.org common-user@hadoop.apache.org Date: Fri, 27 Apr 2012 14:36:49 -0700 To: common-user@hadoop.apache.org common-user@hadoop.apache.org Subject: Re: DFSClient error I even tried to reduce the number of jobs, but it didn't help. This is what I see: datanode logs: Initializing secure datanode resources Successfully obtained privileged resources (streaming port = ServerSocket[addr=/0.0.0.0,localport=50010] ) (http listener port = sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:50075]) Starting regular datanode initialization 26/04/2012 17:06:51 9858 jsvc.exec error: Service exit with a return value of 143 userlogs: 2012-04-26 19:35:22,801 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available 2012-04-26 19:35:22,801 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded 2012-04-26 19:35:22,808 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded initialized native-zlib library 2012-04-26 19:35:22,903 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /125.18.62.197:50010, add to deadNodes and continue java.io.EOFException at java.io.DataInputStream.readShort(DataInputStream.java:298) at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170) at java.io.DataInputStream.read(DataInputStream.java:132) at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:97) at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87) at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75) at java.io.InputStream.read(InputStream.java:85) at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205) at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169) at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114) at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) 2012-04-26 19:35:22,906 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /125.18.62.204:50010, add to deadNodes and continue java.io.EOFException namenode logs: 2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Job job_201204261140_0244 added successfully for user 'hadoop' to queue 'default' 2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201204261140_0244 2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.AuditLogger: USER=hadoop IP=125.18.62.196OPERATION=SUBMIT_JOB TARGET=job_201204261140_0244RESULT=SUCCESS 2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201204261140_0244 2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 125.18.62.198:50010 java.io.IOException: Bad connect ack with firstBadLink as 125.18.62.197:50010 2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2499580289951080275_22499 2012-04-26 16:12:53,582 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 125.18.62.197:50010 2012-04-26 16:12:53,594 INFO org.apache.hadoop.mapred.JobInProgress: jobToken generated and stored with users keys in /data/hadoop/mapreduce/job_201204261140_0244/jobToken 2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201204261140_0244 = 73808305. Number of splits = 1 2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress:
Re: DFSClient error
After all the jobs fail I can't run anything. Once I restart the cluster I am able to run other jobs with no problems; hadoop fs and other IO-intensive jobs run just fine. On Fri, Apr 27, 2012 at 3:12 PM, John George john...@yahoo-inc.com wrote: Can you run a regular 'hadoop fs' (put or ls or get) command? If yes, how about a wordcount example? '<path>/hadoop jar <path>/hadoop-*examples*.jar wordcount input output' [remainder of the quoted message, including the datanode, userlogs and namenode logs, is identical to the previous post in this thread]
Node-wide Combiner
Hi all, I am a newbie to Hadoop and I like the system. I have one question: is there a node-wide combiner or something similar in Hadoop? I think it could reduce the number of intermediate results even further. Any hint? Thanks a lot! Superymk
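For context, the combiner Hadoop ships with runs per map task (on that task's local output before the shuffle), not across all map tasks on a node. A minimal sketch of how that standard combiner is configured, using example classes from the Hadoop libraries; the driver class and job name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Word count with a per-map-task combiner: the same Reducer class is reused
// as the combiner, so partial sums are merged before data crosses the network.
public class CombinerExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount-with-combiner");
        job.setJarByClass(CombinerExample.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combining happens per map task, before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}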