I should add that when I run my job, it indicates it could not find the input files:

[EMAIL PROTECTED] jeastman]$ $HADOOP_INSTALL/bin/hadoop jar ~/access0.jar com.collabnet.hadoop.access.Access0Driver ecn/access ecn-out
08/01/21 10:59:39 INFO mapred.FileInputFormat: Total input paths to process: 0
08/01/21 10:59:46 INFO mapred.JobClient: Running job: job_200801182307_0005
08/01/21 10:59:47 INFO mapred.JobClient:  map 100% reduce 100%
08/01/21 10:59:48 INFO mapred.JobClient: Job complete: job_200801182307_0005
08/01/21 10:59:49 INFO mapred.JobClient: Counters: 0

I tried using full paths for them (/users/jeastman/...) but that throws 'input path does not exist' errors.

Jeff
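A minimal sketch, assuming the 0.15-era DFS shell commands: 'ecn/access' is a relative path, so the job client resolves it against the submitting user's DFS working directory (typically /user/<username>); listing that path and the full tree shows where the uploaded files actually landed.

  $HADOOP_INSTALL/bin/hadoop dfs -ls ecn/access   # the input path exactly as the job resolves it
  $HADOOP_INSTALL/bin/hadoop dfs -lsr /           # recursive listing, to find the absolute path of the uploaded files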
-----Original Message-----
From: Jeff Eastman [mailto:[EMAIL PROTECTED]
Sent: Monday, January 21, 2008 11:15 AM
To: hadoop-user@lucene.apache.org
Subject: RE: Platform reliability with Hadoop

Is it really that simple? The Wiki page GettingStartedWithHadoop recommends setting dfs.name.dir, dfs.data.dir, dfs.client.buffer.dir and mapred.local.dir to "appropriate" values (without giving an example). Should these be fixed (XX) or variable (XX-${user.name}) values? The FAQ page recommends setting mapred.system.dir to a fixed value (e.g. /hadoop/mapred/system), so I chose fixed values too:

- dfs.name.dir - /u1/cloud-data
- dfs.data.dir - /u1/cloud-data
- mapred.system.dir - /u1/cloud-data
- mapred.local.dir - /u1/cloud-data

I did not override dfs.client.buffer.dir ("Determines where on the local filesystem an DFS client should store its blocks before it sends them to the datanode") because my 'jeastman' client could not put data into the dfs with it set to the fixed value. There are 4 other settings that use ${hadoop.tmp.dir}, and these seem appropriately tmp-ish. I did not redefine them:

- fs.trash.root - The trash directory, used by FsShell's 'rm' command.
- fs.checkpoint.dir - Determines where on the local filesystem the DFS secondary name node should store the temporary images and edits to merge.
- fs.s3.buffer.dir - Determines where on the local filesystem the S3 filesystem should store its blocks before it sends them to S3 or after it retrieves them from S3.
- mapred.temp.dir - A shared directory for temporary files.

Jeff
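A minimal sketch of a related check (an assumption on my part, not something confirmed in the thread): with a fixed dfs.client.buffer.dir, every user who submits jobs or copies data into the dfs needs write access to that same local directory on the client machine, so it is worth verifying as 'jeastman':

  ls -ld /u1/cloud-data                                          # owned by 'hadoop'?
  touch /u1/cloud-data/perm-test && rm /u1/cloud-data/perm-test  # can an ordinary user write here?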
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Sunday, January 20, 2008 11:44 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Platform reliability with Hadoop

You might want to change the hadoop.tmp.dir entry alone. Since the others are derived from it, everything should be fine. I am wondering if hadoop.tmp.dir might be used elsewhere.

Thanks,
lohit

----- Original Message ----
From: Jeff Eastman <[EMAIL PROTECTED]>
To: hadoop-user@lucene.apache.org
Sent: Sunday, January 20, 2008 11:05:28 AM
Subject: RE: Platform reliability with Hadoop

I am almost operational again but something in my configuration is still not quite right. Here's what I did:

- I created a directory /u1/cloud-data on every machine's local disk
- I created a new user 'hadoop' who owns cloud-data
- I used that directory to replace the hadoop.tmp.dir entries for:
  - mapred.system.dir
  - mapred.local.dir
  - dfs.name.dir
  - dfs.data.dir
- The other tmp.dir config entries are unchanged
- The hadoop_install directory is NFS mounted on all machines
- My name node is on cu027 and my job tracker is on cu063
- I launched the dfs and mapred processes as 'hadoop'
- I uploaded my data to the dfs as user 'jeastman'
- The files are visible in /users/jeastman when I ls as 'jeastman'
- When I submit a job as 'jeastman' that used to run, it runs but cannot locate any input data, so it quits immediately with this in the Map Completion Graph display:

  XML Parsing Error: no element found
  Location: http://cu063.cubit.sp.collab.net:50030/taskgraph?type=map&jobid=job_200801182307_0003
  Line Number 1, Column 1:

I've attached my site.xml file.

Jeff
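A minimal sketch, assuming the 0.15-era DFS shell and dfsadmin tool, of one way to confirm the uploaded files are really in HDFS (i.e. visible to the namenode on cu027) and not only on the local or NFS-mounted filesystem, which a plain 'ls' would also show:

  ls /users/jeastman                                    # local/NFS view only
  $HADOOP_INSTALL/bin/hadoop dfs -lsr /users/jeastman   # what HDFS itself contains (path as given above)
  $HADOOP_INSTALL/bin/hadoop dfsadmin -report           # are the datanodes registered and holding blocks?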
-----Original Message-----
From: Jason Venner [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 16, 2008 10:04 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Platform reliability with Hadoop

The /tmp default has caught us once or twice too. Now we put the files elsewhere.

[EMAIL PROTECTED] wrote:
>> The DFS is stored in /tmp on each box.
>> The developers who own the machines occasionally reboot and reprofile them
>>
> Won't you lose your blocks after reboot since /tmp gets cleaned up? Could this
> be the reason you see data corruption? A good idea is to configure DFS to be
> any place other than /tmp.
>
> Thanks,
> Lohit
>
> ----- Original Message ----
> From: Jeff Eastman <[EMAIL PROTECTED]>
> To: hadoop-user@lucene.apache.org
> Sent: Wednesday, January 16, 2008 9:32:41 AM
> Subject: Platform reliability with Hadoop
>
> I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen machines
> in our CUBiT array for the last month. During this time I have experienced two
> major data corruption losses on relatively small amounts of data (<50gb) that
> make me wonder about the suitability of this platform for hosting Hadoop.
> CUBiT is one of our products for managing a pool of development servers,
> allowing developers to check out machines, install various OS profiles on them
> and monitor their utilization via the web. With most machines reporting very
> low utilization it seemed a natural place to run Hadoop in the background. I
> have an NFS-mounted account on all of the machines and have installed Hadoop
> there. The DFS is stored in /tmp on each box. The developers who own the
> machines occasionally reboot and reprofile them, but this occurs infrequently
> and does not clobber /tmp. Hadoop is designed to deal with slave failures of
> this nature, though this platform may well be an acid test.
>
> My initial cloud was configured for a replication factor of 3 and I have
> increased that now to 4 in hopes of improving data reliability in the face of
> these more-prevalent slave outages. Ted Dunning has suggested aggressive
> rebalancing in his recent posts and I have done this by increasing replication
> to 5 (from 3) and then dropping it to 4. Are there other rebalancing or
> configuration techniques that might improve my data reliability? Or, is this
> platform just too unstable to be a good fit for Hadoop?
>
> Jeff

-----Inline Attachment Follows-----

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!--- global properties -->

<property>
  <name>mapred.system.dir</name>
  <value>/u1/cloud-data/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/u1/cloud-data/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate data
  files. May be a comma-separated list of directories on different devices in
  order to spread disk i/o. Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>The port that the MapReduce job tracker info webserver runs at.
  </description>
</property>

<property>
  <name>dfs.secondary.info.port</name>
  <value>50090</value>
  <description>The base number for the Secondary namenode info port.
  </description>
</property>

<property>
  <name>dfs.datanode.port</name>
  <value>50010</value>
  <description>The port number that the dfs datanode server uses as a starting
  point to look for a free port to listen on.
  </description>
</property>

<property>
  <name>dfs.info.port</name>
  <value>50070</value>
  <description>The base port number for the dfs namenode web ui.
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- file system properties -->

<property>
  <name>fs.default.name</name>
  <value>hdfs://cu027.cubit.sp.collab.net:54310</value>
  <description>The name of the default file system. A URI whose scheme and
  authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/u1/cloud-data/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node
  should store the name table. If this is a comma-delimited list of
  directories then the name table is replicated in all of the directories,
  for redundancy.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/u1/cloud-data/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks. If this is a comma-delimited list of directories,
  then data will be stored in all named directories, typically on different
  devices. Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.
  </description>
</property>

<property>
  <name>dfs.datanode.du.pct</name>
  <value>0.50f</value>
  <description>When calculating remaining space, only use this percentage of
  the real available space
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>4</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if
  replication is not specified in create time.
  </description>
</property>

<!-- map/reduce properties -->

<property>
  <name>mapred.job.tracker</name>
  <value>cu063.cubit.sp.collab.net:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If
  "local", then jobs are run in-process as a single map and reduce task.
  </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>31</value>
  <description>The default number of map tasks per job. Typically set to a
  prime several times greater than number of available hosts. Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>11</value>
  <description>The default number of reduce tasks per job. Typically set to a
  prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

</configuration>
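For reference, the raise-then-lower replication technique described in the January 16 message above would look roughly like this with the DFS shell (a hedged sketch; the path used and the availability of -setrep -R and fsck in this release are assumptions):

  $HADOOP_INSTALL/bin/hadoop dfs -setrep -R 5 /users/jeastman   # over-replicate to push extra copies onto other nodes
  # wait for the namenode to finish re-replicating, then drop back:
  $HADOOP_INSTALL/bin/hadoop dfs -setrep -R 4 /users/jeastman
  $HADOOP_INSTALL/bin/hadoop fsck /users/jeastman               # report missing or under-replicated blocks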