You might want to change the hadoop.tmp.dir entry alone. Since the others are derived from it, everything should be fine. I am wondering whether hadoop.tmp.dir might also be used elsewhere.

Thanks,
Lohit
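For example, a minimal hadoop-site.xml along those lines might look like the sketch below. This is only a sketch, and it assumes the stock hadoop-default.xml still derives dfs.name.dir, dfs.data.dir, mapred.system.dir and mapred.local.dir from ${hadoop.tmp.dir}; if so, it would yield the same /u1/cloud-data/... paths as the attached site.xml without listing them individually.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <!-- Sketch only: point hadoop.tmp.dir off /tmp and let the defaults derive
       the dfs and mapred directories from it (e.g. /u1/cloud-data/dfs/name,
       /u1/cloud-data/dfs/data, /u1/cloud-data/mapred/system,
       /u1/cloud-data/mapred/local). -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/u1/cloud-data</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>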
----- Original Message ----
From: Jeff Eastman <[EMAIL PROTECTED]>
To: hadoop-user@lucene.apache.org
Sent: Sunday, January 20, 2008 11:05:28 AM
Subject: RE: Platform reliability with Hadoop

I am almost operational again, but something in my configuration is still not quite right. Here's what I did:

- I created a directory /u1/cloud-data on every machine's local disk
- I created a new user 'hadoop' who owns cloud-data
- I used that directory to replace the hadoop.tmp.dir entries for:
  - mapred.system.dir
  - mapred.local.dir
  - dfs.name.dir
  - dfs.data.dir
- The other tmp.dir config entries are unchanged
- The hadoop_install directory is NFS mounted on all machines
- My name node is on cu027 and my job tracker is on cu063
- I launched the dfs and mapred processes as 'hadoop'
- I uploaded my data to the dfs as user 'jeastman'
- The files are visible in /users/jeastman when I ls as 'jeastman'
- When I submit a job as 'jeastman' that used to run, it runs but cannot locate any input data, so it quits immediately with this in the Map Completion Graph display:

XML Parsing Error: no element found
Location: http://cu063.cubit.sp.collab.net:50030/taskgraph?type=map&jobid=job_200801182307_0003
Line Number 1, Column 1:

I've attached my site.xml file.

Jeff

-----Original Message-----
From: Jason Venner [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 16, 2008 10:04 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Platform reliability with Hadoop

The /tmp default has caught us once or twice too. Now we put the files elsewhere.

[EMAIL PROTECTED] wrote:
>> The DFS is stored in /tmp on each box.
>> The developers who own the machines occasionally reboot and reprofile them
>
> Won't you lose your blocks after a reboot, since /tmp gets cleaned up? Could this be the reason you see data corruption?
> A good idea is to configure the DFS to be any place other than /tmp.
>
> Thanks,
> Lohit
>
> ----- Original Message ----
> From: Jeff Eastman <[EMAIL PROTECTED]>
> To: hadoop-user@lucene.apache.org
> Sent: Wednesday, January 16, 2008 9:32:41 AM
> Subject: Platform reliability with Hadoop
>
> I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen machines in our CUBiT array for the last month. During this time I have experienced two major data corruption losses on relatively small amounts of data (<50 GB) that make me wonder about the suitability of this platform for hosting Hadoop. CUBiT is one of our products for managing a pool of development servers, allowing developers to check out machines, install various OS profiles on them, and monitor their utilization via the web. With most machines reporting very low utilization, it seemed a natural place to run Hadoop in the background. I have an NFS-mounted account on all of the machines and have installed Hadoop there. The DFS is stored in /tmp on each box. The developers who own the machines occasionally reboot and reprofile them, but this occurs infrequently and does not clobber /tmp. Hadoop is designed to deal with slave failures of this nature, though this platform may well be an acid test.
>
> My initial cloud was configured for a replication factor of 3, and I have increased that now to 4 in hopes of improving data reliability in the face of these more-prevalent slave outages. Ted Dunning has suggested aggressive rebalancing in his recent posts, and I have done this by increasing replication to 5 (from 3) and then dropping it to 4.
> Are there other rebalancing or configuration techniques that might improve my data reliability? Or, is this platform just too unstable to be a good fit for Hadoop?
>
> Jeff

-----Inline Attachment Follows-----

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!--- global properties -->

<property>
  <name>mapred.system.dir</name>
  <value>/u1/cloud-data/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/u1/cloud-data/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate data
  files. May be a comma-separated list of directories on different devices
  in order to spread disk i/o. Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>The port that the MapReduce job tracker info webserver runs at.
  </description>
</property>

<property>
  <name>dfs.secondary.info.port</name>
  <value>50090</value>
  <description>The base number for the Secondary namenode info port.
  </description>
</property>

<property>
  <name>dfs.datanode.port</name>
  <value>50010</value>
  <description>The port number that the dfs datanode server uses as a starting
  point to look for a free port to listen on.
  </description>
</property>

<property>
  <name>dfs.info.port</name>
  <value>50070</value>
  <description>The base port number for the dfs namenode web ui.
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- file system properties -->

<property>
  <name>fs.default.name</name>
  <value>hdfs://cu027.cubit.sp.collab.net:54310</value>
  <description>The name of the default file system. A URI whose scheme and
  authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/u1/cloud-data/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node
  should store the name table. If this is a comma-delimited list of
  directories then the name table is replicated in all of the directories,
  for redundancy.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/u1/cloud-data/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks. If this is a comma-delimited list of directories,
  then data will be stored in all named directories, typically on different
  devices. Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.
  </description>
</property>

<property>
  <name>dfs.datanode.du.pct</name>
  <value>0.50f</value>
  <description>When calculating remaining space, only use this percentage of
  the real available space
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>4</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if
  replication is not specified in create time.
  </description>
</property>

<!-- map/reduce properties -->

<property>
  <name>mapred.job.tracker</name>
  <value>cu063.cubit.sp.collab.net:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If
  "local", then jobs are run in-process as a single map and reduce task.
  </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>31</value>
  <description>The default number of map tasks per job. Typically set to a
  prime several times greater than number of available hosts. Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>11</value>
  <description>The default number of reduce tasks per job. Typically set to a
  prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

</configuration>
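As an aside, the last two values above line up with the rule of thumb in their own descriptions for a cluster of about a dozen hosts: 31 is a prime a few times larger than 12, and 11 is a prime close to it. A hypothetical sketch for a 20-host cluster (illustrative numbers only, applying the same guideline) might be:

<!-- Hypothetical values for a 20-host cluster, following the rule of thumb
     above: maps = a prime several times the host count, reduces = a prime
     close to the host count. -->
<property>
  <name>mapred.map.tasks</name>
  <value>61</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>19</value>
</property>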