You might want to change the hadoop.tmp.dir entry alone. Since the others are derived from it, everything should be fine. I am wondering whether hadoop.tmp.dir might also be used elsewhere.

Thanks,
Lohit
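For example, a minimal hadoop-site.xml along those lines might look like the sketch below. This is only a sketch, and it assumes the stock hadoop-default.xml still derives dfs.name.dir, dfs.data.dir, mapred.system.dir and mapred.local.dir from ${hadoop.tmp.dir}; if so, it would yield the same /u1/cloud-data/... paths as the attached site.xml without listing them individually.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <!-- Sketch only: point hadoop.tmp.dir off /tmp and let the defaults derive
       the dfs and mapred directories from it (e.g. /u1/cloud-data/dfs/name,
       /u1/cloud-data/dfs/data, /u1/cloud-data/mapred/system,
       /u1/cloud-data/mapred/local). -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/u1/cloud-data</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>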
----- Original Message ----
From: Jeff Eastman <[EMAIL PROTECTED]>
To: hadoop-user@lucene.apache.org
Sent: Sunday, January 20, 2008 11:05:28 AM
Subject: RE: Platform reliability with Hadoop

I am almost operational again, but something in my configuration is still not quite right. Here's what I did:

- I created a directory /u1/cloud-data on every machine's local disk
- I created a new user 'hadoop' who owns cloud-data
- I used that directory to replace the hadoop.tmp.dir entries for:
  - mapred.system.dir
  - mapred.local.dir
  - dfs.name.dir
  - dfs.data.dir
- The other tmp.dir config entries are unchanged
- The hadoop_install directory is NFS mounted on all machines
- My name node is on cu027 and my job tracker is on cu063
- I launched the dfs and mapred processes as 'hadoop'
- I uploaded my data to the dfs as user 'jeastman'
- The files are visible in /users/jeastman when I ls as 'jeastman'
- When I submit a job as 'jeastman' that used to run, it runs but cannot locate any input data, so it quits immediately with this in the Map Completion Graph display:

XML Parsing Error: no element found
Location: http://cu063.cubit.sp.collab.net:50030/taskgraph?type=map&jobid=job_200801182307_0003
Line Number 1, Column 1:

I've attached my site.xml file.

Jeff

-----Original Message-----
From: Jason Venner [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 16, 2008 10:04 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Platform reliability with Hadoop

The /tmp default has caught us once or twice too. Now we put the files elsewhere.

[EMAIL PROTECTED] wrote:
>> The DFS is stored in /tmp on each box.
>> The developers who own the machines occasionally reboot and reprofile them
>
> Won't you lose your blocks after a reboot, since /tmp gets cleaned up? Could this be the reason you see data corruption?
> A good idea is to configure the DFS to be any place other than /tmp.
>
> Thanks,
> Lohit
>
> ----- Original Message ----
> From: Jeff Eastman <[EMAIL PROTECTED]>
> To: hadoop-user@lucene.apache.org
> Sent: Wednesday, January 16, 2008 9:32:41 AM
> Subject: Platform reliability with Hadoop
>
> I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen machines in our CUBiT array for the last month. During this time I have experienced two major data corruption losses on relatively small amounts of data (<50 GB) that make me wonder about the suitability of this platform for hosting Hadoop. CUBiT is one of our products for managing a pool of development servers, allowing developers to check out machines, install various OS profiles on them, and monitor their utilization via the web. With most machines reporting very low utilization, it seemed a natural place to run Hadoop in the background. I have an NFS-mounted account on all of the machines and have installed Hadoop there. The DFS is stored in /tmp on each box. The developers who own the machines occasionally reboot and reprofile them, but this occurs infrequently and does not clobber /tmp. Hadoop is designed to deal with slave failures of this nature, though this platform may well be an acid test.
>
> My initial cloud was configured for a replication factor of 3, and I have increased that now to 4 in hopes of improving data reliability in the face of these more-prevalent slave outages. Ted Dunning has suggested aggressive rebalancing in his recent posts, and I have done this by increasing replication to 5 (from 3) and then dropping it to 4.
> Are there other rebalancing or configuration techniques that might improve my data reliability? Or, is this platform just too unstable to be a good fit for Hadoop?
>
> Jeff

-----Inline Attachment Follows-----

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!--- global properties -->

<property>
  <name>mapred.system.dir</name>
  <value>/u1/cloud-data/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/u1/cloud-data/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate data
  files. May be a comma-separated list of directories on different devices
  in order to spread disk i/o. Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>The port that the MapReduce job tracker info webserver runs at.
  </description>
</property>

<property>
  <name>dfs.secondary.info.port</name>
  <value>50090</value>
  <description>The base number for the Secondary namenode info port.
  </description>
</property>

<property>
  <name>dfs.datanode.port</name>
  <value>50010</value>
  <description>The port number that the dfs datanode server uses as a starting
  point to look for a free port to listen on.
  </description>
</property>

<property>
  <name>dfs.info.port</name>
  <value>50070</value>
  <description>The base port number for the dfs namenode web ui.
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- file system properties -->

<property>
  <name>fs.default.name</name>
  <value>hdfs://cu027.cubit.sp.collab.net:54310</value>
  <description>The name of the default file system. A URI whose scheme and
  authority determine the FileSystem implementation. The uri's scheme
  determines the config property (fs.SCHEME.impl) naming the FileSystem
  implementation class. The uri's authority is used to determine the host,
  port, etc. for a filesystem.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/u1/cloud-data/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node
  should store the name table. If this is a comma-delimited list of
  directories then the name table is replicated in all of the directories,
  for redundancy.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/u1/cloud-data/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks. If this is a comma-delimited list of directories,
  then data will be stored in all named directories, typically on different
  devices. Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.
  </description>
</property>

<property>
  <name>dfs.datanode.du.pct</name>
  <value>0.50f</value>
  <description>When calculating remaining space, only use this percentage of
  the real available space
  </description>
</property>

<property>
  <name>dfs.replication</name>
  <value>4</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if
  replication is not specified in create time.
  </description>
</property>

<!-- map/reduce properties -->

<property>
  <name>mapred.job.tracker</name>
  <value>cu063.cubit.sp.collab.net:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If
  "local", then jobs are run in-process as a single map and reduce task.
  </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>31</value>
  <description>The default number of map tasks per job. Typically set to a
  prime several times greater than number of available hosts. Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>11</value>
  <description>The default number of reduce tasks per job. Typically set to a
  prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

</configuration>
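As an aside, the last two values above line up with the rule of thumb in their own descriptions for a cluster of about a dozen hosts: 31 is a prime a few times larger than 12, and 11 is a prime close to it. A hypothetical sketch for a 20-host cluster (illustrative numbers only, applying the same guideline) might be:

<!-- Hypothetical values for a 20-host cluster, following the rule of thumb
     above: maps = a prime several times the host count, reduces = a prime
     close to the host count. -->
<property>
  <name>mapred.map.tasks</name>
  <value>61</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>19</value>
</property>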