Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "TestFaqPage" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/TestFaqPage?action=diff&rev1=2&rev2=3

--------------------------------------------------

   * [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/c++/libhdfs|libhdfs]], a JNI-based C API for talking to hdfs (only).
   * [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html|Hadoop Pipes]], a [[http://www.swig.org/|SWIG]]-compatible C++ API (non-JNI) to write map-reduce jobs.
  
- <<BR>> <<Anchor(2.2)>> '''2. [[#A2.2|What is the Distributed Cache used 
for?]]'''
+ == What is the Distributed Cache used for? ==
  
  The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework copies the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are only copied once per job, so they should not be modified by the application.
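+ 
+ For illustration, a minimal sketch using the old mapred API (the job class, file path and variable names below are made up, not part of this FAQ):
+ 
+ {{{
+ import java.net.URI;
+ import org.apache.hadoop.filecache.DistributedCache;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.mapred.JobConf;
+ 
+ // Sketch only -- the job class and file path are illustrative.
+ // At job-submission time, register a read-only file that already lives on HDFS:
+ JobConf job = new JobConf(MyJob.class);
+ DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"), job);
+ 
+ // Inside a map/reduce task, look up the slave-local copies the framework made:
+ Path[] cached = DistributedCache.getLocalCacheFiles(job);
+ }}}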
  
- <<BR>> <<Anchor(2.3)>> '''3. [[#A2.3|Can I write create/write-to hdfs files 
directly from my map/reduce tasks?]]'''
+ == Can I create/write-to hdfs files directly from my map/reduce tasks? ==
  
  Yes. (Clearly, you want this since you need to create/write-to files other 
than the output-file written out by 
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html|OutputCollector]].)
  
  Caveats:
  
- <glossary>
- 
  ${mapred.output.dir} is the eventual output directory for the job 
([[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]]
 / 
[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath()|JobConf.getOutputPath]]).
  
  ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0); a TIP (task-in-progress) is the set of all attempts of a given task (e.g. task_200709221812_0001_m_000000).
  
- </glossary>
- 
  With ''speculative-execution'' '''on''', one could face issues with two instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on hdfs. Hence the app-writer will have to pick unique names (e.g. using the complete taskid, i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)
  
  To get around this, the framework helps the application-writer by maintaining a special '''${mapred.output.dir}/_${taskid}''' sub-directory on hdfs for each task-attempt, where the output of the reduce task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}; the framework discards the sub-directories of unsuccessful task-attempts. This is completely transparent to the application.
@@ -125, +121 @@

  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 
reduces) since output of the map, in that case, goes directly to hdfs.
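+ 
+ For illustration, a task could drop a side-file into its per-attempt directory roughly like this (a sketch only, using the old mapred API; the helper method and file name are made up, and it assumes the task-attempt id is exposed to the task as the mapred.task.id configuration property):
+ 
+ {{{
+ import java.io.IOException;
+ import org.apache.hadoop.fs.FSDataOutputStream;
+ import org.apache.hadoop.fs.FileSystem;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.mapred.JobConf;
+ 
+ // Called from inside a task with the task's JobConf; names are illustrative.
+ void writeSideFile(JobConf job) throws IOException {
+   String taskId = job.get("mapred.task.id");              // the task-attempt id, e.g. task_200709221812_0001_r_000000_0
+   Path attemptDir = new Path(job.getOutputPath(), "_" + taskId);
+   Path sideFile = new Path(attemptDir, "my-side-file");   // hypothetical file name
+   FSDataOutputStream out = FileSystem.get(job).create(sideFile);
+   out.writeBytes("side-effect output\n");
+   out.close();
+ }
+ }}}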
  
- <<BR>> <<Anchor(2.4)>> '''4. [[#A2.4|How do I get each of my maps to work on 
one complete input-file and not allow the framework to split-up my files?]]'''
+ == How do I get each of my maps to work on one complete input-file and not allow the framework to split up my files? ==
  
  Essentially a job's input is represented by the [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html|InputFormat]] (interface) / [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html|FileInputFormat]] (base class).
  
@@ -137, +133 @@

  
  The other, quick-fix option is to set [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.min.split.size|mapred.min.split.size]] to a large enough value.
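+ 
+ As a hedged sketch of the subclassing route (old mapred API; the class name is made up), an input format that never splits its files, so each map gets one complete file, could look like:
+ 
+ {{{
+ import org.apache.hadoop.fs.FileSystem;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.mapred.TextInputFormat;
+ 
+ // Each input file becomes exactly one split, so one map sees one whole file.
+ public class NonSplittableTextInputFormat extends TextInputFormat {
+   @Override
+   protected boolean isSplitable(FileSystem fs, Path file) {
+     return false;   // never split, regardless of file or block size
+   }
+ }
+ }}}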
  
- <<BR>> <<Anchor(2.5)>> '''5. [[#A2.5|Why I do see broken images in 
jobdetails.jsp page?]]'''
+ == Why do I see broken images in the jobdetails.jsp page? ==
  
  In hadoop-0.15, Map/Reduce task completion graphics were added. The graphs are produced as SVG (Scalable Vector Graphics) images, which are basically XML files embedded in the HTML content. The graphics have been tested successfully in Firefox 2 on Ubuntu and Mac OS. For other browsers, you may need to install an additional browser plugin to see the SVG images. Adobe's SVG Viewer can be found at http://www.adobe.com/svg/viewer/install/.
  
- <<BR>> <<Anchor(2.6)>> '''6. [[#A2.6|I see a maximum of 2 maps/reduces 
spawned concurrently on each TaskTracker, how do I increase that?]]'''
+ == I see a maximum of 2 maps/reduces spawned concurrently on each 
TaskTracker, how do I increase that? ==
  
  Use the configuration knobs [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.map.tasks.maximum|mapred.tasktracker.map.tasks.maximum]] and [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.reduce.tasks.maximum|mapred.tasktracker.reduce.tasks.maximum]] to control the number of maps/reduces spawned simultaneously on a !TaskTracker. By default each is set to ''2'', hence one sees a maximum of 2 maps and 2 reduces at any given time on a !TaskTracker.
  
  You can set these on a per-tasktracker basis to accurately reflect your hardware (i.e. set them to higher values on a beefier tasktracker, etc.).
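+ 
+ For example (the values are illustrative only; pick numbers that match your hardware), the hadoop-site.xml on a beefier tasktracker might contain:
+ 
+ {{{
+ <property>
+   <name>mapred.tasktracker.map.tasks.maximum</name>
+   <value>8</value>
+   <description>Maximum number of map tasks run simultaneously by this TaskTracker.</description>
+ </property>
+ <property>
+   <name>mapred.tasktracker.reduce.tasks.maximum</name>
+   <value>4</value>
+   <description>Maximum number of reduce tasks run simultaneously by this TaskTracker.</description>
+ </property>
+ }}}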
  
- <<BR>> <<Anchor(2.7)>> '''7. [[#A2.7|Submitting map/reduce jobs as a 
different user doesn't work.]]'''
+ == Submitting map/reduce jobs as a different user doesn't work. ==
  
  The problem is that you haven't configured your map/reduce system directory to a fixed value. The default works for single-node systems, but not for "real" clusters. I like to use:
  
@@ -159, +155 @@

     </description>
  </property>
  }}}
- Note that this directory is in your default file system and must be   
accessible from both the client and server machines and is typically   in HDFS.
+ Note that this directory is in your default file system and must be accessible from both the client and server machines and is typically in HDFS.
  
- <<BR>> <<Anchor(2.8)>> '''8. [[#A2.8|How do Map/Reduce InputSplit's handle 
record boundaries correctly?]]'''
+ == How do Map/Reduce InputSplits handle record boundaries correctly? ==
  
  It is the responsibility of the InputSplit's RecordReader to start and end at a record boundary. For SequenceFiles, every 2k bytes has a 20-byte '''sync''' mark between the records. These sync marks allow the RecordReader to seek to the start of the InputSplit (which contains a file, offset and length) and find the first sync mark after the start of the split. The RecordReader continues processing records until it reaches the first sync mark after the end of the split. The first split of each file naturally starts immediately and not after the first sync mark. In this way, it is guaranteed that each record will be processed by exactly one mapper.
  
  Text files are handled similarly, using newlines instead of sync marks.
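+ 
+ In rough pseudo-Java (a sketch of the idea only, not the actual SequenceFileRecordReader source; the method and parameter names are made up), the boundary handling looks like:
+ 
+ {{{
+ import java.io.IOException;
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.fs.FileSystem;
+ import org.apache.hadoop.io.SequenceFile;
+ import org.apache.hadoop.io.Writable;
+ import org.apache.hadoop.mapred.FileSplit;
+ 
+ // Read only the records belonging to one split of an uncompressed SequenceFile.
+ void readSplit(FileSystem fs, Configuration conf, FileSplit split,
+                Writable key, Writable value) throws IOException {
+   long start = split.getStart();
+   long end = start + split.getLength();
+   SequenceFile.Reader in = new SequenceFile.Reader(fs, split.getPath(), conf);
+   if (start > 0) {
+     in.sync(start);                       // jump to the first sync mark after 'start'
+   }
+   // A record belongs to this split if it begins before 'end', even if it extends past it.
+   while (in.getPosition() < end && in.next(key, value)) {
+     // process (key, value)
+   }
+   in.close();
+ }
+ }}}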
  
- <<BR>> <<Anchor(2.9)>> '''9. [[#A2.9|How do I change final output file name 
with the desired name rather than in partitions like part-00000, part-00001 
?]]'''
+ == How do I change the final output file name to a desired name rather than partitions like part-00000, part-00001? ==
  
  You can subclass the [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/OutputFormat.java?view=markup|OutputFormat.java]] class and write your own. You can look at the code of [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/TextOutputFormat.java?view=markup|TextOutputFormat]], [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/MultipleOutputFormat.java?view=markup|MultipleOutputFormat.java]], etc. for reference. It might be that you only need to make minor changes to one of the existing output format classes; in that case you can just subclass it and override the methods you need to change.
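+ 
+ For example, if the MultipleOutputFormat family fits your need, a small sketch (class name and naming scheme made up) is to subclass MultipleTextOutputFormat and override generateFileNameForKeyValue():
+ 
+ {{{
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
+ 
+ // Name each output file after its key instead of part-00000, part-00001, ...
+ public class NamedByKeyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
+   @Override
+   protected String generateFileNameForKeyValue(Text key, Text value, String name) {
+     return key.toString();   // made-up scheme: one output file per distinct key
+   }
+ }
+ }}}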
  
- <<BR>> <<Anchor(2.10)>> ''10. [[#A2.10|When writing a New InputFormat, what 
is the format for the array of string returned by 
InputSplit\#getLocations()?]]''
+ == When writing a new InputFormat, what is the format for the array of strings returned by InputSplit\#getLocations()? ==
  
  It appears that DatanodeID.getHost() is the standard place to retrieve this 
name, and the machineName variable, populated in DataNode.java\#startDataNode, 
is where the name is first set. The first method attempted is to get 
"slave.host.name" from the configuration; if that is not available, 
DNS.getDefaultHost is used instead.
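+ 
+ In other words, each entry is a bare host name. An illustrative getLocations() in a custom InputSplit (host names made up) might be:
+ 
+ {{{
+ // Illustrative only: the returned strings are bare host names (as from
+ // DatanodeID.getHost()), with no port or scheme attached.
+ public String[] getLocations() throws IOException {
+   return new String[] { "worker17.example.com", "worker42.example.com" };
+ }
+ }}}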
  
- <<BR>> <<Anchor(2.11)>> '''11. [[#A2.11|How do you gracefully stop a running 
job?]]'''
+ == How do you gracefully stop a running job? ==
  
+ {{{
  hadoop job -kill JOBID
+ }}}
  
+ = HDFS =
- 
- <<BR>> <<Anchor(3)>> [[#A3|HDFS]]
  
  <<BR>> <<Anchor(3.1)>> '''1. [[#A3.1|If I add new data-nodes to the cluster 
will HDFS move the blocks to the newly added nodes in order to balance disk 
space utilization between the nodes?]]'''
  
