[ https://issues.apache.org/jira/browse/HADOOP-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535725 ]

Doug Cutting commented on HADOOP-2046:
--------------------------------------

Overall this looks great.  A few comments:

- In Configuration.java, the first use of 'final' should be in italics, not 
bold, and the anchors in the headers should be done with <h4 id=foo>Foo</h4>.  
I also find that the links to String and Path mostly just introduce noise.  We 
might make the first reference to Path a link, but leave the rest as plain 
text: no one is going to click on that link to find out what a Java String is, 
nor do we need more than a single link to Path.
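For reference, a minimal sketch of the anchor convention being suggested (the
heading text and id "FinalParams" are made up here for illustration):

```java
/**
 * Provides access to configuration parameters.
 *
 * <h4 id="FinalParams">Final Parameters</h4>
 *
 * A parameter may be declared <i>final</i> (italics, per the comment
 * above).  The heading carries its anchor via the id attribute, so
 * links can target Configuration.html#FinalParams directly, instead
 * of using a separate <a name=...> element.
 */
public class Configuration { /* ... */ }
```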

- In JobClient.java, the anchors should be implemented with 'id='.  We should 
not mention HDFS here: the system directory could be in, e.g., KFS.  I would 
also leave the internally used file names "job.jar" and "job.xml" out of this 
description.  The list of things done should include 'submission of the job to 
the jobtracker'.  The steps you list are all preparations for that, but we 
don't want to forget that crucial step.  In the list of ways to handle job 
sequencing, it should be made more clear that these are alternatives: one 
should choose just one method.  Also, should we mention the jobcontrol stuff 
here?
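If it helps, the "pick exactly one" point might be made with a sketch along
these lines (using the JobClient/RunningJob methods as I recall them;
illustrative only, not tested):

```java
// Alternative 1: blocking call; returns only when the job completes.
JobClient.runJob(conf);

// Alternative 2: non-blocking submission, then poll for completion.
RunningJob job = new JobClient(conf).submitJob(conf);
while (!job.isComplete()) {
  Thread.sleep(1000);
}

// Alternative 3: use org.apache.hadoop.mapred.jobcontrol.JobControl
// to manage a set of jobs with dependencies.
```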

- In JobConf.java: the JobConf isn't XML.  It can be serialized as XML, but 
it's fundamentally a Map<String,String>, a Configuration.  We also have 
anchors here that should use 'id=', and mentions of HDFS that should instead 
just be to FileSystem (all FileSystems have a block size, which is used to 
generate splits).  And, instead of 'default InputFormat' we should say 
'standard file-based InputFormats'.  We should probably also include something 
at the top level in this class about the determination of the job jar file.
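The map-first, XML-only-as-serialization distinction could be illustrated with
a plain-Java sketch like the following (ConfSketch is a made-up name for
illustration, not a Hadoop class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical illustration (not Hadoop code): a "configuration" is
 * fundamentally an ordered map of string keys to string values; XML is
 * merely one way to serialize that map.
 */
public class ConfSketch {
  private final Map<String, String> props = new LinkedHashMap<>();

  public void set(String name, String value) { props.put(name, value); }

  public String get(String name) { return props.get(name); }

  /** Serialize the map in a Hadoop-style <configuration> XML layout. */
  public String toXml() {
    StringBuilder sb = new StringBuilder("<configuration>\n");
    for (Map.Entry<String, String> e : props.entrySet()) {
      sb.append("  <property><name>").append(e.getKey())
        .append("</name><value>").append(e.getValue())
        .append("</value></property>\n");
    }
    return sb.append("</configuration>").toString();
  }
}
```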



> Documentation: Hadoop Install/Configuration Guide and Map-Reduce User Manual
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-2046
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2046
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.14.2
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-2046_1_20071018.patch
>
>
> I'd like to put forward some thoughts on how to structure reasonably detailed 
> documentation for hadoop.
> Essentially I think of at least 3 different profiles to target:
> * hadoop-dev, folks who are actively involved improving/fixing hadoop.
> * hadoop-user
> ** mapred application writers and/or folks who directly use hdfs
> ** hadoop cluster administrators
> For this issue, I'd like to first target the latter category (admin and 
> hdfs/mapred user) - which, arguably, offers the biggest bang for the buck 
> right now. 
> There is a crying need to get user-level stuff documented, judging by the 
> sheer number of emails we get on the hadoop lists...
> ----
> *1. Installing/Configuration Guides*
> This set of documents caters to folks ranging from someone just playing with 
> hadoop on a single-node to operations teams who administer hadoop on several 
> nodes (thousands). To ensure we cover all bases I'm thinking along the lines 
> of:
> * _Download, install and configure hadoop_ on a single-node cluster: 
> including a few comments on how to run examples (word-count) etc.
> * *Admin Guide*: Install and configure a real, distributed cluster. 
> * *Tune Hadoop*: Separate sections on how to tune hdfs and map-reduce, 
> targeting power admins/users.
> I reckon most of this would be done via forrest, with appropriate links to 
> javadoc.
> ----
> *2. User Manual*
> This set is geared for people who use hdfs and/or map-reduce per se.  Stuff 
> to document:
> * Write a really simple mapred application, just fitting the blocks 
> together, i.e. maybe a walk-through of a couple of examples like word-count, 
> sort etc.
> * Detailed information on important map-reduce user-interfaces:
> ** JobConf
> ** JobClient
> ** Tool & ToolRunner
> ** InputFormat
> *** InputSplit
> *** RecordReader
> ** Mapper
> ** Reducer
> ** Reporter
> ** OutputCollector
> ** Writable
> ** WritableComparable
> ** OutputFormat
> ** DistributedCache
> * SequenceFile
> ** Compression types: NONE, RECORD, BLOCK
> * Hadoop Streaming
> * Hadoop Pipes
> I reckon most of this would end up in the javadocs, specifically 
> package.html, and some via forrest.
> ----
> Also, as discussed in HADOOP-1881, it would be quite useful to maintain 
> documentation per-release, even on the hadoop website i.e. we could have a 
> main documentation page link to documentation per-release and to the trunk.
> ----
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
