As far as I can see they are separate. The tutorials are clearly under
different subsections, and the Nutch 2.x docs have their own section
as well.
I made a quick review of the documentation. Details below:
http://wiki.apache.org/nutch/CommandLineOptions
Webgraph classes - Present in the docs but do not exist in Nutch 2.
Other Classes: CrawlDBScanner - Present in the docs but does not exist
in Nutch 2.
http://wiki.apache.org/nutch/NutchConfigurationFiles
Mentions files which don't exist in Nutch 2:
hadoop-site.xml
job.xml
mapred-default.xml
http://wiki.apache.org/nutch/IndexStructure
Refers to the index-extra and index-static plugins, but they aren't
available in Nutch 2.
http://wiki.apache.org/nutch/SetupProxyForNutch
Configure Nutch (Nutch 1.3)
The title suggests that this section refers to Nutch 1.x only, but I
think this property also exists in Nutch 2.
Nutch 2.x:
http://wiki.apache.org/nutch/Nutch2Architecture
As mentioned in its headline, the document is outdated. Maybe it should be removed?
http://wiki.apache.org/nutch/NewScoring - invalid in Nutch 2.1
http://wiki.apache.org/nutch/NewScoringIndexingExample - invalid in
Nutch 2.1
Now I see that in the Nutch 2.x section some pages are equivalents of
pages in the "Configuration" section.
Does this mean that the content in "Configuration" refers only to
Nutch 1.x? I'm not sure, because some pages from "Configuration" do
not appear in Nutch 2.x but seem valid for branch 2.
Release Report - http://s.apache.org/PGa
I did not notice this. The division into bugs and improvements is very nice.
If a new user looks at Nutch, he will not check the changelog but the
documentation.
Is this your opinion or are you commenting from a wider audience's
perspective?
Only my personal opinion and experience.
I think the new user should be provided with clear information
about which branch to choose.
I agree with this. This is why the lists exist. You can ask questions.
You can also read some archives. It takes a minimal, well-spent
investment of time to dig up what others have asked many, many times.
Don't get me wrong, I am all for informing people about the
software... however I am not in the immediate position to write a
decent quality book on Nutch which would do the community and software
justice. If you are then please do.
If I get enough mana with Nutch I will try :)
What is more, the docs should be divided into branch 1 and 2.
Please see the table of contents on the wiki. Please also see my
comments above.
As I described above, in my opinion the docs are mixed.
Pages could link together, but there should be a clean branch tree
in the docs, just like in source code. You do not mix packages from
two branches; you keep them in separate repos.
ditto
Later I will try to propose some documentation structure on the users
mailing list.
I don't think that documentation is essential for bug fixes. Only for
new features or refactoring. It doesn't have to be a big document.
It just has to exist.
But what happens if fixing a bug changes functionality? Then what?
If a feature is documented on the wiki and, while fixing it, the developer
changes its behaviour, the doc should be updated or at least marked as
outdated.
How else could it be done? If I read a doc, should I check its last
modification date and compare it with the dates of the issues related to it?
What is essential here is a good wiki structure. It should enable a developer
to quickly identify which pages are related to the issue.
I know that sometimes developers don't have time to create
documentation. But in such a case they should create a new task for
that doc. Otherwise nobody knows that the doc is missing and nobody can help.
Not true. All you need to do is request Karma for the project wiki and
you can contribute to whatever you feel is missing. I don't accept this
argument, sorry.
Is there any wiki todo list? I know that some pages are marked to be
cleaned up. But what about pages that should be created from scratch?
Do you think the JIRA documentation component could be used for that?
(https://issues.apache.org/jira/issues?jql=project%20%3D%20NUTCH%20AND%20component%20%3D%20documentation)
Maybe we should mention this path on the wiki?
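For reference, the link above is just the URL-encoded form of the JQL
filter "project = NUTCH AND component = documentation". A minimal sketch of
how such a link could be generated, assuming Python; the helper name
documentation_issues_url is hypothetical and only for illustration:

```python
from urllib.parse import quote

# Base of the Apache JIRA issue search page, taken from the link above.
JIRA_SEARCH = "https://issues.apache.org/jira/issues?jql="


def documentation_issues_url(project: str = "NUTCH",
                             component: str = "documentation") -> str:
    """Build a JIRA search URL for one component of one project (hypothetical helper)."""
    jql = f"project = {project} AND component = {component}"
    # quote() percent-encodes spaces as %20 and '=' as %3D.
    return JIRA_SEARCH + quote(jql)


print(documentation_issues_url())
```

A link built this way could be placed on the wiki so that people can find open documentation tasks.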
I am not saying that Confluence is best for this project. But in
my opinion the Nutch docs should be moved to some community/social
solution. It would be great if it enabled comments and pull
requests (like on GitHub) to improve them.
AFAICT the wiki we currently have IS community oriented. Anyone over
the years who has wished to add/edit has been granted Karma to do so.
Are you really saying that enabling pull requests via GitHub is a
better way than simply granting someone Karma to edit a page as they wish?
I think yes. I believe that such an approach is useful for people who
encountered problems with some specific part of Nutch but do not want to
contribute continuously.
I'm thinking about simplifying a scenario like "Hey, this doc is wrong. I
will send a pull request (JIRA issue) with fixes". In my opinion, in this
kind of situation people will not want to subscribe to the mailing list and
ask for access to doc editing. This also creates the possibility of
reviewing such requests.
Honestly I haven't seen anything from your commentary which would
suggest benefits for Nutch as a whole... I am trying NOT to be
pessimistic, but I am just struggling to see your point here.
If the wiki is outdated... then we should update it. Not change to
another solution just so we can receive pull requests for documentation.
There is an argument to make it as easy as possible to contribute
documentation to Nutch. However as far as I can see, there are not
crowds of people rushing to contribute.
Please don't take these comments negatively. I am behind any motion to
make documentation better. I just don't see eye-to-eye with some of
your points.
I believe that some people would like to contribute some small pieces of
the doc, but if the process is too complicated they are too lazy to do
it. It is normal for our kind. We were lazy, so we created computers
and then crawlers :)