[
https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899815#action_12899815
]
Andrzej Bialecki commented on NUTCH-881:
-----------------------------------------
bq. So what is new in Nutch 2.0 which doesn't appear in Nutch 1.x ? Gora is the
main thing which comes to mind.
Yes. We also removed all search-related code from Nutch and rely exclusively on
Solr to perform searching. This means that some APIs have been removed (e.g.
query filters, text analysis, Lucene indexing backend).
bq. How do the config files differ?
We still use the same nutch-default/nutch-site.xml, plus per-plugin config
files. Some properties have changed, e.g. the ones that limit the max. number
of URLs per host in the generator. We added some Gora-related files, gora.properties and
gora-*-mapping.xml, that define what driver to use and how to map webtable
columns onto storage-specific columns/fields.
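To make the mapping idea concrete, here is a rough sketch in plain Python. It is not Gora code and does not reproduce the format of gora-*-mapping.xml; it only models how a mapping could translate logical webtable field names into storage-specific column names. The field names, the column names and the map_row helper are all hypothetical.

```python
# Hypothetical illustration only; not the Gora API or its mapping file format.

# Logical webtable field -> (column family, qualifier) for an imagined
# column-family store. Every name below is made up for this example.
COLUMN_STORE_MAPPING = {
    "baseUrl": ("f", "bas"),
    "status": ("f", "st"),
    "content": ("p", "c"),
}


def map_row(row, mapping):
    """Translate a row keyed by logical field names into storage columns.

    Raises KeyError if the row contains a field the mapping does not define.
    """
    mapped = {}
    for field, value in row.items():
        family, qualifier = mapping[field]
        mapped[f"{family}:{qualifier}"] = value
    return mapped


if __name__ == "__main__":
    row = {"baseUrl": "http://example.com/", "status": 2}
    print(map_row(row, COLUMN_STORE_MAPPING))
```

Swapping in a different mapping dict (say, one that targets SQL column names) would leave the calling code unchanged, which is the point of keeping the mapping in a separate file.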
bq. How does Nutch's use of Hadoop differ?
All jobs now use GoraInputFormat / GoraOutputFormat, which hide the details
of the actual data storage backend.
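As a conceptual sketch of what "hiding the storage backend" means for a job: the Python below is not Nutch or Gora code (GoraInputFormat and GoraOutputFormat are Java classes and are not used here). The DataStore interface, InMemoryStore and run_job are hypothetical names, chosen only to show a job that reads and writes records without knowing which backend sits behind the interface.

```python
# Hypothetical illustration only; none of these names exist in Nutch or Gora.
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Minimal storage interface that a job is written against."""

    @abstractmethod
    def put(self, key, record):
        """Store a record under the given key."""

    @abstractmethod
    def scan(self):
        """Yield (key, record) pairs."""


class InMemoryStore(DataStore):
    """Stand-in backend; a real one might talk to HBase or a SQL database."""

    def __init__(self):
        self._rows = {}

    def put(self, key, record):
        self._rows[key] = record

    def scan(self):
        # Sorted by key so iteration order is deterministic.
        return iter(sorted(self._rows.items()))


def run_job(source, sink, transform):
    """Read every record from source, transform it, and write it to sink.

    Returns the number of records processed. The job never inspects the
    concrete store classes, only the DataStore interface.
    """
    count = 0
    for key, record in source.scan():
        sink.put(key, transform(record))
        count += 1
    return count


if __name__ == "__main__":
    src, dst = InMemoryStore(), InMemoryStore()
    src.put("http://example.com/", {"status": "unfetched"})
    processed = run_job(src, dst, lambda r: {**r, "status": "fetched"})
    print(processed, list(dst.scan()))
```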
bq. How do the command lines differ? (Presumably you need different command
lines to say where to store the crawldb, right?)
Yes. Actually, this could be a separate issue to be solved - currently we
assume there is one Nutch webtable per storage backend, so we don't specify the
"db identifier" anywhere... but this prevents us from defining multiple crawl
configs that use the same backend, so it should be addressed.
> Good quality documentation for Nutch
> ------------------------------------
>
> Key: NUTCH-881
> URL: https://issues.apache.org/jira/browse/NUTCH-881
> Project: Nutch
> Issue Type: Improvement
> Components: documentation
> Affects Versions: 2.0
> Reporter: Andrzej Bialecki
>
> This is, and has been, a long standing request from Nutch users. This becomes
> an acute need as we redesign Nutch 2.0, because the collective knowledge and
> the Wiki will no longer be useful without a massive amount of editing.
> IMHO the reference documentation should be in SVN, and not on the Wiki - the
> Wiki is good for casual information and recipes but I think it's too messy
> and not reliable enough as a reference.
> I propose to start with the following:
> 1. let's decide on the format of the docs. Each format has its own pros and
> cons:
> * HTML: easy to work with, but formatting may be messy unless we edit it by
> hand, at which point it's no longer so easy... Good toolchains to convert to
> other formats, but limited expressiveness of larger structures (e.g. book,
> chapters, TOC, multi-column layouts, etc).
> * Docbook: learning curve is higher, but not insurmountable... Naturally
> yields very good structure. Figures/diagrams may be problematic - different
> renderers (html, pdf) like to treat the scaling and placing somewhat
> differently.
> * Wiki-style (Confluence or TWiki): easy to use, but limited control over
> larger structures. Maven Doxia can format cwiki, twiki, and a host of other
> formats to e.g. html and pdf.
> * other?
> 2. start documenting the main tools and the main APIs (e.g. the plugins and
> all the extension points). We can of course reuse material from the Wiki and
> from various presentations (e.g. the ApacheCon slides).