[
https://issues.apache.org/jira/browse/NUTCH-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2531:
-----------------------------------
Fix Version/s: 2.5
> Unclear steps in Nutch2 Tutorial
> --------------------------------
>
> Key: NUTCH-2531
> URL: https://issues.apache.org/jira/browse/NUTCH-2531
> Project: Nutch
> Issue Type: Improvement
> Reporter: Krzysztof Madejski
> Priority: Minor
> Fix For: 2.5
>
>
> I was trying to install Nutch based on this tutorial:
> [https://wiki.apache.org/nutch/Nutch2Tutorial]
>
> Issues I've found:
> In "Obtaining Software and Configuration":
> 1. _"Specify the [...] along with all of the other Configuration options
> suggested within the [Nutch 1.x
> tutorial|http://wiki.apache.org/nutch/NutchTutorial]."_
> It would be better to copy the necessary configuration. I have no idea which
> settings exactly should be copied.
> 2. _"In addition add the missing hbase-common-0.98.8-hadoop2.jar transitive
> dependency, this is a bug in gora-hbase 0.6.1 as described
> [here|https://github.com/apache/gora/pull/21]. This bug is removed in current
> Gora development."_
> What does this step require from me? Should I add something to the
> dependencies? In which file? This point is written as information rather
> than as an instruction. It should either be deleted or replaced with a clear
> instruction.
> 3. _"*N.B.* It's probably worth checking and setting all your usual
> configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before
> progressing."_
> It's my first install. There is no such thing as "usual configuration".
> In "Invoke Nutch":
> 1. "nutch readdb" doesn't return anything meaningful apart from its usage
> message:
> ./nutch readdb
> Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
>                       [-crawlId <id>] [-content] [-headers] [-links] [-text]
>     -crawlId <id>  - the id to prefix the schemas to operate on,
>                      (default: storage.crawl.id)
>     -stats [-sort] - print overall statistics to System.out
>     [-sort]        - list status sorted by host
>     -url <url>     - print information on <url> to System.out
>     -dump <out_dir> [-regex regex] - dump the webtable to a text file in
>                      <out_dir>
>     -content       - dump also raw content
>     -headers       - dump protocol headers
>     -links         - dump links
>     -text          - dump extracted text
>     [-regex]       - filter on the URL of the webtable entry