Krzysztof Madejski created NUTCH-2531:
-----------------------------------------

             Summary: Unclear steps in Nutch2 Tutorial
                 Key: NUTCH-2531
                 URL: https://issues.apache.org/jira/browse/NUTCH-2531
             Project: Nutch
          Issue Type: Improvement
            Reporter: Krzysztof Madejski


I was trying to install Nutch based on this tutorial:
[https://wiki.apache.org/nutch/Nutch2Tutorial]

 

Issues I've found:

In "Obtaining Software and Configuration":
1. _"Specify the [...] along with all of the other Configuration options 
suggested within the [Nutch 1.x 
tutorial|http://wiki.apache.org/nutch/NutchTutorial]."_
  It would be better to copy the necessary configuration into the tutorial 
itself. I have no idea which settings exactly should be copied.

2. _"In addition add the missing hbase-common-0.98.8-hadoop2.jar transitive 
dependency, this is a bug in gora-hbase 0.6.1 as described 
[here|https://github.com/apache/gora/pull/21]. This bug is removed in current 
Gora development."_
  What does this step require from me? Should I add something to the 
dependencies? In which file? This point is written as information rather than 
as an instruction. It should either be deleted or replaced with a clear 
instruction.

3. _"*N.B.* It's probably worth checking and setting all your usual 
configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before 
progressing."_
   It's my first install. There is no such thing as "usual configuration".

In "Invoke Nutch":
1. "nutch readdb" doesn't return anything meaningful apart from its usage 
message:
./nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex]) 
 [-crawlId <id>] [-content] [-headers] [-links] [-text]
 -crawlId <id> - the id to prefix the schemas to operate on, 
 (default: storage.crawl.id)
 -stats [-sort] - print overall statistics to System.out
 [-sort] - list status sorted by host
 -url <url> - print information on <url> to System.out
 -dump <out_dir> [-regex regex] - dump the webtable to a text file in 
 <out_dir>
 -content - dump also raw content
 -headers - dump protocol headers
 -links - dump links
 -text - dump extracted text
 [-regex] - filter on the URL of the webtable entry



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
