Hi Lewis,
> Please see http://wiki.apache.org/nutch/Nutch2Tutorial which is an
> update of Julien's (I think) page on GORA_HBase. This will get you
> rocking with HBase. The changes between Cassandra, Accumulo and the
> other data stores are fairly trivial.
I've managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
Much simpler than 1.x (no segments!).
Below are a couple of problems I've run into (possible issues to be addressed in
2.1).
Cheers,
Sebastian
% ./bin/nutch readdb -stats
WebTable statistics start
WebTableReader: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
--> readdb -dump works.
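The top frames of the trace show DataInputStream.readFully failing while the SequenceFile reader is being initialized, which suggests that a file header could not be read in full. The following is only a minimal sketch of that failure mode, assuming (not verified) that one of the files handed to getReaders is empty or truncated. The class EmptyHeaderDemo, the method headerReadable and the 4-byte header length are made up for illustration and are not part of Nutch or Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical demo class, not Nutch or Hadoop code.
public class EmptyHeaderDemo {

    // Tries to read a fixed-size header from the given bytes.
    // Returns false if the input ends before the header is complete.
    static boolean headerReadable(byte[] fileContents, int headerLength) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileContents))) {
            byte[] header = new byte[headerLength];
            in.readFully(header); // throws EOFException on short input
            return true;
        } catch (EOFException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Empty "file": the header cannot be read.
        System.out.println(headerReadable(new byte[0], 4));                      // prints false
        // Four bytes available: the header read succeeds.
        System.out.println(headerReadable(new byte[] {'S', 'E', 'Q', 6}, 4));    // prints true
    }
}
```

If this assumption holds, listing the job's output directory for zero-length files would be a quick check.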
% ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
Exception in thread "main" java.lang.IllegalArgumentException: arg -parse not recognized
% ./bin/nutch parse -all -force -resume
ParserJob: starting
ParserJob: resuming: false <<< -resume and
ParserJob: forced reparse: false <<< -force obviously ignored ?
ParserJob: parsing all
% ./bin/nutch generate
--> generates batchid, but should show help as in 1.x ?
--> is there an option -topN ?
The 2.0 Solr schema and mappings still contain the field "site",
which was removed in 1.x (NUTCH-1232).
It should be removed in 2.0 as well: it's easier to maintain a single Solr
installation for all Nutch versions.