Earl Cahill wrote:

Guess I figured as much.  Can I suggest that someone
typing
bin/nutch admin ...

in the mapred branch, should get pointed to the
proper command, or at least a message saying that

There is no separate command. For now the DB is created when you run Injector or Crawl (which calls Injector as its first step). The other commands from the script should work very similarly to the old ones, even though they now use different implementations:

* inject - runs Injector to add URLs from plaintext files (one URL per line; there may be many input files, and they must be placed inside a directory). This creates the CrawlDB in the destination directory if it didn't exist before, or updates the existing one. Note that the new CrawlDB does NOT contain links - they are stored separately in a LinkDB, and CrawlDB just stores the equivalents of Page in the former WebDB.

* generate - runs Generate to create new fetchlists

* fetch - runs the modified Fetcher to fetch segments

* updatedb - runs CrawlDB.update() to update the CrawlDB with new page information, and to add new unfetched pages.

* invertlinks - creates or updates a LinkDB, containing incoming link information. Note that it takes as its argument the top-level dir that contains the new segments, and not the dir names of the individual segments.

* index - runs the new modified Indexer to create an index of the fetched segments.
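
As a rough sketch of the input layout that inject expects (plain-text files, one URL per line, all inside one directory), something like the following could be used. The directory name, file names and URLs are made up for illustration, and the arguments to bin/nutch are left out on purpose, since they may differ in the mapred branch; check the usage line of the script for the real ones.

```shell
#!/bin/sh
# Sketch only: prepare seed URLs for the inject step.
# SEED_DIR, the file names and the URLs are hypothetical examples.
set -eu

SEED_DIR="${SEED_DIR:-./urls}"
mkdir -p "$SEED_DIR"

# Several input files are allowed, as long as they sit in the same directory.
printf '%s\n' 'http://example.com/' 'http://example.org/' > "$SEED_DIR/seeds-a.txt"
printf '%s\n' 'http://example.net/' > "$SEED_DIR/seeds-b.txt"

# Count the non-empty lines, i.e. the URLs that would be handed to inject.
cat "$SEED_DIR"/*.txt | grep -c .

# Order of the steps listed above. Arguments are deliberately elided;
# this loop only prints the sequence and does not run Nutch.
for step in inject generate fetch updatedb invertlinks index; do
    echo "bin/nutch $step ..."
done
```

The loop only prints the order of the steps; running "./nutch crawl" instead should cover the same sequence, starting with Injector.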

The above commands read the mapred configuration, which for now defaults to "local". This means that all Jobs execute within the same JVM, and NDFS also defaults to local. The rest of the commands in bin/nutch have to do with a distributed setup.

admin doesn't exist in the mapred branch, just to save
some confusion.  There is a dumb patch below that
would change the usage line.

I think such differences are all the more reason to
have a nice mapred tutorial, which I would be more
than willing to help with.  I thought I was close, but

Yes, I agree. But some command-line tools are still missing, or not yet ported to use mapred. At this point a general tutorial would be difficult... unless it were simply "you need to run ./nutch crawl" ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


