Hello,

I currently have Nutch running on Hadoop. However, for one specific crawl,
I would like to store the data on the local machine instead of putting it
on Hadoop.

To do this, I modified Crawl.java to switch the filesystem to local:
Configuration conf = NutchConfiguration.create();
conf.addDefaultResource("crawl-tool.xml");
FileSystem localFs = FileSystem.getNamed("local", conf);
JobConf job = new NutchJob(localFs.getConf());

Path dir = new Path(some_local_path_on_the_machine);
Path crawlDb = new Path(dir + "/crawldb");
Path linkDb = new Path(dir + "/linkdb");
Path segments = new Path(dir + "/segments");
Path indexes = new Path(dir + "/indexes");
Path index = new Path(dir + "/index");
Path rootURL = new Path(local_path_on_the_machine);

Injector injector = new Injector(conf);
Generator generator = new Generator(conf);
Fetcher fetcher = new Fetcher(conf);
ParseSegment parseSegment = new ParseSegment(conf);
CrawlDb crawlDbTool = new CrawlDb(conf);
LinkDb linkDbTool = new LinkDb(conf);
Indexer indexer = new Indexer(conf);
DeleteDuplicates dedup = new DeleteDuplicates(conf);
IndexMerger merger = new IndexMerger(conf);

// initialize crawlDb
injector.inject(crawlDb, rootURL);
// and so on...

When I run it, I keep getting either
Injector: starting
Injector: crawlDb: crawl_db path
Injector: urlDir: url path
Injector: Converting injected urls to crawl db entries.
Connection refused

or 

Injector: starting
Injector: crawlDb: crawldb path
Injector: urlDir: url path
Injector: Converting injected urls to crawl db entries.
Input path doesnt exist : url path

However, the URL path does exist on the local machine.
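
One possible explanation, offered only as a sketch: Hadoop resolves a path that has no scheme against the configured default filesystem. If the Configuration handed to the tools still names HDFS as the default, a local directory could be looked up on HDFS and reported as missing. The plain-Java class below only imitates that resolution with java.net.URI; it does not call Hadoop or Nutch. The class name PathQualifier, the host namenode:9000, and the path /data/urls are hypothetical and are not taken from the setup above.

```java
import java.net.URI;

// Illustration only: imitates how a scheme-less path is qualified against
// a default filesystem URI. This is not the Hadoop implementation.
public class PathQualifier {

    // Returns the path unchanged if it already names a scheme (for example
    // "file:"); otherwise resolves it against the default filesystem URI.
    static URI qualify(String defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p;
        }
        return URI.create(defaultFs).resolve(p);
    }

    public static void main(String[] args) {
        // Hypothetical default filesystem pointing at an HDFS namenode.
        String hdfsDefault = "hdfs://namenode:9000/";

        // A bare path inherits the scheme and authority of the default.
        System.out.println(qualify(hdfsDefault, "/data/urls"));

        // A path with an explicit file: scheme is left as it is.
        System.out.println(qualify(hdfsDefault, "file:///data/urls"));
    }
}
```

If this is what is happening, writing the paths with an explicit file: scheme, or constructing the tools from a Configuration whose default filesystem is local, might be worth trying. Both are assumptions to verify against the installed Hadoop and Nutch versions.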

Can someone give me pointers on what is going on, or on how to store crawl
data on a local machine? I am not sure whether this is the correct way to
put the data on the local machine.

Thank you very much.

Hanna
-- 
View this message in context: 
http://www.nabble.com/Nutch-Crawl-tf3717311.html#a10399449
Sent from the Nutch - User mailing list archive at Nabble.com.

