Hello,
I currently have Nutch running on Hadoop. However, for one specific crawl,
I would like to store the data on the local machine instead of putting it on
Hadoop.
I modified Crawl.java to change the filesystem to local:
Configuration conf = NutchConfiguration.create();
conf.addDefaultResource("crawl-tool.xml");
FileSystem localFs = FileSystem.getNamed("local", conf);
JobConf job = new NutchJob(localFs.getConf());
Path dir = new Path(some_local_path_on_the_machine);
Path crawlDb = new Path(dir + "/crawldb");
Path linkDb = new Path(dir + "/linkdb");
Path segments = new Path(dir + "/segments");
Path indexes = new Path(dir + "/indexes");
Path index = new Path(dir + "/index");
Path rootURL = new Path(local_path_on_the_machine);
Injector injector = new Injector(conf);
Generator generator = new Generator(conf);
Fetcher fetcher = new Fetcher(conf);
ParseSegment parseSegment = new ParseSegment(conf);
CrawlDb crawlDbTool = new CrawlDb(conf);
LinkDb linkDbTool = new LinkDb(conf);
Indexer indexer = new Indexer(conf);
DeleteDuplicates dedup = new DeleteDuplicates(conf);
IndexMerger merger = new IndexMerger(conf);
// initialize crawlDb
injector.inject(crawlDb, rootURL);
and so on...
When I run it, I keep getting either
Injector: starting
Injector: crawlDb: crawl_db path
Injector: urlDir: url path
Injector: Converting injected urls to crawl db entries.
Connection refused
or
Injector: starting
Injector: crawlDb: crawldb path
Injector: urlDir: url path
Injector: Converting injected urls to crawl db entries.
Input path doesnt exist : url path
However, the url path does exist.
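To double-check that the url path really is on the local disk, a small standalone check along these lines can be used. This is only a sketch: the class name LocalPathCheck, the helper listSeedFiles, and the default directory name "urls" are made up for illustration and are not part of Nutch. It uses only the JDK, so it says nothing about which filesystem the Nutch job itself resolves the path against.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch or Hadoop.
public class LocalPathCheck {

    // Returns the regular files directly under dir, sorted by path.
    // Returns an empty list if dir is not a directory on the local disk.
    static List<Path> listSeedFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return files;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(Files::isRegularFile).sorted().forEach(files::add);
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // "urls" is an assumed default; pass the real url directory as the first argument.
        Path urlDir = Paths.get(args.length > 0 ? args[0] : "urls");
        System.out.println("absolute path: " + urlDir.toAbsolutePath());
        System.out.println("is directory:  " + Files.isDirectory(urlDir));
        for (Path f : listSeedFiles(urlDir)) {
            System.out.println("seed file: " + f.getFileName()
                    + " (" + Files.size(f) + " bytes)");
        }
    }
}
```

If this lists the seed files while the Injector still reports that the input path does not exist, the two are presumably not looking at the same filesystem.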
Can someone give me pointers as to what's going on, or on how to store
crawl data on a local machine? I am not sure whether this is the correct
way of putting the data on the local machine.
Thank you very much.
Hanna
--
View this message in context:
http://www.nabble.com/Nutch-Crawl-tf3717311.html#a10399449
Sent from the Nutch - User mailing list archive at Nabble.com.