It seems to me that you may be hitting the same disk-space problem as before. This can happen because you run mergesegs; try not merging segments.
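For what it's worth, a crawl cycle with the mergesegs step left out might be sketched as below. This is only an illustration: the crawl directory layout, the Solr URL and all paths are assumptions, not taken from the actual script.

```shell
#!/bin/sh
# Hypothetical sketch of a Nutch 1.3 crawl cycle with mergesegs omitted.
# CRAWL and SOLR are placeholder values for illustration only.
CRAWL=crawl
SOLR=http://localhost:8983/solr

bin/nutch inject $CRAWL/crawldb urls

bin/nutch generate $CRAWL/crawldb $CRAWL/segments
SEGMENT=`ls -d $CRAWL/segments/* | tail -1`   # most recently generated segment

bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWL/crawldb $SEGMENT

# No mergesegs here: invert links and index the new segment directly.
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch solrindex $SOLR $CRAWL/crawldb $CRAWL/linkdb $SEGMENT
```

Without the merge, each cycle's segment is indexed on its own, so the intermediate copy that mergesegs writes (and the disk space it needs) never materialises.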
Alex.

-----Original Message-----
From: McGibbney, Lewis John <lewis.mcgibb...@gcu.ac.uk>
To: user <user@nutch.apache.org>
Sent: Wed, Apr 6, 2011 12:55 pm
Subject: Script failing when arriving at 'Solr' commands

Hi list,

The last week has been a real hang-up and I have made very little progress, so please excuse this lengthy post. I am using branch-1.3. My script contains the following commands:

1. inject
2. generate, fetch, parse, updatedb
3. mergesegs
4. invertlinks
5. solrindex
6. solrdedup
7. solrclean
8. load new index

The script runs fine until the solrindex stage, where I get this output:

LinkDb: starting at 2011-04-06 20:25:40
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/segments/20110406202533
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2011-04-06 20:25:44, elapsed: 00:00:03

----- SolrIndex (Step 5 of 8) -----
SolrIndexer: starting at 2011-04-06 20:25:45
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/parse_data
Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/parse_text
Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/NEWindexes/current

----- SolrDedup (Step 6 of 8) -----
Usage: SolrDeleteDuplicates <solr url>

----- SolrClean (Step 7 of 8) -----
SolrClean: starting at 2011-04-06 20:25:47
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.indexer.solr.SolrClean.delete(SolrClean.java:168)
	at org.apache.nutch.indexer.solr.SolrClean.run(SolrClean.java:180)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.indexer.solr.SolrClean.main(SolrClean.java:186)

Having inspected the linkdb I can see a directory named 'current', which in turn contains a 'part-00000' directory holding two files named 'data' and 'index'... as far as I am aware this is identical to what I had when using Nutch-1.2.

A couple of points to note about my recent discoveries and thoughts:

1. I was having problems with the script (a similar 'Input path does not exist' error) until I added the hadoop.tmp.dir property, pointing at an HDD partition, to nutch-site; this seemed to solve the problem.
2. I am aware that it may not be necessary (and possibly not best practice in some situations) to run an invertlinks command prior to indexing. However, this has always been my practice and always gave great results with the legacy Lucene indexing in Nutch-1.2, so I am curious whether it is this command that is knocking off the solrindexer.
3. Is it possible that there is a similar property, such as solr.tmp.dir, that I need to set and am missing, and that this is knocking the solrindexer off?
4.
Even after the solrindexer kicks in, the solrdedup output does not appear to be responding correctly; this is shadowed by solrclean, so I am definitely doing something wrong here. However, I am unfamiliar with the IOException "No FileSystem for scheme: http".

I understand that this post may seem a bit epic, but from the information I have (e.g. logs, terminal output and the user lists) I am stumped. I'm therefore looking for people with more experience to possibly lend a hand. I can provide additional command parameters if that would be of value.

Thanks in advance for any help
Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education’s Widening Participation Initiative of the Year 2009 and Herald Society’s Education Initiative of the Year 2009.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html

Winner: Times Higher Education’s Outstanding Support for Early Career Researchers of the Year 2010, GCU as a lead with Universities Scotland partners.
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
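A note on the Solr steps in the message above: the "Usage: SolrDeleteDuplicates <solr url>" message suggests the Solr URL never reached solrdedup, and "No FileSystem for scheme: http" typically means an http URL ended up where Hadoop expected a filesystem path. The sketch below shows one plausible way these steps are invoked; the Solr URL and crawl paths are placeholders, and the exact argument order should be checked against each command's own usage output (`bin/nutch solrindex`, etc.) rather than taken from here.

```shell
# Hypothetical invocations of the Solr-related steps; SOLR, CRAWL and
# SEGMENT are placeholder values, not the original poster's settings.
SOLR=http://localhost:8983/solr
CRAWL=crawl
SEGMENT=`ls -d $CRAWL/segments/* | tail -1`

# solrindex expects the Solr URL first, then crawldb, linkdb and segment(s).
bin/nutch solrindex $SOLR $CRAWL/crawldb $CRAWL/linkdb $SEGMENT

# solrdedup takes only the Solr URL; the Usage message in the post
# suggests it was invoked without one.
bin/nutch solrdedup $SOLR

# solrclean takes a crawldb path and the Solr URL; passing the URL where
# a path is expected would produce "No FileSystem for scheme: http".
bin/nutch solrclean $CRAWL/crawldb $SOLR
```

If the script builds these command lines from variables, an empty or misplaced variable would produce exactly the two failure modes shown in the logs, so echoing each command before running it is a cheap way to confirm.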