It seems to me that you may have the same disk-space problem as before. This can happen because you run mergesegs. Try not to merge segments; see the sketch below.
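A minimal sketch of what I mean, assuming the standard crawl/crawldb, crawl/segments and crawl/linkdb layout (your paths and Solr URL will differ):

    # skip mergesegs: while it runs it needs roughly as much free space
    # again as the segments it merges, which is what can exhaust the disk
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*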

Alex.

-----Original Message-----
From: McGibbney, Lewis John <lewis.mcgibb...@gcu.ac.uk>
To: user <user@nutch.apache.org>
Sent: Wed, Apr 6, 2011 12:55 pm
Subject: Script failing when arriving at 'Solr' commands


Hi list,

The last week has been a real hang-up and I have made very little progress, so 
please excuse this lengthy post. I am using branch-1.3. My script contains the 
following commands (a rough sketch of the first few invocations follows the list):

1. inject
2. generate
   fetch
   parse
   updatedb
3. mergesegs
4. invertlinks
5. solrindex
6. solrdedup
7. solrclean
8. load new index
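
Roughly, steps 1-4 look like this in my script (paths simplified here, and these 
all complete without complaint):

    # 1. inject seed URLs into the crawldb
    bin/nutch inject crawl/crawldb urls
    # 2. generate a fetch list, then fetch/parse/update from the newest segment
    bin/nutch generate crawl/crawldb crawl/segments
    SEGMENT=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    # 3. merge the per-cycle segments into one
    bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
    # 4. invert links to build the linkdb
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments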

The script runs fine until the solrindex stage, which produces the following output:

LinkDb: starting at 2011-04-06 20:25:40
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/segments/20110406202533
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2011-04-06 20:25:44, elapsed: 00:00:03
----- SolrIndex (Step 5 of 8) -----
SolrIndexer: starting at 2011-04-06 20:25:45
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/crawl_fetch
Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/crawl_parse
Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/parse_data
Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/parse_text
Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/NEWindexes/current
----- SolrDedup (Step 6 of 8) -----
Usage: SolrDeleteDuplicates <solr url>
----- SolrClean (Step 7 of 8) -----
SolrClean: starting at 2011-04-06 20:25:47
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.indexer.solr.SolrClean.delete(SolrClean.java:168)
        at org.apache.nutch.indexer.solr.SolrClean.run(SolrClean.java:180)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrClean.main(SolrClean.java:186)
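
For comparison, my understanding of the call SolrIndexer expects in 1.3 is the 
following (the Solr URL and segment path are illustrative, and please correct me 
if I have the usage wrong):

    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/20110406202533

What puzzles me is that the missing inputs above (crawl_fetch, crawl_parse, 
parse_data, parse_text) are reported under crawl/linkdb, yet those are 
subdirectories I would only expect to find inside a segment.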

Having inspected the linkdb I can see a directory named 'current', which in turn 
contains a 'part-00000' directory holding two files named 'data' and 'index'... 
as far as I am aware this is identical to what I had with Nutch-1.2. A couple of 
points to note about my recent discoveries and thoughts:

1. I was having problems with the script, with a similar 'Input path does not 
exist' error, until I pointed the hadoop.tmp.dir property at an HDD partition in 
nutch-site.xml; this seemed to solve the problem (the property is sketched after 
this list).
2. I am aware that it is perhaps not necessary (and possibly not best practice in 
some situations) to include an invertlinks command prior to indexing. However, 
this has always been my practice and always gave great results with the legacy 
Lucene indexing in Nutch-1.2, so I am curious whether it is this command that is 
knocking out the solrindexer.
3. Is it possible that there is a similar property, such as a solr.tmp.dir, that 
I have missed and need to set, and that this is what is knocking solrindexer off?
4. Even after solrindexer kicks in, the solrdedup output does not appear to 
respond correctly, and the same is shadowed by solrclean, so I am definitely 
doing something wrong here; however, I am unfamiliar with the 'IOException: No 
FileSystem for scheme: http' (my best guess at the intended solrclean usage is 
also sketched below).
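
The property I added to nutch-site.xml for point 1 looks like this (the value is 
simply the partition I happened to pick, so treat the path as illustrative):

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/mnt/hdd/hadoop-tmp</value>
    </property>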
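On point 4, the stack trace shows Path.getFileSystem being handed an http:// 
path, as if solrclean is reading the Solr URL where a local crawldb path is 
expected. My best guess at the intended usage (again, correct me if I am wrong) is:

    bin/nutch solrclean crawl/crawldb http://localhost:8983/solr/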

I understand that this post may seem a bit epic, but even with the information I 
have (logs, terminal output and the user lists) I am stumped. I am therefore 
looking for people with more experience to possibly lend a hand. I can provide 
additional command parameters if that would be of value.

Thanks in advance for any help,

Lewis



