I also got an error when running the command
"$ bin/nutch inject crawl/crawldb dmoz":
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: dmoz
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
        at org.apache.nutch.crawl.Injector.main(Injector.java:164)
I found the error message in the logs folder.
But when I run the same source code in my Windows Eclipse environment, I get
no errors and no logs folder. The resulting crawl folder has the correct five
subfolders.
I copied the *nutch.war file to the Tomcat webapps ROOT directory and set the
path of the above crawl folder in nutch-site.xml under the WEB-INF/classes
folder.
However, Nutch does not work under Tomcat.
Any feedback will be much appreciated!!
Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
-----Original Message-----
From: Tsengtan A Shuy [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 02, 2007 2:34 PM
To: '[EMAIL PROTECTED]'
Subject: generate command fails in cygwin environment.
I followed the tutorial at
http://lucene.apache.org/nutch/tutorial8.html#Intranet%3A+Running+the+Crawl
and executed the "$ bin/nutch generate crawl/crawldb crawl/segments"
command in my cygwin environment.
I got the following error message:
Generator: starting
Generator: segment: crawl/segments/20070702142541
Generator: Selecting best-scoring urls due for fetch.
Exception in thread "main" java.io.IOException: Input directory E:/cygwin/home/Administrator/nutch-0.8.1/crawl/crawldb/current in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:319)
        at org.apache.nutch.crawl.Generator.main(Generator.java:395)
Do you know how to solve the problem?
Any feedback will be much appreciated.
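One check that may help narrow this down: as the path in the exception shows,
generate reads its input from the "current" subdirectory of the crawldb, which
would normally have been created by a successful inject. A minimal shell
sketch for confirming that the directory exists before running generate
(check_crawldb is a hypothetical helper name, not part of Nutch):

```shell
# Hypothetical helper, not part of Nutch: report whether the "current"
# subdirectory that generate expects is present under the given crawldb.
check_crawldb() {
  # $1 is the crawldb path passed to inject/generate, e.g. crawl/crawldb
  if [ -d "$1/current" ]; then
    echo "ok: $1/current exists"
  else
    echo "missing: $1/current"
    return 1
  fi
}
```

It could be called from the nutch-0.8.1 directory as
"check_crawldb crawl/crawldb". If it reports the directory as missing, the
earlier inject step most likely did not complete.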
-----Original Message-----
From: Chris Hane [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 02, 2007 12:45 PM
To: [EMAIL PROTECTED]
Subject: Re: Adding meta data to searched documents
Enis - thanks for the pointer.
Enis Soztutar wrote:
> You can write index plugins. Please first read the (slightly outdated)
> tutorial and then check http://wiki.apache.org/nutch/PluginCentral.
> Optionally you may want to write html parse plugins depending on the
> source of the data.
>
> Chris Hane wrote:
>> I am looking to use nutch to crawl/index a website. A lot of the
>> pages have videos on them. We have transcripts for the videos that we
>> would like to be included for indexing; but we do not want to put the
>> transcripts on the web pages.
>>
>> Is there a way to "add" this information to a given web page for
>> purposes of indexing as part of the crawl process? Maybe another
>> point in the process before the index is generated? I am hoping there
>> is a point in the crawl process where I can add augmented content to a
>> page in the nutch segment (rough thought based on very limited time
>> spent looking at nutch).
>>
>> We are comfortable using java and can write custom code as needed. I
>> would appreciate any pointers on where to look in the nutch code.
>>
>> Thanks in advance,
>> Chris.....
>>
>