Hi Gong Zhao,
Make sure the parse-pdf plugin is enabled in your nutch-site.xml file,
i.e. that the plugin.includes property lists it:
<property>
  <name>plugin.includes</name>
  <value>...|parse-(xml|text|html|js|pdf)|...</value>
  <description>
  </description>
</property>
That's the only thing I can think of at first glance.
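If it helps, here is a rough Python sketch for checking whether a given plugin id is covered by plugin.includes. It rests on two assumptions: that Nutch treats the value as a regular expression that must match the whole plugin id, and that Python's regex syntax is close enough to Java's for a simple pattern like this one. The sample configuration and the function names are made up for illustration; they are not taken from your setup.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative config only: this plugin list is a made-up sample,
# not the actual nutch-site.xml from this thread.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic</value>
  </property>
</configuration>
"""


def plugin_includes(xml_text):
    """Return the plugin.includes value, or None if the property is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "plugin.includes":
            return prop.findtext("value", default="").strip()
    return None


def is_plugin_enabled(xml_text, plugin_id):
    """Assumption: a plugin is enabled when its id fully matches the regex."""
    pattern = plugin_includes(xml_text)
    if pattern is None:
        return False
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    print(is_plugin_enabled(SAMPLE_SITE_XML, "parse-pdf"))
```

To try it on a real file, read nutch-site.xml into a string and pass it in place of SAMPLE_SITE_XML.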
Patrick
-----Original Message-----
From: 宫照 [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 23, 2008 11:27 PM
To: [email protected]
Subject: nutch fetched but no indexed
Hi everybody,
I have run into a problem using Nutch. I use Nutch to crawl our intranet, and
it worked well before. Recently I added some URLs to crawl that are different
from the usual ones. The new URLs look like this:
http://compass.mydomain.com/go/247460034
There are many folders and documents under this URL. For example, a folder:
http://compass.mot.com/go/247460034/2354342276
and a document:
http://compass.mot.com/go/247460034/mydoc.pdf
After the crawl, the documents under these URLs cannot be found by search.
I checked the log and found that these URLs are fetched, but they are not
indexed.
I don't know why. Can you tell me what to do?
regards,
Gong Zhao