Hi Patrick,

Thank you for your advice.

My nutch-site.xml file is already set as you said, and I can search PDF files
under other URLs.
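
For anyone who wants to double-check the same setting, here is a rough sketch of one way to do it. It assumes plugin.includes is treated as a regular expression that must match the whole plugin id; the sample value below is made up for illustration (the real value in this thread was elided), and plugin_enabled is just a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET


def plugin_enabled(site_xml: str, plugin_id: str = "parse-pdf") -> bool:
    """Return True if the plugin.includes regex in site_xml matches plugin_id.

    Assumption: the value is matched against the whole plugin id.
    """
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            pattern = prop.findtext("value", "").strip()
            return re.fullmatch(pattern, plugin_id) is not None
    # No plugin.includes property found in this file.
    return False


# Hypothetical sample, not the real configuration from this thread.
SAMPLE = """<configuration>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic</value>
</property>
</configuration>"""

print(plugin_enabled(SAMPLE))
```

To check a real file, read its text and pass it in, e.g. plugin_enabled(open("conf/nutch-site.xml").read()), where the path is an assumption about the local setup.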

Only the files under the URL I mentioned before cannot be indexed.

I guess it may be related to the type of URL, because the log shows that the
files were fetched but not indexed.

Can anybody help me?

Regards,

Gong Zhao



2008/7/24 Patrick Markiewicz <[EMAIL PROTECTED]>:

> Hi Gong Zhao,
>        Make sure you have the parse-pdf plugin enabled in your
> nutch-site.xml file.
> I.e.
> <property>
>  <name>plugin.includes</name>
>  <value>...|parse-(xml|text|html|js|pdf)|...</value>
>  <description>
>  </description>
> </property>
>
> That's the only thing I can think of at first glance.
>
> Patrick
> -----Original Message-----
> From: 宫照 [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 23, 2008 11:27 PM
> To: [email protected]
> Subject: nutch fetched but no indexed
>
> Hi everybody,
>
> I am facing a problem with Nutch. I use Nutch to crawl an intranet, and it
> worked well before. Recently, though, I added some URLs to crawl that are
> different from the usual ones. The new URLs look like this:
> http://compass.mydomain.com/go/247460034
>
> There are many folders and documents under this URL, for example a folder:
> http://compass.mot.com/go/247460034/2354342276
> documents:
> http://compass.mot.com/go/247460034/mydoc.pdf
>
> After crawling, the documents under this kind of URL cannot be searched.
> I checked the log and found that URLs of this kind are fetched, but
> they are not indexed.
>
> I don't know why. Can you tell me what to do?
>
> regards,
>
> Gong Zhao
>
