Hi Patrick,

Thank you for your advice.
My nutch-site.xml file is already set as you described, and I can search PDF files under other URLs. Only the files under the URL I mentioned before cannot be indexed. I suspect it has something to do with the type of URL, because the log shows they were fetched but not indexed.

Can anybody help me?

Regards,
Gong Zhao

2008/7/24 Patrick Markiewicz <[EMAIL PROTECTED]>:
> Hi Gong Zhao,
>         Make sure you have the parse-pdf plugin enabled in your
> nutch-site.xml file.
> I.e.
> <property>
>   <name>plugin.includes</name>
>   <value>...|parse-(xml|text|html|js|pdf)|...</value>
>   <description>
>   </description>
> </property>
>
> That's the only thing I can think of at first glance.
>
> Patrick
> -----Original Message-----
> From: 宫照 [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 23, 2008 11:27 PM
> To: [email protected]
> Subject: nutch fetched but no indexed
>
> Hi everybody,
>
> I have run into a problem using Nutch. I use Nutch to crawl our
> intranet, and it worked well before. Recently I added some URLs to
> crawl that are different from the usual ones. The new URLs look like
> this:
> http://compass.mydomain.com/go/247460034
>
> There are many folders and documents under this URL, for example:
> folder:
> http://compass.mot.com/go/247460034/2354342276
> documents:
> http://compass.mot.com/go/247460034/mydoc.pdf
>
> After the crawl, the documents under this kind of URL cannot be
> searched. I checked the log and found that these URLs are fetched
> during the crawl, but they are not indexed.
>
> I don't know why. Can you tell me what to do?
>
> Regards,
>
> Gong Zhao
>
