Re: nutch fetched but no indexed

wuqi Thu, 24 Jul 2008 19:54:05 -0700

This problem can't be diagnosed with a single command. Here are a few points 
that I hope will be helpful.


1. Why do you think the page is not indexed? Is it just that a search does not 
return it? You can use the Lucene index tool Luke to check whether the page is 
in the index.
2. If the page is not in the index, check its status in the crawldb. If it is 
db_fetched, then check whether it exists in the segment files.
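
For the crawldb and segment checks in point 2, something like the following 
shell sketch may help. It only wraps the readdb and readseg tools that ship 
with Nutch. The function names, the crawl directory layout (crawl/crawldb, 
crawl/segments) and the segdump output directory are assumptions for 
illustration, so adjust them to your own setup.

```shell
# Assumed locations; override them in the environment if yours differ.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
DUMP_DIR="${DUMP_DIR:-segdump}"

# Print the crawldb entry for one URL; the entry includes its status
# (for example db_fetched).
check_crawldb() {
  "$NUTCH" readdb "$CRAWL_DIR/crawldb" -url "$1"
}

# Dump each segment as text (without raw content) and list the dump files
# that mention the URL. Assumption: readseg -dump writes a text file named
# "dump" into the output directory it is given.
check_segments() {
  for seg in "$CRAWL_DIR"/segments/*; do
    [ -d "$seg" ] || continue
    out="$DUMP_DIR/$(basename "$seg")"
    "$NUTCH" readseg -dump "$seg" "$out" -nocontent
    grep -l -F "$1" "$out/dump" 2>/dev/null
  done
}
```

Call check_crawldb with the URL of the missing document first. If the status 
is db_fetched, call check_segments with the same URL and open any dump file it 
lists to see whether the entry for that URL has parse data and parse text.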



----- Original Message ----- 
From: "宫照" <[EMAIL PROTECTED]>
To: <[email protected]>; <[EMAIL PROTECTED]>
Sent: Friday, July 25, 2008 9:53 AM
Subject: Re: nutch fetched but no indexed


> Hi Patrick,
> 
> Thank you for your advice.
> 
> My nutch-site.xml file is already set up as you said, and I can search PDF
> files under other URLs.
> 
> Only the files under the URL I mentioned before cannot be indexed.
> 
> I guess it may be related to the type of URL, because the log shows the
> pages were fetched but not indexed.
> 
> Can anybody help me?
> 
> regards,
> 
> Gong Zhao
> 
> 
> 
> 2008/7/24 Patrick Markiewicz <[EMAIL PROTECTED]>:
> 
>> Hi Gong Zhao,
>>        Make sure you have the parse-pdf plugin enabled in your
>> nutch-site.xml file.
>> I.e.
>> <property>
>>  <name>plugin.includes</name>
>>  <value>...|parse-(xml|text|html|js|pdf)|...</value>
>>  <description>
>>  </description>
>> </property>
>>
>> That's the only thing I can think of at first glance.
>>
>> Patrick
>> -----Original Message-----
>> From: 宫照 [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, July 23, 2008 11:27 PM
>> To: [email protected]
>> Subject: nutch fetched but no indexed
>>
>> Hi everybody,
>>
>> I have a problem with Nutch. I use Nutch to crawl an intranet, and it
>> worked well before. Recently I added some URLs to crawl. These URLs are
>> different from the normal ones. The new URLs look like this:
>> http://compass.mydomain.com/go/247460034
>>
>> There are many folders and documents under this URL, such as the folder:
>> http://compass.mot.com/go/247460034/2354342276
>> documents:
>> http://compass.mot.com/go/247460034/mydoc.pdf
>>
>> After crawling, the documents under this kind of URL cannot be searched.
>> I checked the log and found that URLs of this kind are fetched, but they
>> are not indexed.
>>
>> I don't know why. Can you tell me what to do?
>>
>> regards,
>>
>> Gong Zhao
>>
>
