Hi,

Thank you, wuqi, for your help.

I checked the index with Luke and could not find the page.

I then imported the source into Eclipse and debugged it. I found an
exception here:
org.apache.nutch.indexer.DeleteDuplicates.java, line 439:
JobClient.runJob(job);

The exception is:
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at
org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

It is an exception from Hadoop.

If I use other URLs, everything is fine. The exception only occurs for some
special URLs.
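As a sketch of the checks wuqi suggests below (assuming a crawl directory
named "crawl" in the current directory; adjust the paths to your setup),
the URL status and segments can be inspected from the command line, and the
real cause of the failed job is usually in Nutch's Hadoop log:

```shell
# Look for the underlying cause of the failed job (Nutch logs Hadoop output here)
grep -iE "error|exception" logs/hadoop.log | tail

# Check the status of one of the problem URLs in the crawldb
bin/nutch readdb crawl/crawldb -url http://compass.mydomain.com/go/247460034/mydoc.pdf

# List the segments to see whether the page was actually fetched into one
bin/nutch readseg -list -dir crawl/segments
```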

Does anybody know the reason?

regards,

Gong Zhao



2008/7/25 wuqi <[EMAIL PROTECTED]>

> This problem can't be figured out with just a simple command. Here are a
> few points that I hope are helpful:
>
> 1. Why do you think the page is not indexed? Just because it can't be
> searched? You can use the Lucene index tool Luke to check whether the page
> is in the index.
> 2. If the page is not in the index, try checking its status in the
> crawldb; if it is db_fetched, then check whether it exists in the
> segment files.
>
>
>
> ----- Original Message -----
> From: "宫照" <[EMAIL PROTECTED]>
> To: <[email protected]>; <[EMAIL PROTECTED]>
> Sent: Friday, July 25, 2008 9:53 AM
> Subject: Re: nutch fetched but no indexed
>
>
> > Hi Patrick,
> >
> > Thank you for your advice.
> >
> > My nutch-site.xml file is already set as you said, and I can search PDF
> > files under other URLs.
> >
> > Only the files under the URL I mentioned before cannot be indexed.
> >
> > I guess it may be related to the type of the URLs, because from the log
> > we can see that they were fetched but not indexed.
> >
> > Can anybody help me?
> >
> > regards,
> >
> > Gong Zhao
> >
> >
> >
> > 2008/7/24 Patrick Markiewicz <[EMAIL PROTECTED]>:
> >
> >> Hi Gong Zhao,
> >>        Make sure you have the parse-pdf plugin enabled in your
> >> nutch-site.xml file.
> >> For example:
> >> <property>
> >>  <name>plugin.includes</name>
> >>  <value>...|parse-(xml|text|html|js|pdf)|...</value>
> >>  <description>
> >>  </description>
> >> </property>
> >>
> >> That's the only thing I can think of at first glance.
> >>
> >> Patrick
> >> -----Original Message-----
> >> From: 宫照 [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, July 23, 2008 11:27 PM
> >> To: [email protected]
> >> Subject: nutch fetched but no indexed
> >>
> >> Hi everybody,
> >>
> >> I am facing a problem when using Nutch. I use Nutch to crawl an
> >> intranet. It worked well before, but recently I added some URLs to
> >> crawl. These URLs are different from the normal ones. The new URLs look
> >> like this:
> >> http://compass.mydomain.com/go/247460034
> >>
> >> There are many folders and documents under this URL, such as the folder:
> >> http://compass.mot.com/go/247460034/2354342276
> >> and documents such as:
> >> http://compass.mot.com/go/247460034/mydoc.pdf
> >>
> >> After the crawl, the documents under this kind of URL cannot be searched.
> >> When I check the log, I find that these URLs are fetched during
> >> crawling, but they were not indexed.
> >>
> >> I don't know why. Can you tell me what to do?
> >>
> >> regards,
> >>
> >> Gong Zhao
> >>
> >
>
