Hi,
Looking at your website, I can see two different kinds of results:

- For example, the first hit,
http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf, has no summary, and
its cache link produces the problem.

- The third hit belongs to the second group: these hits have a summary, and
their cache link works fine.

So it looks like Nutch can't access the content of the hits in the first
group. Maybe the parse-pdf plugin can't handle this PDF, which can happen.
That would also explain why the title of each first-group hit is the URL and
not the title stored inside the PDF document.

If I were you, I would crawl only the first hit (
http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf ) and look at the
log file. If parse-pdf can't handle this document, you will see a big ERROR
message.
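
A rough sketch of that log check, in case it helps. It assumes the log is a
plain-text file with one entry per line and the level (ERROR) written in each
entry. The function name, the "pdf" keyword, and the log path in the usage
note are my own assumptions, not something Nutch prescribes:

```python
# Hypothetical helper for scanning a crawl log; nothing here is Nutch API.
def find_parse_errors(log_path, keyword="pdf"):
    """Return the log lines that contain ERROR and mention the keyword."""
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            # Level match is case-sensitive; keyword match is not.
            if "ERROR" in line and keyword.lower() in line.lower():
                hits.append(line.rstrip("\n"))
    return hits
```

Calling something like find_parse_errors("logs/hadoop.log") (adjust the path
to wherever your crawl writes its log) and printing the result should show
quickly whether the PDF parser complained about that document.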

Hope it helps.

Alvaro C.

2006/9/14, Jacob Brunson <[EMAIL PROTECTED]>:

>
> I don't know if I completely understand your email.
> What do you mean by "cache"?

If you use the standard search results page, there is a link to a
cached copy of each page. If the page was HTML, there are no
problems. However, if the page was binary, the link returns an HTTP 500
internal server error.

You can see this if you click on the "cached" link of any of the pdf
documents in the search results on my search engine:
http://ldssearch.com/search.jsp?lang=en&query=pdf


>
> steven shingler wrote:
> > Hi all,
> >
> > I'm trying to find out which filetypes nutch will cache.
> >
> > for example: it does html, but not pdf.
> >
> > Is there any documentation on how different filetypes are handled?
> >
> > Is it possible to configure nutch to cache pdfs etc?
> >
> > Any advice very gratefully received.
> > Thanks,
> > Steve
> >
> >
> >
>


--
http://JacobBrunson.com
