Hi Steven
I'm not sure I completely understand your email.
What do you mean by "cache"?
If you want to crawl PDFs, you need to remove the URL filter that excludes them.
Your crawl-urlfilter.txt should have a line starting with a minus sign,
followed by a list of file extensions. Delete the pdf extension from that list.
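To illustrate the edit described above, here is a minimal Python sketch of what removing one extension from such a line does. The rule in EXAMPLE_RULE is a made-up example of the kind of line I mean, not the actual contents of your crawl-urlfilter.txt, and the helper names drop_extension and is_skipped are hypothetical; they are not part of Nutch. The sketch only assumes that a leading minus means "skip URLs matching this regular expression".

```python
import re

# Hypothetical exclusion rule (an assumption for illustration only):
# a leading "-" means "skip any URL that matches the regex after it".
EXAMPLE_RULE = r"-\.(gif|jpg|png|css|zip|pdf|exe)$"


def drop_extension(rule: str, ext: str) -> str:
    """Return the rule with `ext` removed from its (a|b|c) extension list."""
    match = re.search(r"\(([^()]*)\)", rule)
    if match is None:
        return rule  # no extension list found; leave the line untouched
    kept = [e for e in match.group(1).split("|") if e.lower() != ext.lower()]
    return rule[:match.start(1)] + "|".join(kept) + rule[match.end(1):]


def is_skipped(rule: str, url: str) -> bool:
    """True if an exclusion rule (leading '-') matches the URL."""
    return rule.startswith("-") and re.search(rule[1:], url) is not None


if __name__ == "__main__":
    new_rule = drop_extension(EXAMPLE_RULE, "pdf")
    print(new_rule)
    print(is_skipped(EXAMPLE_RULE, "http://example.com/doc.pdf"))
    print(is_skipped(new_rule, "http://example.com/doc.pdf"))
```

In practice you would just edit the line by hand in a text editor; the code is only there to show that, once pdf is gone from the list, a URL ending in .pdf is no longer excluded by that rule.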
Good luck
Ernesto.
PS: I'm a Nutch beginner, but since nobody had replied to you, I'm trying to
help.
steven shingler wrote:
Hi all,
I'm trying to find out which file types Nutch will cache.
For example, it caches HTML but not PDF.
Is there any documentation on how different file types are handled?
Is it possible to configure Nutch to cache PDFs, etc.?
Any advice very gratefully received.
Thanks,
Steve