Hi Steven

I'm not sure I completely understand your email.
What do you mean by "cache"?

If you want to crawl PDFs, you need to remove the URL filter that excludes them.

In your crawl-urlfilter.txt, you should have a line starting with a minus 
sign, followed by a list of file extensions. Delete the pdf extension from that list.
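
To illustrate what such a line does, here is a small Python sketch. It only mimics the idea of a "-" rule and is not Nutch's own code; the extension lists and the helper name is_excluded are assumptions I made up for the example, so check them against your actual file.

```python
import re

# Hypothetical rules in the style of crawl-urlfilter.txt (not copied from
# a real Nutch install): a leading "-" means "skip URLs matching this regex".
RULE_WITH_PDF = r"-\.(gif|jpg|png|css|zip|exe|pdf)$"
RULE_WITHOUT_PDF = r"-\.(gif|jpg|png|css|zip|exe)$"


def is_excluded(url, rule):
    """Return True if the given '-' rule would reject the URL."""
    if not rule.startswith("-"):
        raise ValueError("expected an exclusion rule starting with '-'")
    # Drop the leading "-" and search for the pattern anywhere in the URL.
    return re.search(rule[1:], url) is not None


if __name__ == "__main__":
    url = "http://example.com/report.pdf"
    print(is_excluded(url, RULE_WITH_PDF))     # rule still lists pdf
    print(is_excluded(url, RULE_WITHOUT_PDF))  # pdf removed from the list
```

With pdf in the list the URL is rejected before it is ever fetched; once pdf is removed from the list, the same URL passes this rule.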

Good luck
Ernesto.
PS: I'm a Nutch beginner, but since nobody had answered you, I'm trying to 
help.


steven shingler wrote:
> Hi all,
>
> I'm trying to find out which filetypes nutch will cache.
>
> for example: it does html, but not pdf.
>
> Is there any documentation on how different filetypes are handled?
>
> Is it possible to configure nutch to cache pdfs etc?
>
> Any advice very gratefully received.
> Thanks,
> Steve
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
