Hi Sudhendra,

I used the same configuration you suggested, in
nutch-site.xml.

I ran a test and, after looking at the fetch log, found
the following error message:

"
fetch okay, but can't parse
http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
reason: failed(2,203): Content-Type not text/html:
application/pdf
"

Does that mean the PDF was downloaded but was not parsed
successfully? So we can't search for words inside the PDF
file directly?

Thanks,

Michael

By the way, I am using Nutch 0.7 for this testing.
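
For what it's worth, here is a small sketch for checking whether a
plugin.includes value selects a given plugin directory name such as
parse-pdf. It assumes the expression must match the whole directory name,
which I have not confirmed against the Nutch source. The helper name
is_included is made up for illustration, and the pattern is the one quoted
below with the doubled "||" reduced to a single "|". It only tests the
regular expression. It does not tell you whether Nutch actually loaded
the plugin or which parser handled a given content type.

```python
import re

# plugin.includes value from this thread, with "||" reduced to "|".
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)"
    "|index-basic|query-(basic|site|url|jobs)"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin directory name matches the pattern.

    Assumption (not verified against Nutch): the name has to match the
    expression in full, not just contain a matching substring.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    # Print which of a few candidate plugin names the pattern selects.
    for name in ("parse-pdf", "parse-msword", "parse-html", "parse-rtf"):
        print(name, is_included(name))
```

If parse-pdf is selected by the pattern and the log still reports
"Content-Type not text/html", the next thing to check is probably whether
the edited file is the one actually being read.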



--- sudhendra seshachala <[EMAIL PROTECTED]> wrote:

> In nutch-default.xml,
> include the plugins for Word and PDF as below.
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
>   <description>Regular expression naming plugin
> directory names to
>   include.  Any plugin not matching this expression
> is excluded.
>   In any case you need at least include the
> nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and
> plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> But the recommendation is to put the property in
> nutch-site.xml.
> 
> Hope this helps.
> 
> Michael Ji <[EMAIL PROTECTED]> wrote: 
> hi there,
> 
> Is there any specific setting that needs to be added to the
> configuration file in order to crawl and index PDF
> and
> Word files?
> 
> thanks,
> 
> Michael,
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
