I believe it can.
Check your configuration files, nutch-site.xml and nutch-default.xml.

You will find something like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|swf|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Add "msword" to the list of parsers in that value, that is, change
parse-(text|html|swf|pdf)
to
parse-(text|html|swf|pdf|msword)
and leave the rest of the value as it is.
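
For illustration, the change could be made as an override in nutch-site.xml, since settings there take precedence over nutch-default.xml. This is only a sketch: it assumes your plugin list is the default one quoted above, and that the property sits inside the file's existing <configuration> element. If your list differs, keep your own value and just add msword to the parse-(...) group.

```xml
<!-- nutch-site.xml: overrides the plugin.includes value from nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Same list as the default quoted above; only "msword" has been added to the parse-(...) group -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|swf|pdf|msword)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Documents fetched before the change were handled without the Word parser, so they would probably need to be fetched and parsed again before their content becomes searchable.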

There is a plugin in the plugins folder, parse-msword,
which parses MS Word documents.

I have not tried it myself so far.

Jair Piedrahita Vargas wrote:
> Can Nutch search inside the content of an msword file? I've tried, but it 
> says "parser not found for contentType=application/msword"
> What can I do to correct this Error?
>
> Thanks
>
> JAIR PIEDRAHITA VARGAS
