Hi there,

I am having trouble configuring Nutch 0.7 so that it
crawls and queries MS Word and PDF files correctly.

1.
I added the following lines to nutch-site.xml:

"
<!-- plugin properties -->
<property>
  <name>plugin.includes</name>
  <value>

nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword)|
index-(basic|pdf|msword)|
query-(basic|site|url|pdf|msword)

</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
"

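One thing that can be checked outside Nutch is which plugin directory names the plugin.includes expression actually matches. The following is only a minimal sketch in Python: it assumes the value is used as a whole-name regular expression exactly as written, with the line breaks inside <value> kept (those may also have been introduced by my mail client), and it uses Python's re module as a stand-in for the Java regex engine. The list of plugin names is just the ones relevant to this post.

```python
import re

# The <value> text as it appears in the post, including the line breaks
# and blank lines inside the element.
RAW_VALUE = """

nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword)|
index-(basic|pdf|msword)|
query-(basic|site|url|pdf|msword)

"""

# Plugin directory names to test (chosen for illustration).
PLUGIN_IDS = [
    "nutch-extensionpoints",
    "protocol-httpclient",
    "urlfilter-regex",
    "parse-pdf",
    "parse-msword",
    "index-basic",
    "query-basic",
]


def included(pattern_text, plugin_id):
    """Return True if the whole plugin id matches the pattern."""
    return re.fullmatch(pattern_text, plugin_id) is not None


def report(pattern_text):
    """Map each plugin id to whether the pattern includes it."""
    return {pid: included(pattern_text, pid) for pid in PLUGIN_IDS}


# Assumption: the value is compiled exactly as written, whitespace included.
as_written = report(RAW_VALUE)

# Same value with all whitespace removed, i.e. written on a single line.
single_line = report("".join(RAW_VALUE.split()))

for pid in PLUGIN_IDS:
    print(f"{pid:24} as written: {as_written[pid]!s:6} single line: {single_line[pid]}")
```

Under that assumption, an alternative that begins on a new line only matches a name that starts with a line break, so writing the whole value on one line is a cheap thing to try. Whether Nutch trims the value before compiling it is something I have not verified.
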
2.
I checked regex-urlfilter.txt and confirmed that it does
not exclude PDF or MS Word files.
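
To double-check this kind of filter by hand, the rules can be replayed against a few sample URLs. This is a rough sketch, assuming the usual regex-urlfilter.txt convention: each rule starts with "+" (accept) or "-" (reject), the first rule whose pattern is found in the URL decides, and a URL matching no rule is rejected. The rules and URLs below are hypothetical examples, not the contents of my file.

```python
import re

# Hypothetical rules in the regex-urlfilter.txt format (illustrative only).
SAMPLE_RULES = r"""
# skip some image and archive suffixes (note: no pdf or doc here)
-\.(gif|jpg|ico|css|zip|exe)$
# accept everything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL wins; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(SAMPLE_RULES)
for url in (
    "http://example.com/docs/report.pdf",
    "http://example.com/docs/letter.doc",
    "http://example.com/logo.gif",
):
    print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

Pasting the real rule lines into SAMPLE_RULES shows whether any earlier rule catches .pdf or .doc URLs before the final accept rule is reached.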

3.
I checked mime-type.xml; all the settings for PDF and
MS Word are there.

4.
I checked the Nutch fetch log; the PDF and MS Word plugins
are loaded correctly, as shown below:

"060327 204736 parsing:
C:\cygwin\jifeng\versionControl\new_dev\nutch_V07_CNI_Alfa\nutch\build\plugins\parse-msword\plugin.xml
060327 204736 impl:
point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.msword.MSWordParser
060327 204736 parsing:
C:\cygwin\jifeng\versionControl\new_dev\nutch_V07_CNI_Alfa\nutch\build\plugins\parse-pdf\plugin.xml
060327 204736 impl:
point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.pdf.PdfParser
"

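For a longer log it can help to pull out every registered implementation instead of reading it by eye. A small sketch, assuming the "impl: point=... class=..." layout seen in the excerpt above; it tolerates the line wrapping added by the mail client, and the excerpt is copied from this post.

```python
import re

# The two "impl:" entries from the log excerpt in this post, wrapped as mailed.
LOG_EXCERPT = """\
060327 204736 impl:
point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.msword.MSWordParser
060327 204736 impl:
point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.pdf.PdfParser
"""


def registered_impls(log_text):
    """Return (extension point, class) pairs found in the log text."""
    # \s+ matches a space or a line break, so wrapped and unwrapped logs both work.
    return re.findall(r"point=(\S+)\s+class=(\S+)", log_text)


impls = registered_impls(LOG_EXCERPT)
parser_classes = [cls for point, cls in impls
                  if point == "org.apache.nutch.parse.Parser"]

for cls in parser_classes:
    print("parser registered:", cls)
```

Run over the complete log, the same function would also list any indexing or query implementations that were registered. The excerpt above only confirms the two parser classes.
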
I wonder whether I am still missing something in the
configuration.

Thanks,

Michael
