Hi,
Thanks for your reply. I'm running the crawl again now with that expression
added in, so it will be interesting to see the results later.
I also changed the protocol to <value>protocol-(http|https|ftp)|parse ....
to try and pick up ftp and https sites as well. However it looks as if https
doesn't work:
050610 082219 fetch of http://my.bp.com/login.do failed with:
org.apache.nutch.protocol.http.HttpException: Not an HTTP
url:https://my.bp.com/password/redirect.jsp
Have I configured this wrong, or is SSL support not added yet?
JS.
First, sorry for my English.
You should look at conf/nutch-default.xml:
<property>
<name>plugin.includes</name>
<value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
By default Nutch only handles text and HTML files, so you need to
override this property in conf/nutch-site.xml, for example to parse
MS Word, PDF and RTF documents as well:
<property>
<name>plugin.includes</name>
<value>protocol-http|parse-(text|html|msword|pdf|rtf)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
</nutch-conf>
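
The property description above says the value is a regular expression over plugin directory names, and anything that does not match is excluded. A minimal Python sketch of that selection follows. It is not Nutch code: the directory list is hypothetical and built only from names that appear in the patterns quoted in this thread, and whole-name matching is an assumption.

```python
import re

# Hypothetical plugin directory names, taken only from the patterns
# quoted in this thread (not a listing of a real Nutch install).
PLUGIN_DIRS = [
    "protocol-http", "protocol-ftp",
    "parse-text", "parse-html", "parse-msword", "parse-pdf", "parse-rtf",
    "index-basic", "query-basic", "query-site", "query-url",
]

# The default value and the overridden value shown above.
DEFAULT = "protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)"
WITH_DOCS = ("protocol-http|parse-(text|html|msword|pdf|rtf)"
             "|index-basic|query-(basic|site|url)")


def included_plugins(pattern, names=PLUGIN_DIRS):
    """Return the names matched in full by the pattern (assumed semantics)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


if __name__ == "__main__":
    print(included_plugins(DEFAULT))
    print(included_plugins(WITH_DOCS))
```

Under these assumptions, parse-msword is selected only by the second pattern, and a directory such as protocol-ftp stays excluded until the protocol part of the expression names it.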
On 6/9/05, J S <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Complete newbie here so sorry if this is a silly question! I was wondering
> about the following message in the crawl.log I have:
>
> 050609 221715 fetch okay, but can't parse
> http://planet.bp.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/BAKC+10/$FILE/No+10.+Customer+Compensation.doc,
> reason: Content-Type not text/html: application/msword
>
> Would my search be more efficient if I turned on the plugin to parse
> microsoft word docs? If so, how do I turn the plugin on?
>
> Thanks for any help,
>
> JS.