Jérôme Charron wrote:
Following is the output from the fetcher and the headers from the Firefox
Web Developer toolbar.
I'd appreciate any thoughts. Perhaps something for parser policy. I've
traced the source code a bit and nothing jumped out at me...
Could you provide your plugin configuration and the Nutch startup logs?
Jérôme
Jerome,
See below.
--
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword|rss|ext)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
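
For anyone checking the include list against the startup log below, here is a minimal sketch of how the plugin.includes expression selects plugin directories. It assumes the whole directory name must match the expression (not just a substring), which agrees with the log reporting protocol-http as "not including" while protocol-httpclient is parsed. The helper name is_included is hypothetical and is not part of Nutch.

```python
import re

# Value of plugin.includes, copied from the configuration above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|pdf|msword|rss|ext)|"
    "index-basic|query-(basic|site|url)"
)


def is_included(plugin_dir, pattern=PLUGIN_INCLUDES):
    """Return True if the plugin directory name matches the expression.

    Assumption: the entire name must match, so "protocol-http" is not
    accepted merely because it is a prefix of "protocol-httpclient".
    """
    return re.fullmatch(pattern, plugin_dir) is not None


if __name__ == "__main__":
    # Directory names taken from the startup log.
    for name in ["protocol-httpclient", "protocol-http", "parse-pdf",
                 "parse-js", "index-more", "urlfilter-prefix"]:
        status = "parsing" if is_included(name) else "not including"
        print(status + ": " + name)
```

Under that assumption, enabling another plugin (parse-js, say) would mean adding its directory name to the alternation in the value element.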
--
050923 020323 parsing file:/usr/local/nutch/conf/nutch-default.xml
050923 020323 parsing file:/usr/local/nutch/conf/nutch-site.xml
050923 020323 No FS indicated, using default:local
050923 020323 Plugins: looking in: /usr/local/nutch/plugins
050923 020323 not including: /usr/local/nutch/plugins/protocol-ftp
050923 020323 not including: /usr/local/nutch/plugins/urlfilter-prefix
050923 020323 parsing: /usr/local/nutch/plugins/parse-text/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
050923 020323 not including: /usr/local/nutch/plugins/ontology
050923 020323 parsing: /usr/local/nutch/plugins/parse-ext/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.ext.ExtParser
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.ext.ExtParser
050923 020323 parsing: /usr/local/nutch/plugins/parse-rss/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.rss.RSSParser
050923 020323 parsing: /usr/local/nutch/plugins/protocol-httpclient/plugin.xml
050923 020323 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.httpclient.Http
050923 020323 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.httpclient.Http
050923 020323 parsing: /usr/local/nutch/plugins/parse-pdf/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.pdf.PdfParser
050923 020323 not including: /usr/local/nutch/plugins/creativecommons
050923 020323 parsing: /usr/local/nutch/plugins/parse-html/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
050923 020323 parsing: /usr/local/nutch/plugins/parse-msword/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.msword.MSWordParser
050923 020323 parsing: /usr/local/nutch/plugins/query-basic/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
050923 020323 not including: /usr/local/nutch/plugins/protocol-http
050923 020323 not including: /usr/local/nutch/plugins/index-more
050923 020323 not including: /usr/local/nutch/plugins/query-more
050923 020323 not including: /usr/local/nutch/plugins/parse-js
050923 020323 parsing: /usr/local/nutch/plugins/index-basic/plugin.xml
050923 020323 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
050923 020323 not including: /usr/local/nutch/plugins/language-identifier
050923 020323 parsing: /usr/local/nutch/plugins/query-site/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
050923 020323 not including: /usr/local/nutch/plugins/clustering-carrot2
050923 020323 not including: /usr/local/nutch/plugins/protocol-file
050923 020323 parsing: /usr/local/nutch/plugins/urlfilter-regex/plugin.xml
050923 020323 impl: point=org.apache.nutch.net.URLFilter
class=org.apache.nutch.net.RegexURLFilter
050923 020323 parsing: /usr/local/nutch/plugins/query-url/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
050923 020323 logging at INFO
050923 020323 fetching http://vet.osu.edu/assets/courses/vm602/quotes/quote46.html
050923 020323 fetching http://vet.osu.edu/assets/courses/vm562/muir/sedatives.pdf
050923 020323 http.proxy.host = null
050923 020323 http.proxy.port = 8080
050923 020323 http.timeout = 10000
050923 020323 http.content.limit = 7168000
050923 020323 http.agent = Nutch/0.7 ( nutch; http://xxxxxxx, [EMAIL PROTECTED])
050923 020323 http.auth.ntlm.username =
050923 020323 fetcher.server.delay = 3000
050923 020323 http.max.delays = 10
050923 020324 Configured Client