Jérôme Charron wrote:

    Following are the output from the fetcher and the headers from the Firefox
    Web Developer toolbar.

    I'd appreciate any thoughts.  Perhaps something for parser policy.  I've
    traced the source code a bit and nothing jumped out at me...

Could you provide your plugin configuration and the Nutch startup logs?

Jérôme

Jerome,

  See below.

--

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword|rss|ext)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
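
As a quick sanity check on the expression above, here is a minimal sketch that tests a handful of plugin directory names from the startup log below against the plugin.includes value. It assumes the whole directory name must match the expression (full-match semantics), which is an assumption about how Nutch applies it, not something confirmed here; the helper name is_included is made up for illustration.

```python
import re

# Value of plugin.includes, copied from the <value> element above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|pdf|msword|rss|ext)|index-basic|query-(basic|site|url)"
)

# A sample of plugin directory names that appear in the startup log.
PLUGIN_DIRS = [
    "protocol-httpclient", "protocol-http", "protocol-ftp",
    "parse-pdf", "parse-js", "index-basic", "index-more",
    "query-url", "query-more", "urlfilter-regex", "urlfilter-prefix",
]


def is_included(name, pattern=PLUGIN_INCLUDES):
    """Return True if the directory name matches the whole expression.

    Assumption: full-match semantics (hypothetical helper, not Nutch code).
    """
    return re.fullmatch(pattern, name) is not None


included = [d for d in PLUGIN_DIRS if is_included(d)]
excluded = [d for d in PLUGIN_DIRS if not is_included(d)]

print("included:", included)
print("excluded:", excluded)
```

Under that assumption the result agrees with the log: parse-pdf and protocol-httpclient are parsed, while protocol-http, parse-js, index-more and query-more are reported as "not including".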


--

050923 020323 parsing file:/usr/local/nutch/conf/nutch-default.xml
050923 020323 parsing file:/usr/local/nutch/conf/nutch-site.xml
050923 020323 No FS indicated, using default:local
050923 020323 Plugins: looking in: /usr/local/nutch/plugins
050923 020323 not including: /usr/local/nutch/plugins/protocol-ftp
050923 020323 not including: /usr/local/nutch/plugins/urlfilter-prefix
050923 020323 parsing: /usr/local/nutch/plugins/parse-text/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
050923 020323 not including: /usr/local/nutch/plugins/ontology
050923 020323 parsing: /usr/local/nutch/plugins/parse-ext/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.ext.ExtParser
050923 020323 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.ext.ExtParser
050923 020323 parsing: /usr/local/nutch/plugins/parse-rss/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.rss.RSSParser
050923 020323 parsing: /usr/local/nutch/plugins/protocol-httpclient/plugin.xml
050923 020323 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
050923 020323 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.httpclient.Http
050923 020323 parsing: /usr/local/nutch/plugins/parse-pdf/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.pdf.PdfParser
050923 020323 not including: /usr/local/nutch/plugins/creativecommons
050923 020323 parsing: /usr/local/nutch/plugins/parse-html/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
050923 020323 parsing: /usr/local/nutch/plugins/parse-msword/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.msword.MSWordParser
050923 020323 parsing: /usr/local/nutch/plugins/query-basic/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
050923 020323 not including: /usr/local/nutch/plugins/protocol-http
050923 020323 not including: /usr/local/nutch/plugins/index-more
050923 020323 not including: /usr/local/nutch/plugins/query-more
050923 020323 not including: /usr/local/nutch/plugins/parse-js
050923 020323 parsing: /usr/local/nutch/plugins/index-basic/plugin.xml
050923 020323 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
050923 020323 not including: /usr/local/nutch/plugins/language-identifier
050923 020323 parsing: /usr/local/nutch/plugins/query-site/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
050923 020323 not including: /usr/local/nutch/plugins/clustering-carrot2
050923 020323 not including: /usr/local/nutch/plugins/protocol-file
050923 020323 parsing: /usr/local/nutch/plugins/urlfilter-regex/plugin.xml
050923 020323 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
050923 020323 parsing: /usr/local/nutch/plugins/query-url/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
050923 020323 logging at INFO
050923 020323 fetching http://vet.osu.edu/assets/courses/vm602/quotes/quote46.html
050923 020323 fetching http://vet.osu.edu/assets/courses/vm562/muir/sedatives.pdf
050923 020323 http.proxy.host = null
050923 020323 http.proxy.port = 8080
050923 020323 http.timeout = 10000
050923 020323 http.content.limit = 7168000
050923 020323 http.agent = Nutch/0.7 ( nutch; http://xxxxxxx, [EMAIL PROTECTED])
050923 020323 http.auth.ntlm.username =
050923 020323 fetcher.server.delay = 3000
050923 020323 http.max.delays = 10
050923 020324 Configured Client
