Dear Developers,
I tested Nutch 0.7 with all the parser plugins enabled and found the
following:
-------------------------------------------------------------------------
The fetch breaks with errors such as the following:
-------------------------------------------------------------------------
050901 110915 fetch okay, but can't parse
http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc,
reason: failed
(2,200): org.apache.nutch.parse.msword.FastSavedException:
Fast-saved files are unsupported at this time
050901 110915 fetching http://en.mimi.hu/fishing/scad.html
050901 110917 SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110917 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
etc.
---------------------------------------------------------------------------
Here are the differences between my nutch-site.xml and
nutch-default.xml:
---------------------------------------------------------------------------
***** nutch-default.xml
<name>http.timeout</name>
<value>10000</value>
<description>The default network timeout, in
milliseconds.</description>
***** NUTCH-SITE.XML
<name>http.timeout</name>
<value>30000</value>
<description>The default network timeout, in
milliseconds.</description>
*****
***** nutch-default.xml
<name>http.max.delays</name>
<value>3</value>
<description>The number of times a thread will delay when trying to
***** NUTCH-SITE.XML
<name>http.max.delays</name>
<value>6</value>
<description>The number of times a thread will delay when trying to
*****
***** nutch-default.xml
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
<name>http.content.limit</name>
<value>130000</value>
<description>The length limit for downloaded content, in bytes.
*****
***** nutch-default.xml
<name>file.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
<name>file.content.limit</name>
<value>130000</value>
<description>The length limit for downloaded content, in bytes.
*****
***** nutch-default.xml
<name>ftp.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
<name>ftp.content.limit</name>
<value>130000</value>
<description>The length limit for downloaded content, in bytes.
*****
***** nutch-default.xml
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for
a page.
***** NUTCH-SITE.XML
<name>db.max.outlinks.per.page</name>
<value>200</value>
<description>The maximum number of outlinks that we'll process for
a page.
*****
***** nutch-default.xml
<name>db.fetch.retry.max</name>
<value>3</value>
<description>The maximum number of times a url that has encountered
***** NUTCH-SITE.XML
<name>db.fetch.retry.max</name>
<value>6</value>
<description>The maximum number of times a url that has encountered
*****
***** nutch-default.xml
<name>fetcher.server.delay</name>
<value>5.0</value>
<description>The number of seconds the fetcher will delay between
***** NUTCH-SITE.XML
<name>fetcher.server.delay</name>
<value>30.0</value>
<description>The number of seconds the fetcher will delay between
*****
***** nutch-default.xml
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
***** NUTCH-SITE.XML
<name>fetcher.threads.fetch</name>
<value>100</value>
<description>The number of FetcherThreads the fetcher should use.
*****
***** nutch-default.xml
<name>fetcher.threads.per.host</name>
<value>1</value>
<description>This number is the maximum number of threads that
***** NUTCH-SITE.XML
<name>fetcher.threads.per.host</name>
<value>100</value>
<description>This number is the maximum number of threads that
*****
***** nutch-default.xml
<name>parser.threads.parse</name>
<value>10</value>
<description>Number of ParserThreads ParseSegment should
use.</description>
***** NUTCH-SITE.XML
<name>parser.threads.parse</name>
<value>100</value>
<description>Number of ParserThreads ParseSegment should
use.</description>
*****
***** nutch-default.xml
<name>indexer.minMergeDocs</name>
<value>50</value>
<description>This number determines the minimum number of Lucene
***** NUTCH-SITE.XML
<name>indexer.minMergeDocs</name>
<value>10000</value>
<description>This number determines the minimum number of Lucene
*****
***** nutch-default.xml
<name>indexer.maxMergeDocs</name>
<value>50</value>
<description>This number determines the maximum number of Lucene
***** NUTCH-SITE.XML
<name>indexer.maxMergeDocs</name>
<value>10000000</value>
<description>This number determines the maximum number of Lucene
*****
***** nutch-default.xml
<name>searcher.dir</name>
<value>.</value>
<description>
***** NUTCH-SITE.XML
<name>searcher.dir</name>
<value>/srv/db/</value>
<description>
*****
***** nutch-default.xml
<name>ipc.client.timeout</name>
<value>10000</value>
<description>Defines the timeout for IPC calls in milliseconds.
</description>
***** NUTCH-SITE.XML
<name>ipc.client.timeout</name>
<value>20000</value>
<description>Defines the timeout for IPC calls in milliseconds.
</description>
*****
***** nutch-default.xml
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
***** NUTCH-SITE.XML
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-basic|more|site|url)</value>
<description>Regular expression naming plugin directory names to
*****
***** nutch-default.xml
<name>parser.character.encoding.default</name>
<value>windows-1252</value>
<description>The character encoding to fall back to when no other
information
***** NUTCH-SITE.XML
<name>parser.character.encoding.default</name>
<value>iso-8859-2</value>
<description>The character encoding to fall back to when no other
information
*****
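Each pair above is only an excerpt from a <property> element. As a rough sketch (not a verbatim copy of my file), two of the overrides would look like this in nutch-site.xml. The <nutch-conf> root element is an assumption here and should match whatever root element nutch-default.xml uses; the names, values and descriptions are taken from the listing above.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the <nutch-conf> root element is an assumption; use the
     root element found in your own nutch-default.xml. -->
<nutch-conf>

  <!-- Overrides the nutch-default.xml value of 10000. -->
  <property>
    <name>http.timeout</name>
    <value>30000</value>
    <description>The default network timeout, in milliseconds.</description>
  </property>

  <!-- Overrides the nutch-default.xml value of 10. -->
  <property>
    <name>parser.threads.parse</name>
    <value>100</value>
    <description>Number of ParserThreads ParseSegment should use.</description>
  </property>

</nutch-conf>
```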
Any idea what the source of the problem is?
Best regards,
Ferenc