Just for the mail archives: please see also NUTCH-89.

Thread closed?

Michael



[EMAIL PROTECTED] wrote:

Hi Michael,

I am going back to a nightly build.
I think this problem is related to the 'fetcher.threads.per.host' value when it is greater than 1. Other possible sources are fetcher.threads.fetch and parser.threads.parse.

Best Regards,
   Ferenc

Hi Ferenc,

I see the same errors. Since I saw a running installation yesterday, I think it's a configuration mistake. So far I have no idea where. Have you made any progress?

Regards

    Michael


[EMAIL PROTECTED] wrote:

Dear Developers!

I tested nutch 0.7 with all the parser plugins, and found the following:

-------------------------------------------------------------------------
The fetch breaks with errors such as the following:
-------------------------------------------------------------------------
050901 110915 fetch okay, but can't parse http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, reason: failed (2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved files are unsupported at this time
050901 110915 fetching http://en.mimi.hu/fishing/scad.html
050901 110917 SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
       at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
       at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
       at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
       at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110917 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
       at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
       at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
       at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
       at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
       at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
       at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
       at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
       at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
       at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
       at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
       at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
       at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
       at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
       at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
       at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
etc.
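
One possible reading of the trace above, offered as an assumption and not as a confirmed diagnosis: the NullPointerException is thrown after the key order check has already accepted key 319 but before the writer's entry counter advances, so every later append reuses 319 and is rejected. The sketch below is a simplified model of that sequence. The class IndexedWriter and its fields are hypothetical and are not the Nutch MapFile or ArrayFile code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class KeyOrderDemo {

    /** Hypothetical stand-in for a writer that numbers its entries; not Nutch code. */
    static class IndexedWriter {
        private long count;      // key that the next append will use
        private long lastKey;    // last key accepted by the order check
        private final List<String> data = new ArrayList<>();

        IndexedWriter(long start) {
            this.count = start;
            this.lastKey = start - 1;
        }

        synchronized void append(String value) throws IOException {
            checkKey(count);          // the order check runs first and records the key
            data.add(value.trim());   // throws NullPointerException when value is null
            count++;                  // assumption: skipped when the write above fails
        }

        private void checkKey(long key) throws IOException {
            if (key <= lastKey) {
                throw new IOException("key out of order: " + key + " after " + lastKey);
            }
            lastKey = key;
        }
    }

    /** Appends one bad value followed by two good ones and collects the errors. */
    static List<String> simulate() {
        IndexedWriter writer = new IndexedWriter(319);
        List<String> errors = new ArrayList<>();
        String[] values = { null, "page A", "page B" };
        for (String value : values) {
            try {
                writer.append(value);
            } catch (NullPointerException e) {
                errors.add("NullPointerException");
            } catch (IOException e) {
                errors.add(e.getMessage());
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In this model a single failed write is enough to make every following append fail with the same message, which matches the shape of the log. It does not say why the value was null in the first place.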

---------------------------------------------------------------------------
These are the differences between nutch-site.xml and nutch-default.xml:
---------------------------------------------------------------------------
***** nutch-default.xml
 <name>http.timeout</name>
 <value>10000</value>
<description>The default network timeout, in milliseconds.</description>
***** NUTCH-SITE.XML
 <name>http.timeout</name>
 <value>30000</value>
<description>The default network timeout, in milliseconds.</description>
*****

***** nutch-default.xml
 <name>http.max.delays</name>
 <value>3</value>
 <description>The number of times a thread will delay when trying to
***** NUTCH-SITE.XML
 <name>http.max.delays</name>
 <value>6</value>
 <description>The number of times a thread will delay when trying to
*****

***** nutch-default.xml
 <name>http.content.limit</name>
 <value>65536</value>
 <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
 <name>http.content.limit</name>
 <value>130000</value>
 <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
 <name>file.content.limit</name>
 <value>65536</value>
 <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
 <name>file.content.limit</name>
 <value>130000</value>
 <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
 <name>ftp.content.limit</name>
 <value>65536</value>
 <description>The length limit for downloaded content, in bytes.
***** NUTCH-SITE.XML
 <name>ftp.content.limit</name>
 <value>130000</value>
 <description>The length limit for downloaded content, in bytes.
*****

***** nutch-default.xml
 <name>db.max.outlinks.per.page</name>
 <value>100</value>
<description>The maximum number of outlinks that we'll process for a page.
***** NUTCH-SITE.XML
 <name>db.max.outlinks.per.page</name>
 <value>200</value>
<description>The maximum number of outlinks that we'll process for a page.
*****

***** nutch-default.xml
 <name>db.fetch.retry.max</name>
 <value>3</value>
 <description>The maximum number of times a url that has encountered
***** NUTCH-SITE.XML
 <name>db.fetch.retry.max</name>
 <value>6</value>
 <description>The maximum number of times a url that has encountered
*****

***** nutch-default.xml
 <name>fetcher.server.delay</name>
 <value>5.0</value>
 <description>The number of seconds the fetcher will delay between
***** NUTCH-SITE.XML
 <name>fetcher.server.delay</name>
 <value>30.0</value>
 <description>The number of seconds the fetcher will delay between
*****

***** nutch-default.xml
 <name>fetcher.threads.fetch</name>
 <value>10</value>
 <description>The number of FetcherThreads the fetcher should use.
***** NUTCH-SITE.XML
 <name>fetcher.threads.fetch</name>
 <value>100</value>
 <description>The number of FetcherThreads the fetcher should use.
*****

***** nutch-default.xml
 <name>fetcher.threads.per.host</name>
 <value>1</value>
 <description>This number is the maximum number of threads that
***** NUTCH-SITE.XML
 <name>fetcher.threads.per.host</name>
 <value>100</value>
 <description>This number is the maximum number of threads that
*****

***** nutch-default.xml
 <name>parser.threads.parse</name>
 <value>10</value>
<description>Number of ParserThreads ParseSegment should use.</description>
***** NUTCH-SITE.XML
 <name>parser.threads.parse</name>
 <value>100</value>
<description>Number of ParserThreads ParseSegment should use.</description>
*****

***** nutch-default.xml
 <name>indexer.minMergeDocs</name>
 <value>50</value>
 <description>This number determines the minimum number of Lucene
***** NUTCH-SITE.XML
 <name>indexer.minMergeDocs</name>
 <value>10000</value>
 <description>This number determines the minimum number of Lucene
*****

***** nutch-default.xml
 <name>indexer.maxMergeDocs</name>
 <value>50</value>
 <description>This number determines the maximum number of Lucene
***** NUTCH-SITE.XML
 <name>indexer.maxMergeDocs</name>
 <value>10000000</value>
 <description>This number determines the maximum number of Lucene
*****

***** nutch-default.xml
 <name>searcher.dir</name>
 <value>.</value>
 <description>
***** NUTCH-SITE.XML
 <name>searcher.dir</name>
 <value>/srv/db/</value>
 <description>
*****

***** nutch-default.xml
 <name>ipc.client.timeout</name>
 <value>10000</value>
<description>Defines the timeout for IPC calls in milliseconds. </description>
***** NUTCH-SITE.XML
 <name>ipc.client.timeout</name>
 <value>20000</value>
<description>Defines the timeout for IPC calls in milliseconds. </description>
*****

***** nutch-default.xml
 <name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
 <description>Regular expression naming plugin directory names to
***** NUTCH-SITE.XML
 <name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|rtf|msword|msexcel|mspowerpoint)|index-(basic|more)|query-(basic|more|site|url)</value>
 <description>Regular expression naming plugin directory names to
*****

***** nutch-default.xml
 <name>parser.character.encoding.default</name>
 <value>windows-1252</value>
<description>The character encoding to fall back to when no other information
***** NUTCH-SITE.XML
 <name>parser.character.encoding.default</name>
 <value>iso-8859-2</value>
<description>The character encoding to fall back to when no other information
*****

Any idea what the source of the problem is?

Best Regards:
   Ferenc






--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
