I already suggested adding a kind of timeout mechanism here and had implemented it for my installation; however, the patch suggestion was rejected since it was a 'non-reproducible' problem.

:-/

On 07.04.2006 at 21:55, Rajesh Munavalli wrote:

Hi Piotr,
Thanks for the help. I think I found the source of the error. It
was in the "crawl-urlfilter.txt".

I had the following regular expression to grab all the URLs:
+^http://([a-z0-9]*\.)*(a-z0-9*)*

The regex engine must have run into an infinite loop.
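For anyone who hits the same thing: nested quantifiers like the ones above let a backtracking regex engine split the same run of characters between the inner and outer repetition in exponentially many ways whenever a URL almost matches, so the match either hangs or, depending on the engine, dies with a StackOverflowError. Below is a throwaway sketch that shows the blow-up; the pattern, class name, and test URL are made up for illustration and are not the exact ones from my filter file.

    import java.util.regex.Pattern;

    public class BacktrackDemo {
        public static void main(String[] args) {
            // Nested quantifiers: the inner [a-z0-9]+ and the outer (...)+ can divide
            // the same run of characters between them in exponentially many ways.
            Pattern evil = Pattern.compile("^http://([a-z0-9]+)+\\.html$");

            for (int n = 14; n <= 24; n += 2) {
                StringBuilder url = new StringBuilder("http://");
                for (int i = 0; i < n; i++) url.append('a');
                url.append(".HTML"); // uppercase suffix forces a failure at the very end

                long start = System.nanoTime();
                boolean matched = evil.matcher(url).matches();
                long ms = (System.nanoTime() - start) / 1000000L;
                // Each extra pair of characters roughly quadruples the time;
                // push n much higher and the match effectively never returns.
                System.out.println("n=" + n + " matched=" + matched + " in " + ms + " ms");
            }
        }
    }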

Thanks,

Rajesh


On 4/7/06, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:

Hello Rajesh,
I have run bin/nutch crawl urls -dir crawl.test -depth 3
on a standard nutch-0.7.2 setup.
The urls file contains only http://www.math.psu.edu/MathLists/Contents.html.
In crawl-urlfilter I have changed the url pattern to:
# accept hosts in MY.DOMAIN.NAME
+^http://
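
For reference, crawl-urlfilter is just a list of rules checked top to bottom: lines starting with '+' accept, lines starting with '-' reject, '#' starts a comment, and the first matching rule decides. A rough sketch of a permissive setup is below; the suffix list is only an illustration, not the stock file verbatim.

    # skip file:, ftp: and mailto: urls
    -^(file|ftp|mailto):
    # skip some image/archive suffixes
    -\.(gif|jpg|png|ico|css|zip|gz)$
    # accept everything else served over http
    +^http://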

JVM: java version "1.4.2_06"
Linux

It runs without problems.
Please reinstall from the distribution, make only the required changes, and
retest. If it still fails we will try to track it down again.
Regards
Piotr



Rajesh Munavalli wrote:
Forgot to mention one more parameter. Modify the crawl-urlfilter to accept
any URL.

On 4/6/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote:
 Java version: JSDK 1.4.2_08
URL Seed: http://www.math.psu.edu/MathLists/Contents.html

I even tried allocating more stack memory with the "-Xss" option and a larger heap with "-Xms". However, if I run the individual tools (fetchlisttool,
fetcher, updatedb, etc.) separately from the shell, it works fine.
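
In case it is useful: "-Xss" sets the per-thread stack size, so it directly controls how deep the recursion can go before the error, while "-Xms" only sets the initial heap. A throwaway sketch (made-up class name) shows the effect when launched with different stack sizes, e.g. "java StackDepthDemo" versus "java -Xss8m StackDepthDemo":

    public class StackDepthDemo {
        private static int depth = 0;

        private static void recurse() {
            depth++;
            recurse();
        }

        public static void main(String[] args) {
            try {
                recurse();
            } catch (StackOverflowError e) {
                // A bigger -Xss lets this count grow, but it only postpones the
                // failure if the real cause is runaway regex backtracking.
                System.out.println("Stack overflowed after " + depth + " calls");
            }
        }
    }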

Thanks,
 --Rajesh



On 4/6/06, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
Which Java version do you use?
Is it the same for all urls or only for specific one?
If the URL you are trying to crawl is public you can send it to me (off list
if you wish) and I can check it on my machine.
Regards
Piotr

Rajesh Munavalli wrote:
I had earlier posted this message to the list but haven't gotten any
response.
Here are more details.

Nutch version: nutch-0.7.2
URL file: contains a single URL. File name: "urls"
crawl-urlfilter: set to accept all URLs

Command: bin/nutch crawl urls -dir crawl.test -depth 3
Error: java.lang.StackOverflowError

The error occurs while it executes the "UpdateDatabaseTool".

One solution I can think of is to provide more stack memory. But is there a
better solution to this?

Thanks,

Rajesh








---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



