Hi Piotr,
Thanks for the help. I think I found the source of the error. It
was in "crawl-urlfilter.txt".
I had the following regular expression to grab all the URLs:
+^http://([a-z0-9]*\.)*(a-z0-9*)*
The regex engine must have run into an infinite loop (runaway backtracking) on that pattern.
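For the record, here is a quick standalone check of what I think I meant to write. This is plain java.util.regex, not the actual Nutch filter code; the class name, the test URLs, and the corrected character class [a-z0-9] in the second group are just my own guesses at the intent:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class UrlFilterRegexCheck {
        public static void main(String[] args) {
            // Corrected pattern: the second group gets its character class
            // brackets back, so the expression no longer repeats a group
            // that can match the empty string.
            Pattern fixed = Pattern.compile("^http://([a-z0-9]*\\.)*[a-z0-9]+");

            String[] urls = {
                "http://www.math.psu.edu/MathLists/Contents.html",
                "ftp://example.org/file.txt"
            };
            for (int i = 0; i < urls.length; i++) {
                Matcher m = fixed.matcher(urls[i]);
                System.out.println(urls[i] + " -> "
                        + (m.find() ? "accept" : "reject"));
            }
        }
    }

With that, the corresponding accept line in crawl-urlfilter.txt would presumably look something like:

    +^http://([a-z0-9]*\.)*[a-z0-9]+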
Thanks,
Rajesh
On 4/7/06, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
>
> Hello Rajesh,
> I have run bin/nutch crawl urls -dir crawl.test -depth 3
> on standard nutch-0.7.2 setup.
> The urls file contains http://www.math.psu.edu/MathLists/Contents.html only.
> In crawl-urlfilter I have changed the URL pattern to:
> # accept hosts in MY.DOMAIN.NAME
> +^http://
>
> JVM: java version "1.4.2_06"
> Linux
>
> It runs without problems.
> Please reinstall from the distribution, make only the required changes, and
> retest. If it fails, we will try to track it down again.
> Regards
> Piotr
>
>
>
> Rajesh Munavalli wrote:
> > Forgot to mention one more parameter. Modify the crawl-urlfilter to
> > accept any URL.
> >
> > On 4/6/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote:
> >> Java version: JSDK 1.4.2_08
> >> URL Seed: http://www.math.psu.edu/MathLists/Contents.html
> >>
> >> I even tried allocating more stack memory with the "-Xss" option and more
> >> heap memory with "-Xms". However, if I run the individual tools
> >> (fetchlisttool, fetcher, updatedb, etc.) separately from the shell, it
> >> works fine.
> >>
> >> Thanks,
> >> --Rajesh
> >>
> >>
> >>
> >> On 4/6/06, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
> >>> Which Java version do you use?
> >>> Is it the same for all URLs or only for a specific one?
> >>> If the URL you are trying to crawl is public you can send it to me (off
> >>> list, if you wish) and I can check it on my machine.
> >>> Regards
> >>> Piotr
> >>>
> >>> Rajesh Munavalli wrote:
> >>>> I had earlier posted this message to the list but haven't received any
> >>>> response.
> >>>> Here are more details.
> >>>>
> >>>> Nutch version: nutch-0.7.2
> >>>> URL File: contains a single URL. File name: "urls"
> >>>> Crawl-url-filter: is set to grab all URLs
> >>>>
> >>>> Command: bin/nutch crawl urls -dir crawl.test -depth 3
> >>>> Error: java.lang.StackOverflowError
> >>>>
> >>>> The error occurs while it executes the "UpdateDatabaseTool".
> >>>>
> >>>> One solution I can think of is to provide more stack memory. But is
> >>>> there a better solution to this?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Rajesh
> >>>>
> >>>
> >>>
> >>>
> >
>
>