The crawl itself is quite large (millions of pages), but this particular error should have nothing to do with the crawl size. It seems that one of the pages has an extremely deep HTML DOM tree and, since getMetaTagsHelper walks that tree recursively, a StackOverflowError is thrown. Once I get hold of the particular page that is causing this error, I'll be able to confirm this.
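
If the deep-DOM theory holds, one alternative to raising the Java stack size might be to walk the tree with an explicit stack instead of recursion. Below is a minimal sketch of that idea, not the actual Nutch code: the class IterativeMetaWalk and its methods countMetaTags and buildDeepDocument are hypothetical names made up for illustration, and the synthetic document only stands in for the (still unidentified) offending page.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration; this is not HTMLMetaProcessor from Nutch.
public class IterativeMetaWalk {

    /** Counts <meta> elements under root using an explicit stack, so the
     *  depth of the DOM tree consumes heap space rather than call-stack frames. */
    public static int countMetaTags(Node root) {
        int count = 0;
        Deque<Node> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "meta".equalsIgnoreCase(node.getNodeName())) {
                count++;
            }
            // Push children last-to-first so they are popped in document order.
            for (Node child = node.getLastChild(); child != null;
                    child = child.getPreviousSibling()) {
                pending.push(child);
            }
        }
        return count;
    }

    /** Builds a synthetic document: html, then `depth` nested divs, then one meta.
     *  The chain is assembled from the innermost element outwards. */
    public static Document buildDeepDocument(int depth) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element inner = doc.createElement("meta");
        for (int i = 0; i < depth; i++) {
            Element wrapper = doc.createElement("div");
            wrapper.appendChild(inner);
            inner = wrapper;
        }
        Element html = doc.createElement("html");
        html.appendChild(inner);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // 50000 levels is an arbitrary depth chosen for the demonstration.
        Document doc = buildDeepDocument(50000);
        System.out.println("meta tags found: " + countMetaTags(doc));
    }
}
```

The trade-off is that the pending nodes live on the heap, which is usually far larger than a thread's stack, so the traversal depth is no longer bounded by the stack size setting.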
I'm using Nutch 1.0-dev (svn trunk).

Thanks,
Siddhartha

On Thu, Jun 12, 2008 at 9:16 AM, Jason Boss <[EMAIL PROTECTED]> wrote:
> How big is this crawl you are doing?
>
> What version of nutch?
>
> Jason
>
>
> On Wed, Jun 11, 2008 at 8:32 PM, Siddhartha Reddy <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > While parsing some pages, I am getting a java.lang.StackOverflowError
> > exception due to the recursion in HTMLMetaProcessor.getMetaTagsHelper. I'm
> > pasting part of the stack trace below. Unfortunately, I've logic that
> > deletes the segment if fetch/parse fails, so I do not know which particular
> > web page caused this problem; I'll recrawl the same pages with modified
> > logic (that does not delete the segment on failed parsing) and try to find
> > the offending URL.
> >
> > Did anyone encounter such a problem before? Apart from increasing the stack
> > size for Java, is there any other possible solution?
> >
> > java.lang.StackOverflowError
> >     at java.lang.Character.toUpperCase(Character.java:4278)
> >     at java.lang.String.regionMatches(String.java:1384)
> >     at java.lang.String.equalsIgnoreCase(String.java:1120)
> >     at org.apache.nutch.parse.html.HTMLMetaProcessor.getMetaTagsHelper(HTMLMetaProcessor.java:55)
> >     at org.apache.nutch.parse.html.HTMLMetaProcessor.getMetaTagsHelper(HTMLMetaProcessor.java:208)
> >     at org.apache.nutch.parse.html.HTMLMetaProcessor.getMetaTagsHelper(HTMLMetaProcessor.java:208)
> >     at org.apache.nutch.parse.html.HTMLMetaProcessor.getMetaTagsHelper(HTMLMetaProcessor.java:208)
> >     at org.apache.nutch.parse.html.HTMLMetaProcessor.getMetaTagsHelper(HTMLMetaProcessor.java:208)
> >     at org.apache.nutch.parse.html.HTMLMetaProcessor.getMetaTagsHelper(HTMLMetaProcessor.java:208)
> >     ....
> >
> > Thanks,
> > Siddhartha
> >
>

--
http://sids.in
"If you are not having fun, you are not doing it right."
