Re: [Nutch-dev] OOM error during parsing with nekohtml

Doğacan Güney Mon, 16 Jul 2007 23:35:40 -0700

Hi,

On 7/17/07, Shailendra Mudgal <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> Thanks for your suggestions.
>
> I am running parse on a single url (
> http://www.fotofinity.com/cgi-bin/homepages.cgi). For other urls, parse
> works perfectly. We are getting this error because of the HTML of the page.
> The page contains many anchor tags which are not closed properly. Hence the
> neko html parser throws this exception. The page can be parsed successfully
> using tagsoup. We think this is a bug in the neko html parser.


Since tagsoup works and neko doesn't, I agree with you that this is a
bug with neko.

If you want to skip over this page (the parser will not extract text from
it, but parsing will otherwise complete successfully), you may try
changing the catch clause in ParseSegment.java:77 from Exception to
Throwable. This should catch the OOM error and let parsing continue.
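
A minimal sketch of the idea, not the actual ParseSegment code: the
PageParser interface, the parseAll method and the example URLs below are
hypothetical stand-ins, and the OutOfMemoryError is simulated. It only
shows why catch (Exception) lets an OOM escape while catch (Throwable)
does not, since OutOfMemoryError extends Error rather than Exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SkipOnThrowableSketch {

    /** Hypothetical stand-in for the real parser call; not a Nutch API. */
    interface PageParser {
        String parse(String url) throws Exception;
    }

    /** Parses every URL and returns the ones that were skipped because parsing failed. */
    static List<String> parseAll(List<String> urls, PageParser parser) {
        List<String> skipped = new ArrayList<>();
        for (String url : urls) {
            try {
                parser.parse(url);
            } catch (Throwable t) {
                // With catch (Exception e) an OutOfMemoryError would propagate
                // and abort the whole run; Throwable also covers Error subclasses.
                System.err.println("Skipping " + url + ": " + t);
                skipped.add(url);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Simulated failure: a real OOM is raised by the JVM, not by user code.
        PageParser parser = url -> {
            if (url.contains("bad")) {
                throw new OutOfMemoryError("Java heap space (simulated)");
            }
            return "text of " + url;
        };
        List<String> skipped = parseAll(
                Arrays.asList("http://example.com/good", "http://example.com/bad"), parser);
        System.out.println("Skipped: " + skipped);
    }
}
```

Note that catching an OOM does not guarantee the JVM is healthy afterwards,
so this is a workaround for skipping one bad page rather than a fix.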

>
>
> Regards,
> Shailendra
>
> On 7/16/07, Tsengtan A Shuy <[EMAIL PROTECTED]> wrote:
> >
> > Thank you for the info.
> > The OOM exception in your previous email indicates that your system is
> > running out of heap memory.  Either you have instantiated too many
> > objects, or there is a memory leak in the source code.
> >
> > Hope this will help you!
> > Cheer!!
> >
> > Adam Shuy, President
> > ePacific Web Design & Hosting
> > Professional Web/Software developer
> > TEL: 408-272-6946
> > www.epacificweb.com
> >
> > -----Original Message-----
> > From: Kai_testing Middleton [mailto:[EMAIL PROTECTED]
> > Sent: Monday, July 16, 2007 8:43 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: OOM error during parsing with nekohtml
> >
> > You could try looking at these two discussions:
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg06571.html
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg06571.html
> >
> > --Kai
> >
> > ----- Original Message ----
> > From: Tsengtan A Shuy <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > Sent: Monday, July 16, 2007 3:45:59 AM
> > Subject: RE: OOM error during parsing with nekohtml
> >
> > I successfully ran the whole-web crawl on my new Ubuntu OS, and I am
> > ready to fix the bug.  I need someone to guide me to the most recent
> > source code and the bug assignment.
> >
> > Thank you in advance!!
> >
> > Adam Shuy, President
> > ePacific Web Design & Hosting
> > Professional Web/Software developer
> > TEL: 408-272-6946
> > www.epacificweb.com
> > -----Original Message-----
> > From: Shailendra Mudgal [mailto:[EMAIL PROTECTED]
> > Sent: Monday, July 16, 2007 3:05 AM
> > To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > Subject: OOM error during parsing with nekohtml
> >
> > Hi All,
> >
> > We are getting an OOM Exception during the processing of
> > http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied
> > the Nutch-497 patch to our source code, but the error actually occurs
> > in the parse method.
> > Does anybody have any idea about this?  Here is the complete stack trace:
> >
> > java.lang.OutOfMemoryError: Java heap space
> >     at java.lang.String.toUpperCase(String.java:2637)
> >     at java.lang.String.toUpperCase(String.java:2660)
> >     at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces(NamespaceBinder.java:443)
> >     at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:252)
> >     at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1009)
> >     at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639)
> >     at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646)
> >     at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2343)
> >     at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
> >     at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
> >     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> >     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> >     at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
> >     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265)
> >     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229)
> >     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168)
> >     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
> >     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
> >     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
> >     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
> >
> >
> > Regards,
> > Shailendra
> >
>


-- 
Doğacan Güney
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
