> Googling reveals
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952 and
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507 so you
> could try increasing the Java stack size in bin/nutch (-Xss), or use
> an alternate regexp if you can.
> 
> Just out of curiosity, why does a performance critical program such as
> Nutch use Sun's backtracking-based regexp implementation rather than
> an efficient Thompson-based one?  Do you need the additional
> expressiveness provided by PCRE?
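
To make the "alternate regexp" suggestion above concrete, here is a minimal sketch, not taken from Nutch. The class name RegexStackDemo, the helper names, and the two patterns are made up for illustration. It assumes the problem pattern is a repeated group with alternation, which java.util.regex matches recursively, and shows the same language written as a character class. The helper reports a stack overflow as null instead of crashing, so you can compare both forms on your own JVM and -Xss setting.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from Nutch.
class RegexStackDemo {

    // Returns TRUE or FALSE for a completed match, or null if the
    // matcher ran out of stack while matching.
    static Boolean matchesOrNull(Pattern p, CharSequence input) {
        try {
            return p.matcher(input).matches();
        } catch (StackOverflowError e) {
            return null;
        }
    }

    // Builds s repeated n times (kept explicit so it runs on old JDKs).
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder(s.length() * n);
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = repeat("ab", 100000);

        // Repeated group with alternation: may exhaust the stack on long input.
        Pattern alternation = Pattern.compile("(a|b)*");
        // Same language as a character class: no group to recurse into.
        Pattern charClass = Pattern.compile("[ab]*");

        System.out.println("(a|b)* -> " + matchesOrNull(alternation, input));
        System.out.println("[ab]*  -> " + matchesOrNull(charClass, input));
    }
}
```

Whether the first pattern actually overflows depends on the JVM version and stack size, so treat the output as something to check locally rather than a guarantee.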


Very interesting point... we should use it for BIXO too.


BTW, Sun's JDK has memory leaks in LinkedBlockingQueue:
http://bugs.sun.com/view_bug.do?bug_id=6806875
http://tech.groups.yahoo.com/group/bixo-dev/message/329


And, of course, java.net.URL is synchronized; Apache Tomcat uses a simplified
version of the URL class.
And RegexUrlNormalizer is synchronized in Nutch...
And, in order to retrieve plain text from HTML, we are creating a fat DOM
object (instead of using, for instance, filters in NekoHTML).
And more...

-Fuad,
+1 416-993-2060
http://www.tokenizer.ca




