ysc created NUTCH-1739:
--------------------------
Summary: ExecutorService field in ParseUtil.java is not used correctly
and causes a memory leak
Key: NUTCH-1739
URL: https://issues.apache.org/jira/browse/NUTCH-1739
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.2.1, 1.8, 2.2, 1.7, 2.1, 1.6
Environment: JDK32, runtime/local
Reporter: ysc
Priority: Critical
########################Problem########################
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:640)
	at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
	at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
	at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
	at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
	at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:662)
########################Analysis########################
My server uses a 32-bit JDK. At first I thought not enough memory had been
specified. Since {{java -Xmx2600m -version}} succeeded, I knew the server could
use at most about 2.6G of heap. So I added the line {{NUTCH_HEAPSIZE=2000}} to
the bin/nutch script, but that did not solve the problem.
Then I checked the source code to find where so many threads were being
created. In the map method of org.apache.nutch.parse.ParseSegment, at line 97,
I found the following code:
{code:java}
parseResult = new ParseUtil(getConf()).parse(content);
{code}
Following that call, the constructor of ParseUtil instantiates a cached thread
pool, which has no limit on its pool size:
{code:java}
executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
.setNameFormat("parse-%d").setDaemon(true).build());
{code}
From the analysis above, each map call instantiates a new cached thread pool
and never shuts it down. So the ExecutorService field in ParseUtil.java is not
used correctly and causes a memory leak.
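To illustrate, here is a minimal stand-alone sketch of the same pattern (the
class and thread names are hypothetical, not Nutch code): every instance
creates its own cached pool, nothing shuts it down, and idle worker threads
linger until the JVM runs out of native threads.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in reproducing the pattern in ParseUtil:
// a fresh unbounded cached pool per instance, never shut down.
public class LeakSketch {

    static class ParserLikeUtil {
        // Mirrors ParseUtil's field; daemon threads, as in the original
        // ThreadFactoryBuilder call.
        final ExecutorService executorService = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "parse-sketch");
            t.setDaemon(true);
            return t;
        });
    }

    public static void main(String[] args) {
        // Each map() call does `new ParseUtil(getConf()).parse(content)`,
        // so every input record repeats this:
        for (int i = 0; i < 3; i++) {
            ParserLikeUtil util = new ParserLikeUtil();
            util.executorService.submit(() -> { /* parse work */ });
            // Missing: util.executorService.shutdown();
            // Each pool's idle workers stay alive for 60s; under load the
            // JVM fails with "unable to create new native thread".
        }
    }
}
{code}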
########################Solution########################
Have every map call use a shared FixedThreadPool whose size can be configured
in nutch-site.xml; see the attached patch file for details.
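A sketch of that shared-pool approach, assuming a lazily created singleton
(the class, method, and "parse.pool.size" property names here are illustrative,
not the actual patch):
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: one bounded pool shared by every map() call.
public class SharedPoolSketch {

    private static volatile ExecutorService sharedPool;

    /** Lazily create one fixed-size pool shared by all callers. */
    static ExecutorService getSharedPool(int poolSize) {
        if (sharedPool == null) {
            synchronized (SharedPoolSketch.class) {
                if (sharedPool == null) {
                    // poolSize would be read from nutch-site.xml, e.g. a
                    // hypothetical "parse.pool.size" property.
                    sharedPool = Executors.newFixedThreadPool(poolSize, r -> {
                        Thread t = new Thread(r, "parse-shared");
                        t.setDaemon(true);
                        return t;
                    });
                }
            }
        }
        return sharedPool;
    }
}
{code}
With this shape, the number of parser threads is capped at poolSize no matter
how many records a task processes, and the pool outlives individual ParseUtil
instances instead of leaking with them.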
--
This message was sent by Atlassian JIRA
(v6.2#6252)