[
https://issues.apache.org/jira/browse/NUTCH-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938962#comment-13938962
]
Alparslan Avcı commented on NUTCH-1739:
---------------------------------------
Hi [~yangshangchuan], and thanks for the patch!
IMHO, a FixedThreadPool is not needed in this case. As you can see in the source
code of _Executors.java_, the _newCachedThreadPool()_ method is implemented as
follows:
{code:java}
public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
    return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                                  60L, TimeUnit.SECONDS,
                                  new SynchronousQueue<Runnable>(),
                                  threadFactory);
}
{code}
As you can see, the keepAliveTime parameter is set to 60 seconds, which means
that idle threads wait 60 seconds for new tasks before terminating. So threads
are created as needed and killed once they go idle. As a data point from
experience, we have parsed tens of millions of webpages and never hit a problem
using a CachedThreadPool. Another point is that choosing a fixed thread-pool
size is hard when the number of crawled webpages is very large.
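To make the argument above concrete, here is a small sketch (not part of any patch; class name is my own) that inspects the parameters a cached pool is actually constructed with:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolParams {
    public static void main(String[] args) {
        // newCachedThreadPool() is just a ThreadPoolExecutor preconfigured as shown above
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();
        System.out.println("corePoolSize  = " + pool.getCorePoolSize());          // 0
        System.out.println("maxIsUnbounded = "
                + (pool.getMaximumPoolSize() == Integer.MAX_VALUE));              // true
        System.out.println("keepAliveSec  = "
                + pool.getKeepAliveTime(TimeUnit.SECONDS));                       // 60
        pool.shutdown();
    }
}
```

The 60-second keep-alive is what lets the pool shrink back to zero threads on its own once parsing load drops.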
> ExecutorService field in ParseUtil.java not be right use and cause memory leak
> ------------------------------------------------------------------------------
>
> Key: NUTCH-1739
> URL: https://issues.apache.org/jira/browse/NUTCH-1739
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.6, 2.1, 1.7, 2.2, 1.8, 2.2.1
> Environment: JDK32, runtime/local
> Reporter: ysc
> Priority: Critical
> Attachments: nutch1.7.patch, nutch2.2.1.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> ########################Problem########################
> java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:640)
> at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
> ########################Analysis########################
> My server uses a 32-bit JDK. At first I thought not enough memory had been
> specified. Since {{java -Xmx2600m -version}} succeeded, I knew the server
> could use at most about 2.6 GB. So I added the line {{NUTCH_HEAPSIZE=2000}}
> to the bin/nutch script, but that did not solve the problem.
> Then I checked the source code to see where so many threads were being
> created. I found the code
> {code:java}
> parseResult = new ParseUtil(getConf()).parse(content);
> {code}
> at line 97 of org.apache.nutch.parse.ParseSegment.java, inside its map
> method.
> Continuing from there: the constructor of ParseUtil instantiates a
> CachedThreadPool with no limit on the pool size, see the code:
> {code:java}
> executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
> .setNameFormat("parse-%d").setDaemon(true).build());
> {code}
> From the above analysis: each call to the map method instantiates a new
> CachedThreadPool and never closes it. So the ExecutorService field in
> ParseUtil.java is not used correctly and causes a thread/memory leak.
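The leak pattern described above can be reproduced outside Nutch. The sketch below (my own illustration, not from the patch) creates a fresh CachedThreadPool per simulated map call and never shuts it down; every pool keeps its idle worker alive for the 60-second keep-alive window, so worker threads pile up linearly with the number of calls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class PoolPerCallLeak {
    // Simulates the buggy pattern: a new CachedThreadPool per "map call",
    // never shut down. Returns the number of worker threads still alive.
    public static int leakedWorkers(int calls) throws Exception {
        List<ExecutorService> pools = new ArrayList<>();
        for (int i = 0; i < calls; i++) {
            ExecutorService es = Executors.newCachedThreadPool();
            es.submit(() -> { }).get();   // one trivial "parse" task
            pools.add(es);                // missing es.shutdown(): worker idles for 60 s
        }
        int alive = 0;
        for (ExecutorService es : pools) {
            alive += ((ThreadPoolExecutor) es).getPoolSize();
        }
        return alive;
    }

    public static void main(String[] args) throws Exception {
        // 20 simulated map calls leave 20 idle worker threads behind
        System.out.println(leakedWorkers(20));
    }
}
```

On a 32-bit JVM, where each native thread stack eats into a small address space, this accumulation plausibly ends in the "unable to create new native thread" error shown in the stack trace above.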
> ########################Solution########################
> Have every map method use a shared FixedThreadPool object whose size can be
> configured in nutch-site.xml; see the patch file for details.
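A minimal sketch of the shared-pool idea (class name and the idea of passing the size in from configuration are my own; the actual Nutch property name and wiring are in the attached patches):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedParsePool {
    private static volatile ExecutorService pool;

    // Lazily create one bounded pool shared by all map calls, instead of a
    // fresh unbounded pool per ParseUtil instance. poolSize would be read
    // from a nutch-site.xml property.
    public static ExecutorService get(int poolSize) {
        if (pool == null) {
            synchronized (SharedParsePool.class) {
                if (pool == null) {
                    pool = Executors.newFixedThreadPool(poolSize);
                }
            }
        }
        return pool;
    }

    public static void main(String[] args) {
        // Every caller receives the same bounded pool instance
        System.out.println(get(4) == get(4));  // true
        get(4).shutdown();
    }
}
```

With a single fixed-size pool, the worst-case thread count is known up front, at the cost of having to pick a size that suits the workload, which is the trade-off discussed in the comment above.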
--
This message was sent by Atlassian JIRA
(v6.2#6252)