Hi-
I am using HttpClient in a multi-threaded web crawler application, with the
MultiThreadedHttpConnectionManager and 300 threads that download pages from
various sites.
The problem is that I run out of memory shortly after the process begins. I
used JProfiler to analyze the heap, and it points to

    76.2% - 233,587 kB - 6,626 alloc. - org.apache.commons.httpclient.HttpMethod.getResponseBodyAsString

as the culprit (with 300 threads operating at once, I would expect at most a
little over 300 allocations). Other relevant information: I am on Windows XP
Pro, using the Sun JRE that ships with jdk1.5.0_06, and commons-httpclient-3.0.jar.
Here is the code where I initialize the HttpClient:
private HttpClient httpClient;

public CrawlerControllerThread(QueueThread qt, MessageReceiver receiver,
        int maxThreads, String flag, boolean filter, String filterString,
        String dbType) {
    this.qt = qt;
    this.receiver = receiver;
    this.maxThreads = maxThreads;
    this.flag = flag;
    this.filter = filter;
    this.filterString = filterString;
    this.dbType = dbType;
    threads = new ArrayList();
    lastStatus = new HashMap();

    HttpConnectionManagerParams htcmp = new HttpConnectionManagerParams();
    htcmp.setMaxTotalConnections(maxThreads);
    htcmp.setDefaultMaxConnectionsPerHost(10);
    htcmp.setSoTimeout(5000);

    MultiThreadedHttpConnectionManager mtcm = new MultiThreadedHttpConnectionManager();
    mtcm.setParams(htcmp);
    httpClient = new HttpClient(mtcm);
}
The httpClient reference is then passed to all the crawling threads, where it
is used as follows:
private String getPageApache(URL pageURL, ArrayList unProcessed) {
    SaveURL saveURL = new SaveURL();
    HttpMethod method = null;
    HttpURLConnection urlConnection = null;
    String rawPage = "";
    try {
        method = new GetMethod(pageURL.toExternalForm());
        method.setFollowRedirects(true);
        method.setRequestHeader("Content-type", "text/html");
        int statusCode = httpClient.executeMethod(method);
        // urlConnection = new HttpURLConnection(method, pageURL);
        logger.debug("Requesting: " + pageURL.toExternalForm());
        rawPage = method.getResponseBodyAsString();
        // rawPage = saveURL.getURL(urlConnection);
        if (rawPage == null) {
            unProcessed.add(pageURL);
        }
        return rawPage;
    } catch (IllegalArgumentException e) {
        // e.printStackTrace();
    } catch (HttpException e) {
        // e.printStackTrace();
    } catch (IOException e) {
        unProcessed.add(pageURL);
        // e.printStackTrace();
    } finally {
        if (method != null) {
            method.releaseConnection();
        }
        try {
            if (urlConnection != null) {
                if (urlConnection.getInputStream() != null) {
                    urlConnection.getInputStream().close();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        urlConnection = null;
        method = null;
    }
    return null;
}
As you can see, I release the connection in the finally block, so that should
not be the problem. After getPageApache returns, the page string is processed
and then set to null so it can be garbage collected. I have been experimenting
with this (closing streams, using HttpURLConnection instead of GetMethod) and
I still cannot find the answer. Indeed, it seems the answer does not lie in my
code.
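For reference, here is a rough sketch of the kind of stream-based reading I
have been experimenting with in place of getResponseBodyAsString (the readBody
helper name and the 1 MB cap are just illustrative, not part of the real
crawler; it uses java.io.InputStreamReader):

    // Read the body through getResponseBodyAsStream() and cap how much is
    // buffered, so a single huge page cannot exhaust the heap.
    private String readBody(GetMethod method) throws IOException {
        InputStream in = method.getResponseBodyAsStream();
        if (in == null) {
            return null;
        }
        try {
            Reader reader = new InputStreamReader(in, method.getResponseCharSet());
            StringBuffer page = new StringBuffer();
            char[] chunk = new char[4096];
            int read;
            while ((read = reader.read(chunk)) != -1 && page.length() < 1024 * 1024) {
                page.append(chunk, 0, read);
            }
            return page.toString();
        } finally {
            in.close();
        }
    }

The idea is simply to avoid letting getResponseBodyAsString buffer an
arbitrarily large page in one go.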
I greatly appreciate any help anyone can give me; I am at the end of my rope
with this one.
James