Hi Tobi,
First, I'd suggest getting and reading through the sources of existing
Java-based web crawlers. They all use HttpClient, and thus provide a lot
of useful example code:
Nutch (Apache)
Droids (Apache)
Heritrix (Archive)
Bixo (http://bixo.101tec.com)
Some comments below:
On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:
Hi Guys,
I am working on a parallel webcrawler implementation in Java. I
could use some help with a design question and a bug that is
costing me sleep ;-)
First, this is my design: I have a list which stores URLs that
have been crawled already. Further, I have a Queue which is
responsible for providing the crawler with the next URL to fetch. Then
I have a ThreadController which spawns new crawler-threads until a
maximum number is reached. Finally there are crawler-threads that
process a URL given by the queue. They work until the queue size is
zero, and then the system stops.
Here is my question: I am using (basically) the following
statements. As I am new to HttpClient this is probably a dumb
approach, and I'd be happy for feedback.
<snip from WebCrawlerThread>
DefaultHttpClient client;
HttpGet get;

public void run() {
    client = new DefaultHttpClient();
    HttpResponse response = client.execute(get);
    HttpEntity entity = response.getEntity();
    String mimetype = entity.getContentType().getValue();
    String rawPage = EntityUtils.toString(entity);
    client.getConnectionManager().shutdown();
    (...) doing crawler things
}
</snip>
First thing: Is the thread the right place to host the client
object, or should it be shared?
You should use the ThreadSafeClientConnManager, and reuse the same
DefaultHttpClient instance for all threads.
See the init() method of Bixo's SimpleHttpFetcher class for an example
of setting this up.
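To give you the general idea (this is a hedged sketch, not Bixo's actual init() code — the class and method names here are ones I made up for illustration), sharing one client across threads with the 4.0-era API looks roughly like this:

```java
import org.apache.http.conn.params.ConnManagerParams;
import org.apache.http.conn.scheme.PlainSocketFactory;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;
import org.apache.http.params.BasicHttpParams;
import org.apache.http.params.HttpParams;

public class SharedClientFactory {
    // Build one DefaultHttpClient backed by a thread-safe connection
    // manager; create it once, then hand the same instance to every
    // crawler thread.
    public static DefaultHttpClient create(int maxThreads) {
        HttpParams params = new BasicHttpParams();
        // Allow as many total connections as you have crawler threads.
        ConnManagerParams.setMaxTotalConnections(params, maxThreads);

        SchemeRegistry registry = new SchemeRegistry();
        registry.register(new Scheme("http",
                PlainSocketFactory.getSocketFactory(), 80));

        ThreadSafeClientConnManager cm =
                new ThreadSafeClientConnManager(params, registry);
        return new DefaultHttpClient(cm, params);
    }
}
```

Each thread then just calls execute() on the shared client and consumes the entity; don't call shutdown() on the connection manager until the whole crawl is done.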
Second: Would it enhance performance if I reuse the connection
somehow?
Yes, via keep-alive. Though you then have to be a bit more careful
about handling stale connections (ones that the server has shut down).
Again, take a look at the Bixo SimpleHttpFetcher class for some code
that tries (at least) to do this properly.
And most importantly, the bug: with an increasing number of pages I
receive zillions of
"java.net.BindException: Address already in use: connect"
No idea, sorry.
But I think that by default HttpClient limits the number of parallel
requests to a single host to two. Not sure if that would be a factor in
your case, given how you're creating a new client for each request.
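If that limit does turn out to matter for your crawl, it's configurable. A sketch of raising it, assuming the 4.0-era ConnManagerParams API (the class name and the limits of 100/10 are my own picks, not recommendations):

```java
import org.apache.http.conn.params.ConnManagerParams;
import org.apache.http.conn.params.ConnPerRouteBean;
import org.apache.http.params.BasicHttpParams;
import org.apache.http.params.HttpParams;

public class ConnectionLimits {
    // Build an HttpParams with the per-host (per-route) limit raised
    // above the default of 2; pass the result into the
    // ThreadSafeClientConnManager and DefaultHttpClient constructors.
    public static HttpParams create() {
        HttpParams params = new BasicHttpParams();
        ConnManagerParams.setMaxTotalConnections(params, 100);
        ConnManagerParams.setMaxConnectionsPerRoute(
                params, new ConnPerRouteBean(10));
        return params;
    }
}
```

For a polite crawler you usually want this per-host limit low anyway, so you don't hammer any single server.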
-- Ken
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378