Re: Parallel Webcrawler Implementation

Oleg Kalnichevski Thu, 24 Sep 2009 13:07:44 -0700

sebb wrote:

On 24/09/2009, Ken Krugler <[email protected]> wrote:

Hi Tobi,


 First, I'd suggest getting and reading through the sources of existing
Java-based web crawlers. They all use HttpClient, and thus would provide
much useful example code:

 Nutch (Apache)
 Droids (Apache)
 Heritrix (Archive)
 Bixo (http://bixo.101tec.com)

 Some comments below:

 On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:

Hi Guys,

I am working on a parallel webcrawler implementation in Java. I could use

some help with some design question and a bug that takes my sleep ;-)

First thing, this is my design: I have a list, which stores URL's that

have been crawled already. Furhter I have a Queue which is responsible to
provide the crawler with the next URL to fetch. Then I have a
ThreadController which spawns new crawler-threads until a maximum number is
reached. Finally there are crawler-threads that process a URL given by the
queue. They work until the queue size is zero and then the system stops.

Following is my question: I am using (basically) the following statements.

As I am new to httpclient this could probably a dump approach, and I am
happy for feedback.

<snip from WebCrawlerThread>
DefaultHttpClient client;
HttpGet get;

 public run() {
     client = new DefaultHttpClient();
     HttpResponse response = client.execute(get);
     HttpEntity entity = response.getEntity();
     String mimetype =

entity.getContentType().getValue();

     String rawPage = EntityUtils.toString(entity);
     client.getConnectionManager().shutdown();

    (...) doing crawler things
 }
</snap>

First thing: Is the thread the right place to host the client object, or

should it be shared?
 You should use the ThreadSafeClientConnManager, and reuse the same
DefaultHttpClient instance for all threads.

 See the init() method of Bixo's SimpleHttpFetcher class for an example of
setting this up.

Second: Would it enhance performance if I reuse the connection somehow?

 Yes, via keep-alive. Though you then have to be a bit more careful about
handling stale connections (ones that the server has shut down).

 Again, take a look at the Bixo SimpleHttpFetcher class for some code that
tries (at least) to do this properly.

And most important the bug: With increasing number of pages I receive

zillions of

"java.net.BindException: Address already in use: connect"


I've seen this error generated when a WinXP host runs out of sockets.
i.e. the message is misleading in this case.

Which is hardly surprising given that the crawler creates a newconnection per EACH request / link.


Oleg

 No idea, sorry.

 But I think that by default HttpClient limits the number of parallel
request to one host to be two. Not sure if that would be a factor in your
case, given how you're creating a new client for each request.

 -- Ken



 --------------------------
 Ken Krugler
 TransPac Software, Inc.
 <http://www.transpac.com>
 +1 530-210-6378


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Parallel Webcrawler Implementation

Reply via email to