On Jan 26, 2010, at 3:54am, Claudio Martella wrote:

As I mentioned in the previous post, I'm using HttpClient for a
web crawler I'm writing. At the moment I'm doing something like this:


   while (toVisit.size() > 0) {

       client.execute(method);
       // getContentType() does method.getResponseHeader("Content-Type").getValue()
       String mime = getContentType(method);

       if (supportedMimes.contains(mime)) {
           handle(method.getResponseBody());
       } else {
           continue;
       }
   }

The problem is that I can see the crawler hanging for a long time on
URLs that are going to be ignored, so I guess it's downloading the whole
stream before ignoring it. Is there a way to download just the headers,
check the content type, and only then download the stream (at the time
of getResponseBody())?

See the code I'd previously referenced for an example of exactly that.

http://github.com/bixo/bixo/blob/master/src/main/java/bixo/fetcher/http/SimpleHttpFetcher.java

Make sure you abort the request if you skip getting the response.
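
A rough sketch of the header-first idea, for illustration only. This is not the Bixo code, and it uses the JDK's HttpURLConnection instead of HttpClient so that it runs without extra libraries. The names HeaderFirstFetch, baseMimeType and fetchIfSupported are made up for this example. It assumes Java 9 or later (for readAllBytes). The pattern is: read the headers, compare the content type, and release the connection without reading the body when the type is not wanted. With HttpClient the release step is aborting the request, as noted above. The sketch also strips parameters such as "; charset=UTF-8" from the header value, since a plain contains() check on the raw header would otherwise miss them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

class HeaderFirstFetch {

    // Reduce a Content-Type header value to its bare MIME type:
    // "Text/HTML; charset=UTF-8" becomes "text/html". A missing header gives "".
    static String baseMimeType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the response body, or null when the content type is not in
    // supportedMimes. The body is only read for supported types.
    static byte[] fetchIfSupported(URL url, Set<String> supportedMimes) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            // Asking for a header sends the request and parses the response
            // headers; the body has not been consumed at this point.
            String mime = baseMimeType(conn.getContentType());
            if (!supportedMimes.contains(mime)) {
                return null; // skip the body; the finally block releases the connection
            }
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes();
            }
        } finally {
            // Plays the role of aborting the request when the body was skipped.
            conn.disconnect();
        }
    }
}
```

In the crawl loop this would replace the execute/getResponseBody pair: call fetchIfSupported(url, supportedMimes) and pass the result to handle() only when it is not null. Error responses (a 404, for instance) surface as an IOException from getInputStream(), so the loop would need to catch that.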

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



