Re: getting only the header

Ken Krugler Tue, 26 Jan 2010 05:42:47 -0800


On Jan 26, 2010, at 3:54am, Claudio Martella wrote:

As I mentioned in my previous post, I'm using HttpClient for a
web crawler I'm writing. At the moment I'm doing something like this:


   while (toVisit.size() > 0) {

       client.execute(method);
       // getContentType() does method.getResponseHeader("Content-Type").getValue()
       String mime = getContentType(method);

       if (supportedMimes.contains(mime)) {
           handle(method.getResponseBody());
       } else {
           continue;
       }
   }

The problem is that I can see the crawler spending a lot of time
processing URLs that are going to be ignored, so I guess it's
downloading the whole stream before ignoring it. Is there a way I can
download just the header, check the content type, and only then download
the stream (at the time of getResponseBody())?


See the code I'd previously referenced for an example of exactly that.

http://github.com/bixo/bixo/blob/master/src/main/java/bixo/fetcher/http/SimpleHttpFetcher.java

Make sure you abort the request if you skip reading the response body.
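
For readers without the linked source at hand, here is a rough sketch of the same idea: read the headers, check the content type, and drop the connection instead of reading a body you don't want. It is not the Bixo code and it does not use Apache HttpClient; it swaps in the JDK's own java.net.http client (Java 11+) so that it compiles with no extra jars. The class name HeaderFirstFetch, the helper baseMimeType() and the SUPPORTED_MIMES set are made up for illustration. The helper also strips parameters like "; charset=UTF-8", since a raw Content-Type value often carries them and would then miss an exact-match whitelist lookup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;
import java.util.Set;

class HeaderFirstFetch {
    // Hypothetical whitelist, standing in for supportedMimes in the question.
    static final Set<String> SUPPORTED_MIMES = Set.of("text/html", "text/plain");

    // Strip parameters such as "; charset=UTF-8" so the whitelist lookup matches.
    static String baseMimeType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the body for supported types, or null when the response was skipped.
    static byte[] fetchIfSupported(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // With ofInputStream(), send() returns once the headers have been read;
        // the body is only pulled in when the stream is consumed.
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream body = response.body()) {
            String mime = baseMimeType(
                    response.headers().firstValue("Content-Type").orElse(null));
            if (!SUPPORTED_MIMES.contains(mime)) {
                // Leaving the try block closes the unread stream, which abandons
                // the rest of the transfer (the "abort" step).
                return null;
            }
            return body.readAllBytes();
        }
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (String url : args) {
            byte[] body = fetchIfSupported(client, url);
            System.out.println(url + " -> "
                    + (body == null ? "skipped" : body.length + " bytes"));
        }
    }
}
```

With Apache HttpClient the counterpart of the early return is calling abort() on the request object, as noted above. Closing the stream early may cost you the pooled connection, which is usually a fair trade against downloading a large file you are going to ignore.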

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
