Hello Oleg, hello Ken, hello Sam, thank you very much for your help!
Please allow me to ask one further question. Suppose the DefaultHttpClient is used on a per-website basis: I create a new instance of DefaultHttpClient for downloading one specific website (www.a.com), then create another DefaultHttpClient for a second website (www.b.com), and each DefaultHttpClient is used with the ThreadSafeClientConnManager. Do I then have to explicitly shut down the DefaultHttpClient somehow?

(The JavaDoc states that when the DefaultHttpClient is used with no explicitly set connection manager, getConnectionManager().shutdown() should be called, as it implicitly creates a SingleClientConnManager.)

Is my assumption correct that when I use the TSCCM (with the DefaultHttpClient), I do not have to do anything at all to avoid leaking resources once I no longer need the DefaultHttpClient instance? HttpClient seems to be a very heavyweight object, so maybe there are other resources I have to free or shut down manually?

I very much appreciate your help, and I have started to refactor my application. However, I then realized that I need a dedicated UserAgent for every website I crawl. Using a "shared DefaultHttpClient" (one instance for the whole application) with dedicated HttpContexts per website/thread doesn't work, as I sadly can't set the UserAgent on the HttpContext level. The UserAgent only seems to be settable on the HttpClient or HttpMethod level. Would it be a reasonable feature request to also allow HttpParams to be set on the HttpContext level, where they would take precedence over all other (already specified) parameters?

Thank you very much!
Jens

2010/1/28 Oleg Kalnichevski <[email protected]>

> On Wed, 2010-01-27 at 20:42 +0100, Jens Mueller [email protected] wrote:
> > Hello HC Experts,
> >
> > I would be very grateful for advice regarding my question. I have already
> > spent a lot of time searching the internet, but I still have not found an
> > example that answers my questions. There are a lot of examples available
> > (also for the multithreaded use cases), but they only address the use case
> > of making one(!!) request. I am completely uncertain how to "best" make a
> > series of requests (to the same webserver).
> >
> > I need to develop a simple crawler that crawls some websites for specific
> > information. The basic idea is to download the single webpages of a
> > website (for example www.a.com) sequentially, but to run several of these
> > "sequential" downloaders in threads for different websites (www.b.com and
> > www.c.com) in parallel.
> >
> > My current concept/implementation looks like this:
> >
> > 1. Instantiate a ThreadSafeClientConnManager (with a lot of default
> > parameters). This connection manager will be used/shared by all
> > DefaultHttpClients.
> > 2. For every webpage (of a website with multiple webpages), I instantiate
> > a new DefaultHttpClient for every(!!) webpage request and then call the
> > "httpClient.execute(httpGet)" method with the instantiated GetMethod(url).
> >
> > ==> I am more and more wondering if this is the correct usage of the
> > DefaultHttpClient and the .execute() method. Am I doing something wrong
> > here by instantiating a new DefaultHttpClient for every request of a
> > webpage? Or should I rather instantiate only one(!!) DefaultHttpClient and
> > then share it for the sequential .execute() calls?
> >
> > To be honest, what I also have not really understood yet is the cookie
> > management. Do I as the programmer have to instantiate the CookieStore
> > manually
> > 1. httpClient.setCookieStore(new BasicCookieStore());
> > and then, after calling the .execute() method, "get" the cookie store
> > 2. savedcookies = httpClient.getCookieStore()
> > and then reinject this cookie store for the next call to the same webpage
> > (to maintain state)?
> > 3. httpClient.setCookie(savedcookies)
> > Or is there some implicit magic that A) creates the cookie store
> > implicitly and B) somehow shares this CookieStore among the HttpClients
> > and/or HttpGets?
> >
> > Thank you very much!!
> > Jens
>
> Jens,
>
> Re-use HttpClient instance for all execution threads but create a
> separate HttpContext and CookieStore per thread of execution /
> individual user, as described by Ken.
>
> Oleg
