Re: Using HttpClient to mimic a POST network request
I realize the reason it doesn't work is that HttpClient performs URL encoding on my payload. If I instead use a raw socket:

    try {
        Socket socket = new Socket("www.xxx.com", 2);
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        final String body = "[SORT]=0,1,0,10,5,0,KL,0"
            + "[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82"
            + "[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL";
        final int length = body.length();
        final String s = "POST /%5bvUpJYKw4QvGRMBmhATUxRwv4JrU9aDnwNEuangVyy6OuHxi2YiY=%5dImage? HTTP/1.1\r\n"
            + "Content-Type: application/x-www-form-urlencoded\r\n"
            + "Cache-Control: no-cache\r\n"
            + "Pragma: no-cache\r\n"
            + "User-Agent: Mozilla/4.0 (Windows XP 5.1) Java/1.6.0_06\r\n"
            + "Host: www.xxx.com:2\r\n"
            + "Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2\r\n"
            + "Connection: keep-alive\r\n"
            + "Content-Length: " + length + "\r\n\r\n"
            + body;
        // print, not println: the request must end exactly after the body
        out.print(s);
        out.flush();
        BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
        while (true) {
            String ss = in.readLine();
            if (ss == null) break;
            System.out.println(ss);
        }
    } catch (Exception exp) {
        exp.printStackTrace();
    }

then it works, with this payload:

    HttpContentType = application/x-www-form-urlencoded
    [SORT]: 0,1,0,10,5,0,KL,0
    [FIELD]: 33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82
    [LIST]: 1155.KL,1295.KL,7191.KL,0097.KL,2267.KL

May I know how I can disable URL encoding on my POST payload?

Thanks and Regards
Yan Cheng Cheok

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscr...@hc.apache.org
For additional commands, e-mail: httpclient-users-h...@hc.apache.org
Is HttpClient suitable for the following task?
I talk to a server by telnetting to it and sending the following command through the telnet terminal:

    POST /%5bvUpJYKw4QvGRMBmhATUxRwv4JrU9aDnwNEuangVyy6OuHxi2YiY=%5dImage? HTTP/1.1
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 164

    [SORT]=0,1,0,10,5,0,KL,0[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL

This works fine. Now I would like to use HttpClient to talk to the server the same way I talk to it with telnet. The reason I want to use HttpClient, instead of a raw TCP socket, is that HttpClient supports NTLM. However, when I use the POST method with a NameValuePair:

    new NameValuePair("[SORT]", "0,1,0,10,5,0,KL,0")

the request becomes URL encoded, and the server does not understand URL-encoded requests:

    %5BSORT%5D: 0%2C1%2C0%2C10%2C5%2C0%2CKL%2C0

Is there any way I can avoid this?

Thanks and Regards
Yan Cheng Cheok
BasicAuthentication using HttpClient 4.0
Hi, I am trying to send a GET request to http://abcdef.4y3...@hostname.com/abc.php but I am not sure how to send a request to this URL. I am trying:

    HttpGet httpGet = new HttpGet("http://abcdef.4y3...@hostname.com/abc.php");

but this does not seem to be working. Any ideas?

Regards,
Kashif

--
View this message in context: http://www.nabble.com/BasicAuthentication-using-HttpClient-4.0-tp25578027p25578027.html
Sent from the HttpClient-User mailing list archive at Nabble.com.
Re: Parallel Webcrawler Implementation
Hi Tobi,

First, I'd suggest getting and reading through the sources of existing Java-based web crawlers. They all use HttpClient, and thus would provide much useful example code:

    Nutch (Apache)
    Droids (Apache)
    Heritrix (Archive)
    Bixo (http://bixo.101tec.com)

Some comments below:

On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:

> Hi Guys, I am working on a parallel web crawler implementation in Java. I could use some help with some design questions and a bug that is costing me sleep ;-)
>
> First, my design: I have a list which stores URLs that have already been crawled. Further, I have a queue which provides the crawler with the next URL to fetch. Then I have a ThreadController which spawns new crawler threads until a maximum number is reached. Finally, the crawler threads process the URLs handed out by the queue. They work until the queue size is zero, and then the system stops.
>
> My question: I am using (basically) the following statements. As I am new to HttpClient this is probably a dumb approach, and I am happy for feedback.
>
>     // snip from WebCrawlerThread
>     DefaultHttpClient client;
>     HttpGet get;
>
>     public void run() {
>         client = new DefaultHttpClient();
>         HttpResponse response = client.execute(get);
>         HttpEntity entity = response.getEntity();
>         String mimetype = entity.getContentType().getValue();
>         String rawPage = EntityUtils.toString(entity);
>         client.getConnectionManager().shutdown();
>         // (...) doing crawler things
>     }
>     // /snap
>
> First thing: Is the thread the right place to host the client object, or should it be shared?

You should use the ThreadSafeClientConnManager, and reuse the same DefaultHttpClient instance for all threads. See the init() method of Bixo's SimpleHttpFetcher class for an example of setting this up.

> Second: Would it enhance performance if I reuse the connection somehow?

Yes, via keep-alive. Though you then have to be a bit more careful about handling stale connections (ones that the server has shut down). Again, take a look at the Bixo SimpleHttpFetcher class for some code that tries (at least) to do this properly.

> And most important, the bug: with an increasing number of pages I receive zillions of java.net.BindException: Address already in use: connect

No idea, sorry. But I think that by default HttpClient limits the number of parallel requests to one host to two. Not sure if that would be a factor in your case, given how you're creating a new client for each request.

-- Ken

--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-210-6378
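The shared-client setup described above can be sketched roughly as follows, assuming the HttpClient 4.0 API (the connection limits and the factory class name are illustrative, not taken from Bixo):

```java
import org.apache.http.conn.params.ConnManagerParams;
import org.apache.http.conn.params.ConnPerRouteBean;
import org.apache.http.conn.scheme.PlainSocketFactory;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;
import org.apache.http.params.BasicHttpParams;
import org.apache.http.params.HttpParams;

public class SharedClientFactory {
    public static DefaultHttpClient create() {
        HttpParams params = new BasicHttpParams();
        // illustrative limits: raise the default of 2 connections per host
        ConnManagerParams.setMaxTotalConnections(params, 100);
        ConnManagerParams.setMaxConnectionsPerRoute(params, new ConnPerRouteBean(10));

        SchemeRegistry registry = new SchemeRegistry();
        registry.register(new Scheme("http", PlainSocketFactory.getSocketFactory(), 80));

        // thread-safe manager: one client instance can then be shared
        // by all crawler threads instead of one client per thread
        ThreadSafeClientConnManager cm = new ThreadSafeClientConnManager(params, registry);
        return new DefaultHttpClient(cm, params);
    }
}
```

Each crawler thread would then call execute() on the one shared client; the manager hands out and pools connections per route on its own.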
Re: Parallel Webcrawler Implementation
On 24/09/2009, Ken Krugler kkrugler_li...@transpac.com wrote:
> On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:
>> [snip]
>> And most important, the bug: with an increasing number of pages I receive zillions of java.net.BindException: Address already in use: connect

I've seen this error generated when a WinXP host runs out of sockets, i.e. the message is misleading in this case.

> No idea, sorry. But I think that by default HttpClient limits the number of parallel requests to one host to two. Not sure if that would be a factor in your case, given how you're creating a new client for each request.
>
> -- Ken
Re: Using HttpClient to mimic a POST network request
Yan Cheng Cheok wrote:
> I realize the reason it doesn't work is that HttpClient performs URL encoding on my payload.
> [snip]
> May I know how I can disable URL encoding on my POST payload?

You should implement a custom RequestEntity class and encode the request entity any way you please.

Oleg
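A minimal sketch of Oleg's suggestion, assuming the HttpClient 3.x API that RequestEntity belongs to. The bundled StringRequestEntity already writes the string verbatim with no URL encoding, so a custom class is only needed for more exotic bodies; the URL below is a placeholder for the real endpoint:

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.StringRequestEntity;

public class RawBodyPost {
    public static void main(String[] args) throws Exception {
        String body = "[SORT]=0,1,0,10,5,0,KL,0"
                + "[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,"
                + "89,90,91,92,59,60,61,62,79,80,81,82"
                + "[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL";

        // placeholder URL -- substitute the real host, port, and path
        PostMethod post = new PostMethod("http://www.xxx.com:2/somePath");

        // StringRequestEntity sends the body exactly as given:
        // no NameValuePair, therefore no URL encoding of [, ] or commas
        post.setRequestEntity(new StringRequestEntity(body,
                "application/x-www-form-urlencoded", "US-ASCII"));
        try {
            new HttpClient().executeMethod(post);
            System.out.println(post.getResponseBodyAsString());
        } finally {
            post.releaseConnection();
        }
    }
}
```

Since the headers (including Content-Length) are generated by HttpClient itself, this should also keep working once NTLM authentication is switched on.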
Re: Is HttpClient suitable for the following task?
Yan Cheng Cheok wrote:
> [snip]
> However, when I use the POST method with a NameValuePair, the request becomes URL encoded. The server does not understand URL-encoded requests. Is there any way I can avoid this?

Yes, there is. See my previous post.

Oleg
Re: Parallel Webcrawler Implementation
sebb wrote:
> On 24/09/2009, Ken Krugler kkrugler_li...@transpac.com wrote:
>> On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:
>>> [snip]
>>> And most important, the bug: with an increasing number of pages I receive zillions of java.net.BindException: Address already in use: connect
>
> I've seen this error generated when a WinXP host runs out of sockets, i.e. the message is misleading in this case.

Which is hardly surprising, given that the crawler creates a new connection for EACH request / link.

Oleg
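The per-request connection churn diagnosed above can be avoided by sharing one client and letting each request return its connection to the pool. A sketch assuming HttpClient 4.0 (the fetch helper and its name are illustrative, not from the thread):

```java
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public class FetchOnce {
    // 'client' is the single shared instance built on a
    // ThreadSafeClientConnManager, as suggested earlier in the thread
    static String fetch(HttpClient client, String url) throws Exception {
        HttpGet get = new HttpGet(url);
        HttpResponse response = client.execute(get);
        HttpEntity entity = response.getEntity();
        // EntityUtils.toString() reads the entity to EOF, which releases
        // the underlying connection back to the pool for reuse --
        // no per-request shutdown() of the connection manager, and no
        // fresh ephemeral port (the cause of the BindException storm)
        return entity != null ? EntityUtils.toString(entity) : null;
    }
}
```

The connection manager's shutdown() then belongs at crawler shutdown, not inside run().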
Re: BasicAuthentication using HttpClient 4.0
Tobias N. Sasse wrote:
> You should provide an error message.

Thanks. I am not getting any error messages, but the output is not right. When I paste http://abcdef.4y3...@hostname.com/abc.php into the browser, I get the right output. But when I run the client, the output is not what I am expecting.
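As far as I know, HttpClient 4.0 does not automatically turn userinfo embedded in the URL into Basic credentials the way a browser does; they have to be registered with the client's CredentialsProvider. A sketch, with user name and password as placeholders for whatever "abcdef.4y3..." stands for:

```java
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class BasicAuthGet {
    public static void main(String[] args) throws Exception {
        DefaultHttpClient client = new DefaultHttpClient();
        // register the credentials instead of embedding them in the URL
        client.getCredentialsProvider().setCredentials(
                new AuthScope("hostname.com", 80),
                new UsernamePasswordCredentials("user", "password")); // placeholders
        // the URL itself then carries no userinfo
        HttpGet get = new HttpGet("http://hostname.com/abc.php");
        // client.execute(get) will answer the server's 401 challenge
        // with the registered Basic credentials
    }
}
```

If the server expects the credentials on the very first request (no 401 challenge), preemptive authentication has to be set up additionally; that may explain getting a different response than the browser shows.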