Re: Using HttpClient to mimic a POST network request

2009-09-24 Thread Yan Cheng Cheok
I realize the reason it doesn't work is that

HttpClient performs URL encoding explicitly on my payload. If I instead use a raw socket, like this:

try {
    Socket socket = new Socket("www.xxx.com", 2);
    PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
    final String body =
        "[SORT]=0,1,0,10,5,0,KL,0&[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82&[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL";
    final int length = body.length();
    final String s = "POST /%5bvUpJYKw4QvGRMBmhATUxRwv4JrU9aDnwNEuangVyy6OuHxi2YiY=%5dImage? HTTP/1.1\r\n"
        + "Content-Type: application/x-www-form-urlencoded\r\n"
        + "Cache-Control: no-cache\r\n"
        + "Pragma: no-cache\r\n"
        + "User-Agent: Mozilla/4.0 (Windows XP 5.1) Java/1.6.0_06\r\n"
        + "Host: www.xxx.com:2\r\n"
        + "Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2\r\n"
        + "Connection: keep-alive\r\n"
        + "Content-Length: " + length + "\r\n\r\n"
        + body;
    out.println(s);

    BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
    while (true) {
        String ss = in.readLine();
        if (ss == null) break;
        System.out.println(ss);
    }
}
catch (Exception exp) {
    exp.printStackTrace();
}

It will work with this payload:

  - payload: HttpContentType =  application/x-www-form-urlencoded
 [SORT]: 0,1,0,10,5,0,KL,0
 [FIELD]: 
33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82
 [LIST]: 1155.KL,1295.KL,7191.KL,0097.KL,2267.KL

May I know how I can disable URL encoding on my POST payload?

Thanks and Regards
Yan Cheng Cheok


  


-
To unsubscribe, e-mail: httpclient-users-unsubscr...@hc.apache.org
For additional commands, e-mail: httpclient-users-h...@hc.apache.org



Is HttpClient suitable for the following task?

2009-09-24 Thread Yan Cheng Cheok
I try to talk to a server by telnetting to it, sending the following command
through the telnet terminal:




POST /%5bvUpJYKw4QvGRMBmhATUxRwv4JrU9aDnwNEuangVyy6OuHxi2YiY=%5dImage? HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Content-Length: 164

[SORT]=0,1,0,10,5,0,KL,0&[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82&[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL
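As a side note, the Content-Length of 164 only matches this body when '&' separators join the three parameters (characters the mail archive tends to strip when escaping a message); a quick pure-Java check:

```java
public class ContentLengthCheck {
    // The body as sent over telnet, with the '&' separators that the mail
    // archive appears to have stripped from the quoted request.
    static final String BODY =
        "[SORT]=0,1,0,10,5,0,KL,0"
        + "&[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82"
        + "&[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL";

    public static void main(String[] args) {
        // Matches the Content-Length header in the request above.
        System.out.println(BODY.length());
    }
}
```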



This works fine. Now I wish to use HttpClient to talk to the server the same
way I do with telnet. The reason I want HttpClient instead of a raw TCP socket
is that HttpClient supports NTLM.

However, when I use the POST method with a NameValuePair:

new NameValuePair("[SORT]", "0,1,0,10,5,0,KL,0")

the request becomes URL encoded, and the server doesn't understand the
URL-encoded request:

%5BSORT%5D: 0%2C1%2C0%2C10%2C5%2C0%2CKL%2C0
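That encoding can be reproduced with the JDK's own java.net.URLEncoder, which is essentially what HttpClient applies to each name/value pair when posting form data:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingDemo {
    // Encode one name/value pair the way application/x-www-form-urlencoded
    // posting does: '[' -> %5B, ']' -> %5D, ',' -> %2C.
    static String encodePair(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "="
                 + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(encodePair("[SORT]", "0,1,0,10,5,0,KL,0"));
    }
}
```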

Is there any way I can avoid this?

Thanks and Regards
Yan Cheng Cheok


  





BasicAuthentication using HttpClient 4.0

2009-09-24 Thread kash_meu

Hi,

I am trying to send a GET request to
http://abcdef.4y3...@hostname.com/abc.php and I am not sure how to send a
request to this URL. I am trying this:

HttpGet httpGet = new HttpGet("http://abcdef.4y3...@hostname.com/abc.php");

But this does not seem to be working. Any ideas?
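For reference, HttpClient 4.0 does not extract user:password credentials from the URL the way a browser does; they are normally registered with the credentials provider. A sketch under that assumption ("user" and "password" are placeholders, since the real userinfo is truncated in the post):

```java
// Hedged sketch for HttpClient 4.0 Basic authentication: supply the
// credentials explicitly rather than embedding them in the URL.
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class BasicAuthGet {
    public static void main(String[] args) throws Exception {
        DefaultHttpClient client = new DefaultHttpClient();
        // "user"/"password" are placeholders for the truncated userinfo.
        client.getCredentialsProvider().setCredentials(
                new AuthScope("hostname.com", 80),
                new UsernamePasswordCredentials("user", "password"));

        HttpGet get = new HttpGet("http://hostname.com/abc.php");
        HttpResponse response = client.execute(get);
        System.out.println(response.getStatusLine());
        System.out.println(EntityUtils.toString(response.getEntity()));
        client.getConnectionManager().shutdown();
    }
}
```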

Regards,
Kashif


Re: Parallel Webcrawler Implementation

2009-09-24 Thread Ken Krugler

Hi Tobi,

First, I'd suggest getting and reading through the sources of existing  
Java-based web crawlers. They all use HttpClient, and thus would  
provide much useful example code:


Nutch (Apache)
Droids (Apache)
Heritrix (Archive)
Bixo (http://bixo.101tec.com)

Some comments below:

On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:


Hi Guys,

I am working on a parallel webcrawler implementation in Java. I  
could use some help with some design questions and a bug that is  
costing me sleep ;-)


First thing, this is my design: I have a list which stores URLs  
that have been crawled already. Further, I have a queue which is  
responsible for providing the crawler with the next URL to fetch. Then  
I have a ThreadController which spawns new crawler threads until a  
maximum number is reached. Finally there are crawler threads that  
process a URL given by the queue. They work until the queue size is  
zero and then the system stops.


Here is my question: I am using (basically) the following  
statements. As I am new to HttpClient this is probably a dumb  
approach, and I am happy for feedback.


snip from WebCrawlerThread
DefaultHttpClient client;
HttpGet get;

  public void run() {
  client = new DefaultHttpClient();
  HttpResponse response = client.execute(get);
  HttpEntity entity = response.getEntity();
  String mimetype = entity.getContentType().getValue();
  String rawPage = EntityUtils.toString(entity);
  client.getConnectionManager().shutdown();

 (...) doing crawler things
  }
/snap

First thing: Is the thread the right place to host the client  
object, or should it be shared?


You should use the ThreadSafeClientConnManager, and reuse the same  
DefaultHttpClient instance for all threads.


See the init() method of Bixo's SimpleHttpFetcher class for an example  
of setting this up.
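For reference, a minimal sketch of that setup with the HttpClient 4.0 API (the pool sizes are illustrative, not recommendations):

```java
// Sketch: one HttpClient 4.0 instance, backed by a thread-safe connection
// pool, shared by all crawler threads.
import org.apache.http.conn.params.ConnManagerParams;
import org.apache.http.conn.params.ConnPerRouteBean;
import org.apache.http.conn.scheme.PlainSocketFactory;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;
import org.apache.http.params.BasicHttpParams;
import org.apache.http.params.HttpParams;

public class SharedClientFactory {
    public static DefaultHttpClient create() {
        HttpParams params = new BasicHttpParams();
        ConnManagerParams.setMaxTotalConnections(params, 100);
        // The default is 2 connections per route; raise it if the crawler
        // hits a small number of hosts heavily.
        ConnManagerParams.setMaxConnectionsPerRoute(params, new ConnPerRouteBean(10));

        SchemeRegistry registry = new SchemeRegistry();
        registry.register(new Scheme("http", PlainSocketFactory.getSocketFactory(), 80));

        // Thread-safe pooling manager: connections are reused across requests,
        // which also helps avoid exhausting local ports.
        ThreadSafeClientConnManager cm = new ThreadSafeClientConnManager(params, registry);
        return new DefaultHttpClient(cm, params);
    }
}
```

All crawler threads would then call execute() on the one client returned here, instead of constructing a DefaultHttpClient per request.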


Second: Would it enhance performance if I reuse the connection  
somehow?


Yes, via keep-alive. Though you then have to be a bit more careful  
about handling stale connections (ones that the server has shut down).


Again, take a look at the Bixo SimpleHttpFetcher class for some code  
that tries (at least) to do this properly.


And most important the bug: With increasing number of pages I  
receive zillions of


java.net.BindException: Address already in use: connect


No idea, sorry.

But I think that by default HttpClient limits the number of parallel  
requests to a single host to two. Not sure if that would be a factor in  
your case, given how you're creating a new client for each request.


-- Ken



--
Ken Krugler
TransPac Software, Inc.
http://www.transpac.com
+1 530-210-6378



Re: Parallel Webcrawler Implementation

2009-09-24 Thread sebb
On 24/09/2009, Ken Krugler kkrugler_li...@transpac.com wrote:
 Hi Tobi,

  First, I'd suggest getting and reading through the sources of existing
 Java-based web crawlers. They all use HttpClient, and thus would provide
 much useful example code:

  Nutch (Apache)
  Droids (Apache)
  Heritrix (Archive)
  Bixo (http://bixo.101tec.com)

  Some comments below:

  On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:


  Hi Guys,
 
  I am working on a parallel webcrawler implementation in Java. I could use
 some help with some design question and a bug that takes my sleep ;-)
 
  First thing, this is my design: I have a list, which stores URL's that
 have been crawled already. Furhter I have a Queue which is responsible to
 provide the crawler with the next URL to fetch. Then I have a
 ThreadController which spawns new crawler-threads until a maximum number is
 reached. Finally there are crawler-threads that process a URL given by the
 queue. They work until the queue size is zero and then the system stops.
 
  Following is my question: I am using (basically) the following statements.
 As I am new to httpclient this could probably a dump approach, and I am
 happy for feedback.
 
  snip from WebCrawlerThread
  DefaultHttpClient client;
  HttpGet get;
 
   public run() {
   client = new DefaultHttpClient();
   HttpResponse response = client.execute(get);
   HttpEntity entity = response.getEntity();
   String mimetype =
 entity.getContentType().getValue();
   String rawPage = EntityUtils.toString(entity);
   client.getConnectionManager().shutdown();
 
  (...) doing crawler things
   }
  /snap
 
  First thing: Is the thread the right place to host the client object, or
 should it be shared?
 

  You should use the ThreadSafeClientConnManager, and reuse the same
 DefaultHttpClient instance for all threads.

  See the init() method of Bixo's SimpleHttpFetcher class for an example of
 setting this up.


  Second: Would it enhance performance if I reuse the connection somehow?
 

  Yes, via keep-alive. Though you then have to be a bit more careful about
 handling stale connections (ones that the server has shut down).

  Again, take a look at the Bixo SimpleHttpFetcher class for some code that
 tries (at least) to do this properly.


  And most important the bug: With increasing number of pages I receive
 zillions of
 
  java.net.BindException: Address already in use: connect
 

I've seen this error generated when a WinXP host runs out of sockets;
i.e. the message is misleading in this case.

  No idea, sorry.

  But I think that by default HttpClient limits the number of parallel
 request to one host to be two. Not sure if that would be a factor in your
 case, given how you're creating a new client for each request.

  -- Ken



  --
  Ken Krugler
  TransPac Software, Inc.
  http://www.transpac.com
  +1 530-210-6378






Re: Using HttpClient to mimic a POST network request

2009-09-24 Thread Oleg Kalnichevski

Yan Cheng Cheok wrote:

I realize the reason why it doesn't work is because,

HttpClient perform URL encoding explicitly on my payload. If I try to 


try {
    Socket socket = new Socket("www.xxx.com", 2);
    PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
    final String body =
        "[SORT]=0,1,0,10,5,0,KL,0&[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82&[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL";
    final int length = body.length();
    final String s = "POST /%5bvUpJYKw4QvGRMBmhATUxRwv4JrU9aDnwNEuangVyy6OuHxi2YiY=%5dImage? HTTP/1.1\r\n"
        + "Content-Type: application/x-www-form-urlencoded\r\n"
        + "Cache-Control: no-cache\r\n"
        + "Pragma: no-cache\r\n"
        + "User-Agent: Mozilla/4.0 (Windows XP 5.1) Java/1.6.0_06\r\n"
        + "Host: www.xxx.com:2\r\n"
        + "Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2\r\n"
        + "Connection: keep-alive\r\n"
        + "Content-Length: " + length + "\r\n\r\n"
        + body;
    out.println(s);

    BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
    while (true) {
        String ss = in.readLine();
        if (ss == null) break;
        System.out.println(ss);
    }
}
catch (Exception exp) {
    exp.printStackTrace();
}

It will work with payload :

  - payload: HttpContentType =  application/x-www-form-urlencoded
 [SORT]: 0,1,0,10,5,0,KL,0
 [FIELD]: 
33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82
 [LIST]: 1155.KL,1295.KL,7191.KL,0097.KL,2267.KL

May I know how I can disable URL encoding on my POST payload?



You should implement a custom RequestEntity class and encode the request 
entity any way you please.
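For HttpClient 3.x, the stock StringRequestEntity is one ready-made way to do this without writing a custom RequestEntity: it writes the given string verbatim, with no form encoding applied. A sketch (the '&' separators are assumed, since the archive stripped them from the quoted body; host, port, and path are as posted):

```java
// Sketch for HttpClient 3.x: send the POST body verbatim, bypassing the
// NameValuePair form encoding.
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.StringRequestEntity;

public class RawBodyPost {
    public static void main(String[] args) throws Exception {
        // '&' separators assumed; the mail archive stripped them.
        String body = "[SORT]=0,1,0,10,5,0,KL,0"
            + "&[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,"
            + "89,90,91,92,59,60,61,62,79,80,81,82"
            + "&[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL";

        PostMethod post = new PostMethod(
            "http://www.xxx.com:2/%5bvUpJYKw4QvGRMBmhATUxRwv4JrU9aDnwNEuangVyy6OuHxi2YiY=%5dImage?");
        // StringRequestEntity sends the string as-is; nothing is URL encoded.
        post.setRequestEntity(new StringRequestEntity(
                body, "application/x-www-form-urlencoded", "US-ASCII"));

        HttpClient client = new HttpClient();
        try {
            int status = client.executeMethod(post);
            System.out.println(status);
            System.out.println(post.getResponseBodyAsString());
        } finally {
            post.releaseConnection();
        }
    }
}
```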


Oleg



Thanks and Regards
Yan Cheng Cheok


  






Re: Is HttpClient suitable for the following task?

2009-09-24 Thread Oleg Kalnichevski

Yan Cheng Cheok wrote:

I try to talk to a server, by telneting to it, and send the following command 
through telnet terminal :




POST /%5bvUpJYKw4QvGRMBmhATUxRwv4JrU9aDnwNEuangVyy6OuHxi2YiY=%5dImage? HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Content-Length: 164

[SORT]=0,1,0,10,5,0,KL,0&[FIELD]=33,38,51,58,68,88,78,98,99,101,56,57,69,70,71,72,89,90,91,92,59,60,61,62,79,80,81,82&[LIST]=1155.KL,1295.KL,7191.KL,0097.KL,2267.KL



This works very fine. Now, I wish I can use HttpClient, to talk to the server, 
as I use telnet to talk to the server. The reason I wish to use HttpClient, 
instead of using raw TCP socket, is because HttpClient does support NTLM.

However, when I use POST method with NameValuePair :

new NameValuePair("[SORT]", "0,1,0,10,5,0,KL,0")

The request will become URL encoded. The server doesn't understand URL encoded 
request.

%5BSORT%5D: 0%2C1%2C0%2C10%2C5%2C0%2CKL%2C0

Is there any way I can avoid this?

Thanks and Regards
Yan Cheng Cheok




Yes, there is. See my previous post.

Oleg


  






Re: Parallel Webcrawler Implementation

2009-09-24 Thread Oleg Kalnichevski

sebb wrote:

On 24/09/2009, Ken Krugler kkrugler_li...@transpac.com wrote:

Hi Tobi,

 First, I'd suggest getting and reading through the sources of existing
Java-based web crawlers. They all use HttpClient, and thus would provide
much useful example code:

 Nutch (Apache)
 Droids (Apache)
 Heritrix (Archive)
 Bixo (http://bixo.101tec.com)

 Some comments below:

 On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:



Hi Guys,

I am working on a parallel webcrawler implementation in Java. I could use

some help with some design question and a bug that takes my sleep ;-)

First thing, this is my design: I have a list, which stores URL's that

have been crawled already. Furhter I have a Queue which is responsible to
provide the crawler with the next URL to fetch. Then I have a
ThreadController which spawns new crawler-threads until a maximum number is
reached. Finally there are crawler-threads that process a URL given by the
queue. They work until the queue size is zero and then the system stops.

Following is my question: I am using (basically) the following statements.

As I am new to httpclient this could probably a dump approach, and I am
happy for feedback.

snip from WebCrawlerThread
DefaultHttpClient client;
HttpGet get;

 public run() {
 client = new DefaultHttpClient();
 HttpResponse response = client.execute(get);
 HttpEntity entity = response.getEntity();
 String mimetype =

entity.getContentType().getValue();

 String rawPage = EntityUtils.toString(entity);
 client.getConnectionManager().shutdown();

(...) doing crawler things
 }
/snap

First thing: Is the thread the right place to host the client object, or

should it be shared?
 You should use the ThreadSafeClientConnManager, and reuse the same
DefaultHttpClient instance for all threads.

 See the init() method of Bixo's SimpleHttpFetcher class for an example of
setting this up.



Second: Would it enhance performance if I reuse the connection somehow?


 Yes, via keep-alive. Though you then have to be a bit more careful about
handling stale connections (ones that the server has shut down).

 Again, take a look at the Bixo SimpleHttpFetcher class for some code that
tries (at least) to do this properly.



And most important the bug: With increasing number of pages I receive

zillions of

java.net.BindException: Address already in use: connect



I've seen this error generated when a WinXP host runs out of sockets.
i.e. the message is misleading in this case.



Which is hardly surprising, given that the crawler creates a new 
connection for EACH request / link.


Oleg




 No idea, sorry.

 But I think that by default HttpClient limits the number of parallel
request to one host to be two. Not sure if that would be a factor in your
case, given how you're creating a new client for each request.

 -- Ken



 --
 Ken Krugler
 TransPac Software, Inc.
 http://www.transpac.com
 +1 530-210-6378







Re: BasicAuthentication using HttpClient 4.0

2009-09-24 Thread kash_meu



Tobias N. Sasse wrote:
 
 You should provide an error message..
 

Thanks. I am not getting any error messages, but the output is not right.
When I paste http://abcdef.4y3...@hostname.com/abc.php in the browser, I get
the right output. But on running the client, the output is not what I am
expecting.



