On Jan 29, 2010, at 3:35am, sebb wrote:

On 29/01/2010, Ken Krugler <[email protected]> wrote:

On Jan 28, 2010, at 10:09pm, amoldavsky wrote:



Hi Oleg,
Thank you for the quick reply.

So if there is a possibility that the whole buffer is not filled, how can I ensure or force HttpClient to fill the whole buffer? Should I maybe avoid
Stream Readers altogether?


If bufferSize is X, and the server document you're fetching has Y bytes,
then what do you mean by "force HttpClient to fill the whole buffer"?

At a minimum, you'd want

int bytesRead = chunkedIns.read(tmp);
if (bytesRead != -1) {
  return new String(tmp, 0, bytesRead);
}

But that also uses the platform default encoding for the character set,
which often won't be correct.
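
One way to avoid both problems is sketched below: wrap the stream in an InputStreamReader with an explicit charset and build each String from only the chars actually read. The ChunkReader class and its nextChunk method are hypothetical names for illustration, not part of HttpClient, and the right charset for a given response is an assumption the caller has to supply.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper: decodes a byte stream chunk by chunk with an
// explicit charset instead of the platform default.
class ChunkReader {
    private final Reader reader;

    ChunkReader(InputStream in, Charset charset) {
        // The reader keeps decoder state between calls, so a multi-byte
        // character split across two network reads is still decoded correctly.
        this.reader = new InputStreamReader(in, charset);
    }

    /** Returns the next chunk of decoded text, or null at end of stream. */
    String nextChunk(int bufferSize) throws IOException {
        char[] buf = new char[bufferSize];
        int charsRead = reader.read(buf);
        if (charsRead == -1) {
            return null;
        }
        // Use only the part of the buffer that was filled; the rest is
        // still zero, which is where the NUL characters come from.
        return new String(buf, 0, charsRead);
    }
}
```

Decoding through a Reader rather than calling new String(bytes, 0, n, charset) per chunk matters for encodings such as UTF-8, where a chunk boundary can fall in the middle of a character.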

However, if the user just wants to create a file with the contents of
the response, then surely there is no need to mess with encodings?
Just write the bytes to a file output stream without any conversion.
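
A minimal sketch of that byte-for-byte copy, assuming Java 7 or later for try-with-resources and java.nio.file. StreamToFile and copy are hypothetical names; the input stream would be whatever entity.getContent() returned.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: copies a stream to a file without any
// byte-to-char conversion, so no charset is involved.
class StreamToFile {
    /** Copies all bytes from in to target and returns the byte count. */
    static long copy(InputStream in, Path target, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            int n;
            // read() may return fewer bytes than the buffer holds,
            // so write exactly n bytes each time.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```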

Good point that streaming to disk avoids dealing with this issue up front.

Note though that you lose the context from the response headers that is often used to determine the correct encoding, for when you actually do decide to process the data. That's one reason why the arc/warc file formats (used for storing web crawl data) include the headers.
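
As a rough illustration of what that header context provides, the sketch below pulls the charset parameter out of a Content-Type header value and falls back to a caller-supplied default. ContentTypeCharset and fromHeader are hypothetical names, and the parsing is deliberately simplified; it is not how HttpClient itself does it.

```java
import java.nio.charset.Charset;

// Hypothetical helper: extracts the charset parameter from a
// Content-Type header value such as "text/csv; charset=ISO-8859-1".
class ContentTypeCharset {
    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim();
                // Strip optional surrounding quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

If only the body is saved to disk, the header value is gone by the time the data is processed, and the fallback is all that is left.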

--Ken


On Wed, 2010-01-27 at 20:24 -0800, amoldavsky wrote:

Hi

I have coded a simple file downloader using HttpClient 4.0.
It works fine but there is something wrong with the String encoding or
the buffer stream. The problem is that there are long sequences of "NULL"
(ANSI code 00) throughout the final file, like this:

http://old.nabble.com/file/p27350930/httpclient_error01.jpg

http://old.nabble.com/file/p27350930/httpclient_error02.jpg

Here is the main code:

public String getChunk(String url, int bufferSize) throws HTTPClientException
{
 if(!chunkedStarted)
 {
   chunkedIns = getInputStream(url);
   chunkedStarted = true;
 }

 byte[] tmp = new byte[bufferSize];
 try
 {
   if(chunkedIns.read(tmp) != -1)
   {


What makes you think that the entire buffer will be filled with data?

Oleg



     return new String(tmp);
   }
   else
   {
     finish();
     return null;
   }
 }
 catch(IOException e)
 {
   HTTPClientException e2 = new HTTPClientException(e.getMessage());
   e2.setStackTrace(e.getStackTrace());
   throw e2;
 }
}

public void finish()
{
 // do some cleaning
}

private InputStream getInputStream(String url) throws HTTPClientException
{
 InputStream instream = null;

 httpClient = new DefaultHttpClient();

 httpClient.getParams().setParameter("http.useragent", AGENT_NAME);

 HttpGet httpGet = new HttpGet(url);
 HttpResponse response = null;

 try
 {
   response = httpClient.execute(httpGet);
   HttpEntity entity = response.getEntity();

   if(entity != null)
   {
     instream = entity.getContent();
   }
 }
 catch(ClientProtocolException e)
 {
   HTTPClientException e2 = new HTTPClientException(e.getMessage());
   e2.setStackTrace(e.getStackTrace());
   throw e2;
 }
 catch(IOException e)
 {
   HTTPClientException e2 = new HTTPClientException(e.getMessage());
   e2.setStackTrace(e.getStackTrace());
   throw e2;
 }

 return instream;
}

getChunk and getInputStream could basically be one method, but I need to
split them for internal convenience; that does not change the
functionality as a whole.

It seems like either the conversion from bytes to string is a problem:
return new String(tmp);

or that the buffer is not getting filled to the end. The latter should
not be possible because the files are ~30MB each and the buffer size is 2KB.

I have attached the file; it's a CSV (shortened to ~6KB). Note the long
white space inside some of the URLs: if you just remove it, the URL
makes sense.
http://old.nabble.com/file/p27350930/datafeed.csv

Where can this white space (null) come from?

Thanks!





---------------------------------------------------------------------
To unsubscribe, e-mail:
[email protected]
For additional commands, e-mail:
[email protected]





--
View this message in context:
http://old.nabble.com/HttpClient-4.0-encoding-madness-tp27350930p27366928.html
Sent from the HttpClient-User mailing list archive at Nabble.com.






--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g










