Aled Sage created JCLOUDS-1187:

             Summary: Avoid excessive memory usage when processing massive http 
response message
                 Key: JCLOUDS-1187
             Project: jclouds
          Issue Type: Improvement
            Reporter: Aled Sage

With jclouds 1.9.3-SNAPSHOT and 2.0.0-SNAPSHOT (and all GA versions)...

If jclouds receives a crazily big http response (e.g. see, then jclouds consumes a huge 
amount of memory while processing that response. It holds multiple copies of 
the response in-memory at the same time.

As reported in, the memory 
usage (according to {{jmap -histo:live <pid>}}) is most char arrays (which is 
what backs the StringBuilder and String). When processing a 154MB response 
payload, the memory used by char arrays goes something like this: 66MB -> 104MB 
-> 323MB -> 625MB -> 939MB -> 461MB -> 633MB -> 327MB -> 21MB.
(I don't know how much to believe the {{:live}}: is that definitely the size of 
char arrays that could not be GC'ed?)

There are two main areas where this memory is consumed.

*Wire Log*

In the jclouds wire log, the following happens (approximately?!):

* We read the response input stream, writing it to a {{FileBackedOutputStream}} 
- nice!
* In {{Wire.wire(String, InputStream)}}, we read the {{FileBackedOutputStream}} 
into a {{StringBuilder}}, and then call 
  The StringBuilder holds the {{char[]}}; the toString() duplicates it - so two 
copies in memory.  
  Unfortunately in the Softlayer example, it's all one huge line, so we logged 
it all in one go.  
  I think (but need to dig into it more) that the logging framework (slf4j -> 
log4j in my case) ends up with multiple copies as well, while processing the 
call to {{log.debug(String)}}. (Hence peaking at 939MB of char arrays in 
  When the method returns, all these copies can be GC'ed.
* The response payload has now been switched to the {{FileBackedOutputStream}}, 
so that will be used subsequently.

*Json Parsing*

To parse the HTTP response:

* The response is passed to 
* This calls {{json.fromJson(Strings2.toStringAndClose(stream), type)}}.  
* The {{Strings2.toStringAndClose}} calls {{CharStreams.toString(new 
InputStreamReader(input, Charsets.UTF_8))}}.  
  This reads the stream into a StringBuilder, then calls toString - so holds 
two copies in memory.  
  This explains the second spike in memory usage (though I'm surprised it went 
as high as 633MB of char arrays in memory).  
  When the method returns, we have our one String.  

*Possible Improvements to Wire Log*

{{org.jclouds.logging.internal.Wire}} could be configurable to only log the 
first X bytes of a response (so crazily long messages would be truncated in the 

Alternatively/additionally, {{Wire.wire}} could force splitting a huge line 
into multiple log messages when calling 
{{getWireLog().debug(buffer.toString())}}. Again this could be configurable.

In production usage, I personally would always configure it to truncate: better 
to miss the end of the response rather than risk an {{OutOfMemoryError}}. Note 
this particular part isn't an issue if jclouds.wire is set to INFO or higher.

*Possible Improvements to Json Parsing*

I think {{ParseJson.apply(InputStream stream, Type type)}} should pass the 
{{new InputStreamReader(inputStream)}} to {{org.jclouds.json.Json.fromJson()}} 
(in a try-finally to close the stream, obviously).

This would require adding a new method to the interface 

The implementation of {{GsonWrapper}} can handle that easily: it can call 
{{gson.fromJson(Reader, Type)}} instead of {{gson.fromJson(String, Type)}}. It 
looks to me like that will parse the stream as it reads it, rather than having 
to hold the whole string in memory at once.

If we do these improvements, I don't think we'll ever hold the full 154MB char 
array in memory.

This message was sent by Atlassian JIRA

Reply via email to