Hi Tim,

My guess is that this contract isn't explicitly documented anywhere. But the good news is that the set of implementors and users of this API is fairly well contained.
I'd propose you do the following:

- Look for any dependent projects which use the Compressor API directly. I know HBase does. I believe Avro does. Hive and Pig might. Accumulo probably does. A Google Code Search or GitHub search for "import org.apache.hadoop.io.compress" would probably give a pretty exhaustive list.
- Throughout those, look to see if they all maintain the buffer between setInput and compress.
- If so, file a JIRA to document this as part of the compression API javadoc, and then we'll be more explicit about it from now on?

-Todd

On Sat, Dec 3, 2011 at 12:18 AM, Tim Broberg <tbrob...@yahoo.com> wrote:
> The question is, how long can a Compressor count on the user buffer to stick around after a call to setInput()?
>
> The Compressor object has a method, setInput whose inputs are an array reference, an offset and a length.
>
> I would expect that this input would no longer be guaranteed to persist after the setInput call returns.
>
> ...but in ZlibCompressor and SnappyCompressor, when there is no buffer room for len bytes, the Compressor makes a copy of the reference, offset, and length, clears the needsInput condition, and returns waiting for a call to compress() to unload the buffers through the compressor. The Compressor implementations count on the data to persist after setInput returns until compress() is called.
>
> So, the data persist after the call. Does all such data persist?
>
> In theory, could a Compressor avoid a copy by just collecting references to each input user buffer passed in and then sending all these references to the compression library when compress() is called?
>
> ...or do these user buffers get reused before that time?
>
> By keeping references to these buffers, am I preventing them from getting garbage collected and potentially soaking up large amounts of memory?
>
> Where is the persistence of the contents of these user buffers supposed to be documented?
>
> TIA,
> - Tim.
--
Todd Lipcon
Software Engineer, Cloudera
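
To make the question in the thread concrete, here is a minimal Java sketch of the hazard being discussed. It deliberately does not use the Hadoop classes: ToyCompressor, ReferenceHoldingCompressor, CopyingCompressor and reuseBufferBeforeCompress are hypothetical names made up for illustration, and the "compression" is just a byte copy. It only shows what a caller would observe if it reused its buffer between setInput() and compress(), under the assumption that one implementation keeps the reference and the other copies eagerly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BufferContractDemo {

    /** Hypothetical stand-in for a compressor; not the Hadoop interface. */
    interface ToyCompressor {
        void setInput(byte[] b, int off, int len);
        int compress(byte[] out, int off, int len);
    }

    /** Keeps only the reference, offset and length until compress() is called. */
    static final class ReferenceHoldingCompressor implements ToyCompressor {
        private byte[] userBuf;
        private int userOff;
        private int userLen;

        @Override
        public void setInput(byte[] b, int off, int len) {
            userBuf = b;
            userOff = off;
            userLen = len;
        }

        @Override
        public int compress(byte[] out, int off, int len) {
            if (userBuf == null) {
                return 0; // nothing pending
            }
            int n = Math.min(len, userLen);
            // "Compression" here is a plain copy, read from the caller's array.
            System.arraycopy(userBuf, userOff, out, off, n);
            userOff += n;
            userLen -= n;
            if (userLen == 0) {
                userBuf = null; // drop the reference so the array can be collected
            }
            return n;
        }
    }

    /** Copies the input eagerly, so the caller may reuse its buffer right away. */
    static final class CopyingCompressor implements ToyCompressor {
        private byte[] copy = new byte[0];
        private int pos;

        @Override
        public void setInput(byte[] b, int off, int len) {
            copy = Arrays.copyOfRange(b, off, off + len);
            pos = 0;
        }

        @Override
        public int compress(byte[] out, int off, int len) {
            int n = Math.min(len, copy.length - pos);
            System.arraycopy(copy, pos, out, off, n);
            pos += n;
            return n;
        }
    }

    /** Calls setInput(), overwrites the caller's buffer, then calls compress(). */
    static String reuseBufferBeforeCompress(ToyCompressor c) {
        byte[] userBuf = "first".getBytes(StandardCharsets.US_ASCII);
        c.setInput(userBuf, 0, userBuf.length);

        // The caller reuses its buffer before compress() has run.
        byte[] next = "OTHER".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(next, 0, userBuf, 0, next.length);

        byte[] out = new byte[16];
        int n = c.compress(out, 0, out.length);
        return new String(out, 0, n, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(reuseBufferBeforeCompress(new ReferenceHoldingCompressor()));
        System.out.println(reuseBufferBeforeCompress(new CopyingCompressor()));
    }
}
```

In this toy setup the reference-holding variant emits whatever is in the caller's array at compress() time, while the copying variant is unaffected by the reuse but pays for an extra copy and holds its own memory. Which behaviour callers may rely on is exactly the contract the thread suggests documenting.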