Thanks for the response, Todd.

I'll crawl the trunk for compressor / decompressor references today. If you 
think of any other versions that should be scanned, please chime in.

    - Tim.

________________________________________
From: Todd Lipcon [t...@cloudera.com]
Sent: Sunday, December 04, 2011 10:51 PM
To: common-dev@hadoop.apache.org; Tim Broberg
Subject: Re: Compressor setInput input permanence

Hi Tim,

My guess is that this contract isn't explicitly documented anywhere.
But the good news is that the set of implementors and users of this
API is fairly well contained.

I'd propose you do the following:
- Look for any dependent projects which use the Compressor API
directly. I know HBase does. I believe Avro does. Hive and Pig might.
Accumulo probably does. A google code search or github search for
"import org.apache.hadoop.io.compress" would probably give a pretty
exhaustive list.
- Throughout those, look to see whether they all maintain the buffer
between setInput and compress.
- If so, file a JIRA to document this as part of the compression API
javadoc, and then we'll be more explicit about it from now on?
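
For a local checkout of any of those projects, the import search above could
be sketched roughly as follows. This is only an illustration: the class name
CompressImportScan and its method names are made up here, and it only matches
plain import lines, so fully qualified uses or wildcard imports of a parent
package would be missed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: lists .java files under a root
// directory that import anything from org.apache.hadoop.io.compress.
public class CompressImportScan {
    static final String NEEDLE = "import org.apache.hadoop.io.compress";

    // Walk the tree and keep the .java files with a matching import line.
    static List<Path> findUsers(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.toString().endsWith(".java"))
                .filter(CompressImportScan::importsCompressApi)
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // ISO-8859-1 maps every byte to a character, so files in other
    // encodings do not abort the scan with a decoding error.
    static boolean importsCompressApi(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
            return lines.anyMatch(l -> l.trim().startsWith(NEEDLE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findUsers(root).forEach(System.out::println);
    }
}
```

Each hit would still need a manual read to see what the caller does with its
buffer between setInput and compress.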

-Todd


On Sat, Dec 3, 2011 at 12:18 AM, Tim Broberg <tbrob...@yahoo.com> wrote:
> The question is, how long can a Compressor count on the user buffer to stick 
> around after a call to setInput()?
>
> The Compressor object has a method, setInput, whose inputs are an array 
> reference, an offset, and a length.
>
> I would expect that this input would no longer be guaranteed to persist after 
> the setInput call returns.
>
> ...but in ZlibCompressor and SnappyCompressor, when there is no buffer room 
> for len bytes, the Compressor makes a copy of the reference, offset, and 
> length, clears the needsInput condition, and returns waiting for a call to 
> compress() to unload the buffers through the compressor. The Compressor 
> implementations count on the data to persist after setInput returns until 
> compress() is called.
>
> So, the data persist after the call. Does all such data persist?
>
> In theory, could a Compressor avoid a copy by just collecting references to 
> each input user buffer passed in and then sending all these references to the 
> compression library when compress() is called?
>
> ...or do these user buffers get reused before that time?
>
> By keeping references to these buffers, am I preventing them from getting 
> garbage collected and potentially soaking up large amounts of memory?
>
> Where is the persistence of the contents of these user buffers supposed to be 
> documented?
>
> TIA,
>     - Tim.
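
The buffer-reuse concern in the quoted message can be shown with a toy model.
The two classes below are hypothetical stand-ins, not Hadoop's Compressor
implementations: one keeps only the reference, offset, and length passed to
setInput, the other copies the bytes immediately. If the caller overwrites
its buffer before the data is drained, only the copying variant still sees
the original bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SetInputAliasingDemo {
    // Hypothetical: remembers the caller's array instead of copying it.
    static class RefHoldingSink {
        private byte[] userBuf;
        private int off, len;
        void setInput(byte[] b, int off, int len) {
            this.userBuf = b; this.off = off; this.len = len;
        }
        // Stands in for compress(): reads the bytes only now.
        byte[] drain() { return Arrays.copyOfRange(userBuf, off, off + len); }
    }

    // Hypothetical: copies the bytes before setInput returns.
    static class CopyingSink {
        private byte[] copy;
        void setInput(byte[] b, int off, int len) {
            copy = Arrays.copyOfRange(b, off, off + len);
        }
        byte[] drain() { return copy; }
    }

    // Simulates a caller that reuses its buffer right after setInput.
    static void reuse(byte[] buf) {
        byte[] next = "other".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(next, 0, buf, 0, next.length);
    }

    static String viaReference() {
        byte[] buf = "first".getBytes(StandardCharsets.US_ASCII);
        RefHoldingSink sink = new RefHoldingSink();
        sink.setInput(buf, 0, buf.length);
        reuse(buf);
        return new String(sink.drain(), StandardCharsets.US_ASCII);
    }

    static String viaCopy() {
        byte[] buf = "first".getBytes(StandardCharsets.US_ASCII);
        CopyingSink sink = new CopyingSink();
        sink.setInput(buf, 0, buf.length);
        reuse(buf);
        return new String(sink.drain(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("reference held: " + viaReference()); // other
        System.out.println("copied:         " + viaCopy());      // first
    }
}
```

On the garbage collection question: as long as a sink like RefHoldingSink
holds the reference, the caller's array stays reachable and cannot be
collected, so a chain of retained buffers would hold memory until drained.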



--
Todd Lipcon
Software Engineer, Cloudera

The information and any attached documents contained in this message
may be confidential and/or legally privileged.  The message is
intended solely for the addressee(s).  If you are not the intended
recipient, you are hereby notified that any use, dissemination, or
reproduction is strictly prohibited and may be unlawful.  If you are
not the intended recipient, please contact the sender immediately by
return e-mail and destroy all copies of the original message.
