[
https://issues.apache.org/jira/browse/PARQUET-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570968#comment-17570968
]
ASF GitHub Bot commented on PARQUET-2126:
-----------------------------------------
theosib-amazon commented on PR #959:
URL: https://github.com/apache/parquet-mr/pull/959#issuecomment-1194259084
I did some poking around. It looks like calling release() on a codec
(a) resets the codec (freeing its resources, I think) and (b) returns it to a
pool of codecs without actually destroying it.
Later, when release() is called on the factory, it just calls release()
again on each of the codecs, returning them to the pool. The only other effect
is that references are removed from a container in the factory.
The only remaining question is what happens if release() is called twice on a
codec. It looks like nothing happens, because CodecPool.payback() returns
false when the codec is already in the pool. Moreover, I'm pretty sure the
original implementation already behaved this way.
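The double-release safety described above can be sketched with a set-backed pool. This is a hypothetical illustration (CodecPoolSketch and payback are stand-in names, not the real Hadoop CodecPool internals): adding a codec that is already pooled returns false, so a second release() becomes a no-op.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of why releasing a codec twice is harmless:
// payback() refuses to re-add a codec that is already in the pool.
class CodecPoolSketch {
    private final Set<Object> pool = ConcurrentHashMap.newKeySet();

    // Returns true on the first payback, false if the codec is
    // already pooled (i.e. it was released earlier).
    boolean payback(Object codec) {
        return pool.add(codec);
    }
}
```

With this shape, a thread releasing its codec early and the factory releasing everything later cannot double-pool the same instance.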
So I think the solution is to literally do nothing. The new usage pattern is
now:
- Create Codec factory
- Create worker threads
- Threads create codecs
- Threads finish using codecs
- Threads *optionally* call release() on their codecs if they want to free
resources right away.
- Threads terminate
- The thread that created the worker threads waits until those threads are
done
- release() is called on the factory, cleaning up any codecs that were not
released already
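The lifecycle above can be sketched as follows. This is a hypothetical illustration, not the real parquet-mr classes: CodecSketch and CodecFactorySketch stand in for a codec and CodecFactory, with the factory tracking every codec it hands out and cleaning up any that workers did not release themselves.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Stand-in for a codec whose release() is idempotent.
class CodecSketch {
    private boolean released = false;

    synchronized void release() { released = true; }  // second call is a no-op
    synchronized boolean isReleased() { return released; }
}

// Stand-in for the factory: tracks created codecs so its own
// release() can clean up anything the worker threads left behind.
class CodecFactorySketch {
    private final Queue<CodecSketch> created = new ConcurrentLinkedQueue<>();

    CodecSketch getCodec() {
        CodecSketch c = new CodecSketch();
        created.add(c);  // remember every codec handed out
        return c;
    }

    void release() {
        for (CodecSketch c : created) c.release();  // safe even if already released
        created.clear();
    }
}
```

A worker thread may call release() on its codec as soon as it is done; the creating thread then waits for the workers and calls release() on the factory, which harmlessly re-releases anything already returned.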
> Thread safety bug in CodecFactory
> ---------------------------------
>
> Key: PARQUET-2126
> URL: https://issues.apache.org/jira/browse/PARQUET-2126
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.2
> Reporter: James Turton
> Priority: Major
>
> The code for returning Compressor objects to the caller goes to some lengths
> to achieve thread safety, including keeping Codec objects in an Apache
> Commons pool that has thread-safe borrow semantics. This is all undone by
> the BytesCompressor and BytesDecompressor Maps in
> org.apache.parquet.hadoop.CodecFactory which end up caching single compressor
> and decompressor instances due to code in CodecFactory@getCompressor and
> CodecFactory@getDecompressor. When the caller runs multiple threads, those
> threads end up sharing compressor and decompressor instances.
> For compressors based on Xerial Snappy this bug has no effect because that
> library is itself thread safe. But when BuiltInGzipCompressor from Hadoop is
> selected for the CompressionCodecName.GZIP case, serious problems ensue.
> That class is not thread safe and sharing one instance of it between threads
> produces both silent data corruption and JVM crashes.
> To fix this situation, parquet-mr should stop caching single compressor and
> decompressor instances.
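The caching pattern the issue describes can be sketched as follows. This is a hypothetical illustration (CachingFactorySketch is a stand-in, not the actual CodecFactory code): a per-codec-name map means every caller, on any thread, receives the same compressor instance, which is exactly what breaks non-thread-safe codecs like BuiltInGzipCompressor.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the problematic caching: one compressor per codec name,
// shared by every thread that asks for that codec.
class CachingFactorySketch {
    private final Map<String, Object> compressors = new ConcurrentHashMap<>();

    Object getCompressor(String codecName) {
        // Caches the first instance and returns it to all callers forever.
        return compressors.computeIfAbsent(codecName, n -> new Object());
    }
}
```

Because the map key is the codec name rather than the requesting thread, two threads compressing GZIP data concurrently would share one stateful compressor object.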
--
This message was sent by Atlassian Jira
(v8.20.10#820010)