[ https://issues.apache.org/jira/browse/PARQUET-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571518#comment-17571518 ]

ASF GitHub Bot commented on PARQUET-2126:
-----------------------------------------

theosib-amazon commented on PR #959:
URL: https://github.com/apache/parquet-mr/pull/959#issuecomment-1195676003

   I just thought of something about this PR that makes me nervous and 
requires further investigation. Consider the following scenario:
   - Thread A allocates a codec
   - Thread A releases the codec, which puts it into a global pool of codecs
   - Thread B allocates the same kind of codec, which comes from that same pool
   - Thread A allocates that same kind of codec again, but it gets it from the 
factory's map instead of the pool
   
   I'm concerned that this could result in the same codec being given to both 
threads at the same time. The solution would be to remove the codec from the 
factory's map when release() is called on the codec itself. 
   Note that this problem is not introduced by this PR, since the double 
pooling existed before. The irony is that the pool is thread-safe, while the 
factory was not.
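   A minimal sketch of the proposed fix, using hypothetical Codec, CodecPool, 
and Factory classes rather than the real parquet-mr API: release() evicts the 
instance from the factory's map before handing it back to the shared pool, so 
a later get() on the same factory can never return an instance another thread 
has already borrowed.

   ```java
   import java.util.Deque;
   import java.util.ArrayDeque;
   import java.util.Iterator;
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;

   class Codec {
       final String kind;
       Codec(String kind) { this.kind = kind; }
   }

   // Thread-safe global pool, standing in for the Apache Commons pool.
   class CodecPool {
       private final Deque<Codec> pool = new ArrayDeque<>();
       synchronized Codec borrow(String kind) {
           Iterator<Codec> it = pool.iterator();
           while (it.hasNext()) {
               Codec c = it.next();
               if (c.kind.equals(kind)) { it.remove(); return c; }
           }
           return new Codec(kind);  // pool miss: construct a fresh codec
       }
       synchronized void giveBack(Codec c) { pool.push(c); }
   }

   // Per-caller factory with its own map, standing in for CodecFactory.
   class Factory {
       private final CodecPool pool;
       private final Map<String, Codec> cache = new ConcurrentHashMap<>();
       Factory(CodecPool pool) { this.pool = pool; }

       Codec get(String kind) {
           return cache.computeIfAbsent(kind, pool::borrow);
       }

       void release(Codec c) {
           // The fix: evict from this factory's map BEFORE returning the
           // codec to the pool. Without this line, a subsequent get() here
           // would hand out the cached instance while another thread holds
           // the same instance borrowed from the pool.
           cache.remove(c.kind, c);
           pool.giveBack(c);
       }
   }
   ```

   With the eviction in place, replaying the scenario above (A allocates, A 
releases, B allocates, A allocates again) gives A a distinct instance on the 
second allocation instead of the one B is holding.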




> Thread safety bug in CodecFactory
> ---------------------------------
>
>                 Key: PARQUET-2126
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2126
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: James Turton
>            Priority: Major
>
> The code for returning Compressor objects to the caller goes to some lengths 
> to achieve thread safety, including keeping Codec objects in an Apache 
> Commons pool that has thread-safe borrow semantics.  This is all undone by 
> the BytesCompressor and BytesDecompressor Maps in 
> org.apache.parquet.hadoop.CodecFactory which end up caching single compressor 
> and decompressor instances due to code in CodecFactory@getCompressor and 
> CodecFactory@getDecompressor.  When the caller runs multiple threads, those 
> threads end up sharing compressor and decompressor instances.
> For compressors based on Xerial Snappy this bug has no effect because that 
> library is itself thread safe.  But when BuiltInGzipCompressor from Hadoop is 
> selected for the CompressionCodecName.GZIP case, serious problems ensue.  
> That class is not thread safe and sharing one instance of it between threads 
> produces both silent data corruption and JVM crashes.
> To fix this situation, parquet-mr should stop caching single compressor and 
> decompressor instances.
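
A minimal sketch of the fix direction described above, with hypothetical class
names (not the parquet-mr API): the factory constructs a fresh compressor per
getCompressor call instead of caching one instance per codec name, so threads
never share a stateful, non-thread-safe compressor.

```java
import java.util.Map;
import java.util.function.Supplier;

// Stateful, non-thread-safe compressor (like Hadoop's BuiltInGzipCompressor).
class Compressor { }

class SafeFactory {
    // Constructors per codec name; only "gzip" is sketched here.
    private final Map<String, Supplier<Compressor>> constructors =
        Map.of("gzip", Compressor::new);

    // No instance cache: every caller gets its own compressor, so no two
    // threads can end up mutating the same object's internal state.
    Compressor getCompressor(String name) {
        return constructors.get(name).get();
    }
}
```

In the real fix one would likely borrow from and return to a thread-safe pool
rather than construct unconditionally, but the essential point is the absence
of a shared single-instance cache.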



--
This message was sent by Atlassian Jira
(v8.20.10#820010)