[ 
https://issues.apache.org/jira/browse/PARQUET-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537636#comment-17537636
 ] 

ASF GitHub Bot commented on PARQUET-2126:
-----------------------------------------

theosib-amazon commented on PR #959:
URL: https://github.com/apache/parquet-mr/pull/959#issuecomment-1127885617

   > My question is: when a thread exits, we don't have a corresponding evict 
operation on the map. Using a thread pool might be OK if the thread object is 
not changed, but I'm not sure whether there is a scenario where threads are 
created/exited quickly and we leak in that case.
   
   No matter which thread release() is called from, it will clean up all 
(de)compressors from all threads. I designed it specifically this way so that a 
leak won't happen, as long as close/release is called when it should be.
   
   Note that it's not appropriate to call close or release while 
(de)compression is still in progress. If someone does that, it might still 
work, but it would be a protocol violation. The usage pattern should be:
   
   - Create Codec factory
   - Create worker threads
   - Threads create codecs
   - Threads finish using codecs
   - Threads terminate
   - The thread that created the worker threads waits until those threads are 
done
   - close/release is called.
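   The lifecycle above can be sketched in plain JDK code. This is a hedged 
illustration, not parquet-mr's actual CodecFactory: it uses 
java.util.zip.Deflater (which, like Hadoop's BuiltInGzipCompressor, is not 
thread safe) as a stand-in codec, and the map, getCompressor(), and release() 
here are hypothetical analogues of the real implementation.

   ```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.Deflater;

public class CodecLifecycle {
    // Stand-in for the codec factory's cache: one Deflater per thread.
    static final Map<Thread, Deflater> codecs = new ConcurrentHashMap<>();

    static Deflater getCompressor() {
        // Each thread lazily gets its own instance; Deflater is not thread safe,
        // so instances must never be shared across threads.
        return codecs.computeIfAbsent(Thread.currentThread(), t -> new Deflater());
    }

    // release() tears down instances from *all* threads, no matter which
    // thread calls it -- mirroring the cleanup described in the comment above.
    static void release() {
        codecs.values().forEach(Deflater::end);
        codecs.clear();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // worker threads
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                Deflater d = getCompressor();       // thread creates its codec
                d.setInput("hello".getBytes());
                d.finish();
                d.deflate(new byte[64]);            // thread uses its codec
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS); // wait for workers to finish
        release();                                   // only now is cleanup safe
        System.out.println("codecs remaining: " + codecs.size());
    }
}
   ```

   Calling release() before awaitTermination() returns would be exactly the 
protocol violation described above: a worker could still be deflating through 
a Deflater that end() has already destroyed.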
   
   Someone might do something different, but that would be a bug no different 
from someone closing a file in one thread while it's being written to in 
another.




> Thread safety bug in CodecFactory
> ---------------------------------
>
>                 Key: PARQUET-2126
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2126
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: James Turton
>            Priority: Major
>
> The code for returning Compressor objects to the caller goes to some lengths 
> to achieve thread safety, including keeping Codec objects in an Apache 
> Commons pool that has thread-safe borrow semantics.  This is all undone by 
> the BytesCompressor and BytesDecompressor Maps in 
> org.apache.parquet.hadoop.CodecFactory which end up caching single compressor 
> and decompressor instances due to code in CodecFactory@getCompressor and 
> CodecFactory@getDecompressor.  When the caller runs multiple threads, those 
> threads end up sharing compressor and decompressor instances.
> For compressors based on Xerial Snappy this bug has no effect because that 
> library is itself thread safe.  But when BuiltInGzipCompressor from Hadoop is 
> selected for the CompressionCodecName.GZIP case, serious problems ensue.  
> That class is not thread safe and sharing one instance of it between threads 
> produces both silent data corruption and JVM crashes.
> To fix this situation, parquet-mr should stop caching single compressor and 
> decompressor instances.
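One way to stop sharing single instances between threads is per-thread codecs. 
The sketch below is an assumption-laden illustration of that idea using a 
ThreadLocal and the JDK's Deflater, not the actual parquet-mr fix:

```java
import java.util.zip.Deflater;

public class PerThreadCodec {
    // Sketch of the fix: each thread lazily gets its own Deflater, so a
    // non-thread-safe instance is never shared across threads.
    static final ThreadLocal<Deflater> COMPRESSOR =
            ThreadLocal.withInitial(Deflater::new);

    public static void main(String[] args) throws Exception {
        Runnable work = () -> {
            Deflater d = COMPRESSOR.get();   // this thread's private instance
            d.setInput("payload".getBytes());
            d.finish();
            int n = d.deflate(new byte[128]);
            System.out.println(Thread.currentThread().getName()
                    + " wrote " + n + " bytes");
        };
        Thread t1 = new Thread(work, "w1");
        Thread t2 = new Thread(work, "w2");
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```

With a shared Deflater the two threads would race on its internal state; with 
ThreadLocal each thread compresses independently. (A real fix would also need 
the release()-style cleanup discussed in the comment above.)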



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
