James Turton created PARQUET-2126:
-------------------------------------

             Summary: Thread safety bug in CodecFactory
                 Key: PARQUET-2126
                 URL: https://issues.apache.org/jira/browse/PARQUET-2126
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.12.2
            Reporter: James Turton


The code that returns Compressor objects to the caller goes to some lengths to
achieve thread safety, including keeping Codec objects in an Apache Commons
pool that has thread-safe borrow semantics.  This is all undone by the
BytesCompressor and BytesDecompressor Maps in
org.apache.parquet.hadoop.CodecFactory, which end up caching a single
compressor and a single decompressor instance per codec because of the code in
CodecFactory#getCompressor and CodecFactory#getDecompressor.  When the caller
runs multiple threads, those threads end up sharing compressor and
decompressor instances.
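
The caching pattern described above can be illustrated with a minimal,
self-contained sketch.  It is not the parquet-mr source: SharedCodecSketch,
StatefulCompressor and CachingFactory are hypothetical stand-ins, and the
String key is an assumption made only to keep the example free of
dependencies.  It shows that a factory backed by a Map hands the same
instance to every caller, even when the lookup itself is synchronized.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this is not the parquet-mr source.
class SharedCodecSketch {

    /** Stand-in for a compressor that holds mutable, unsynchronized state. */
    static class StatefulCompressor {
        final byte[] scratch = new byte[64];
    }

    /** Caches one compressor per codec name and hands it to every caller. */
    static class CachingFactory {
        private final Map<String, StatefulCompressor> compressors = new HashMap<>();

        // The lookup is synchronized, but the returned object is still shared.
        synchronized StatefulCompressor getCompressor(String codecName) {
            StatefulCompressor c = compressors.get(codecName);
            if (c == null) {
                c = new StatefulCompressor();
                compressors.put(codecName, c);
            }
            return c;
        }
    }

    /** Returns true when two lookups for the same codec yield one object. */
    static boolean sharesInstance() {
        CachingFactory factory = new CachingFactory();
        StatefulCompressor first = factory.getCompressor("GZIP");
        StatefulCompressor second = factory.getCompressor("GZIP");
        return first == second;
    }

    public static void main(String[] args) {
        System.out.println("same instance returned twice: " + sharesInstance());
    }
}
```

Any two threads that call getCompressor on such a factory receive the same
object, so the safety of the whole arrangement depends entirely on whether
that object tolerates concurrent use.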

For compressors based on Xerial Snappy this bug has no effect because that 
library is itself thread safe.  But when BuiltInGzipCompressor from Hadoop is 
selected for the CompressionCodecName.GZIP case, serious problems ensue.  That 
class is not thread safe and sharing one instance of it between threads 
produces both silent data corruption and JVM crashes.

To fix this situation, parquet-mr should stop caching single compressor and 
decompressor instances.
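
One possible shape for such a change is sketched below, again with
hypothetical names (PerCallCodecSketch, NonCachingFactory) rather than the
real parquet-mr classes.  It assumes the simplest option, creating a fresh
instance on every call; it is not necessarily the fix that parquet-mr would
adopt, and a per-thread cache or a pool with borrow and return semantics
would be alternatives.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical names throughout; this is not the parquet-mr source.
class PerCallCodecSketch {

    /** Stand-in for a compressor that is not safe to share between threads. */
    static class StatefulCompressor {
        final byte[] scratch = new byte[64];
    }

    /** Keeps no instance cache, so every caller gets its own compressor. */
    static class NonCachingFactory {
        StatefulCompressor getCompressor(String codecName) {
            return new StatefulCompressor();
        }
    }

    /** Returns true when two worker threads receive different objects. */
    static boolean threadsGetDistinctInstances() {
        NonCachingFactory factory = new NonCachingFactory();
        Callable<StatefulCompressor> task = () -> factory.getCompressor("GZIP");
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<StatefulCompressor> a = pool.submit(task);
            Future<StatefulCompressor> b = pool.submit(task);
            return a.get() != b.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("distinct instances: " + threadsGetDistinctInstances());
    }
}
```

The trade-off in this sketch is allocation cost: without a cache, each call
pays for a new compressor, which is why a thread-safe pool or per-thread
reuse may be preferable in practice.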



--
This message was sent by Atlassian Jira
(v8.20.1#820001)