[ https://issues.apache.org/jira/browse/PARQUET-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526453#comment-17526453 ]
ASF GitHub Bot commented on PARQUET-2126: ----------------------------------------- dossett commented on PR #959: URL: https://github.com/apache/parquet-mr/pull/959#issuecomment-1106556192 Seems good to me (non-binding!). Revisiting whether or not the caching strategy make sense might be worthwhile, but that shouldn't stop this fix. Small comment: I would remove most of the references to the JIRA ticket as well as descriptions of the old behavior. I think the comment that describes the new behavior and why it might be unintuitive with a reference to the JIRA makes sense though. I'll defer to others in the project (again, I'm not a committer) if there are existing standards for this though. > Thread safety bug in CodecFactory > --------------------------------- > > Key: PARQUET-2126 > URL: https://issues.apache.org/jira/browse/PARQUET-2126 > Project: Parquet > Issue Type: Bug > Components: parquet-mr > Affects Versions: 1.12.2 > Reporter: James Turton > Priority: Major > > The code for returning Compressor objects to the caller goes to some lengths > to achieve thread safety, including keeping Codec objects in an Apache > Commons pool that has thread-safe borrow semantics. This is all undone by > the BytesCompressor and BytesDecompressor Maps in > org.apache.parquet.hadoop.CodecFactory which end up caching single compressor > and decompressor instances due to code in CodecFactory@getCompressor and > CodecFactory@getDecompressor. When the caller runs multiple threads, those > threads end up sharing compressor and decompressor instances. > For compressors based on Xerial Snappy this bug has no effect because that > library is itself thread safe. But when BuiltInGzipCompressor from Hadoop is > selected for the CompressionCodecName.GZIP case, serious problems ensue. > That class is not thread safe and sharing one instance of it between threads > produces both silent data corruption and JVM crashes. > To fix this situation, parquet-mr should stop caching single compressor and > decompressor instances. -- This message was sent by Atlassian Jira (v8.20.7#820007)