[
https://issues.apache.org/jira/browse/PARQUET-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552295#comment-17552295
]
ASF GitHub Bot commented on PARQUET-2126:
-----------------------------------------
shangxinli commented on PR #959:
URL: https://github.com/apache/parquet-mr/pull/959#issuecomment-1151340364
Thanks for addressing the feedback! What I meant was that ideally when
'Threads terminate' happens, it should clean up the compressor/decompressor
immediately. I understand we won't leak in the end of 'close/release is called'
though.
> Seems good to me (non-binding!). Revisiting whether or not the caching
strategy make sense might be worthwhile, but that shouldn't stop this fix.
>
> Small comment: I would remove most of the references to the JIRA ticket as
well as descriptions of the old behavior. I think the comment that describes
the new behavior and why it might be unintuitive with a reference to the JIRA
makes sense though. I'll defer to others in the project (again, I'm not a
committer) if there are existing standards for this though.
@dossett, we don't have a standard like that. It seems OK to have. What do
you think?
> Thread safety bug in CodecFactory
> ---------------------------------
>
> Key: PARQUET-2126
> URL: https://issues.apache.org/jira/browse/PARQUET-2126
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.2
> Reporter: James Turton
> Priority: Major
>
> The code for returning Compressor objects to the caller goes to some lengths
> to achieve thread safety, including keeping Codec objects in an Apache
> Commons pool that has thread-safe borrow semantics. This is all undone by
> the BytesCompressor and BytesDecompressor Maps in
> org.apache.parquet.hadoop.CodecFactory which end up caching single compressor
> and decompressor instances due to code in CodecFactory@getCompressor and
> CodecFactory@getDecompressor. When the caller runs multiple threads, those
> threads end up sharing compressor and decompressor instances.
> For compressors based on Xerial Snappy this bug has no effect because that
> library is itself thread safe. But when BuiltInGzipCompressor from Hadoop is
> selected for the CompressionCodecName.GZIP case, serious problems ensue.
> That class is not thread safe and sharing one instance of it between threads
> produces both silent data corruption and JVM crashes.
> To fix this situation, parquet-mr should stop caching single compressor and
> decompressor instances.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)