[ 
https://issues.apache.org/jira/browse/PARQUET-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611463#comment-17611463
 ] 

ASF GitHub Bot commented on PARQUET-2126:
-----------------------------------------

ggershinsky commented on code in PR #959:
URL: https://github.com/apache/parquet-mr/pull/959#discussion_r984394753


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java:
##########
@@ -244,16 +272,60 @@ protected CompressionCodec getCodec(CompressionCodecName 
codecName) {
     }
   }
 
+  /**
+   * Modified for https://issues.apache.org/jira/browse/PARQUET-2126
+   * This releases all cached instances of all compressors and decompressors 
created by all threads that share
+   * this CodeFactory instance.
+   * Note: A problem might occur if release() were called while some codec 
instances were still in use, but it
+   * would not make sense to call close() or release() on a shared 
CodecFactory while some threads are still
+   * actively using it. The usage pattern should be:

Review Comment:
   It would be also good to have a look at an example of such callers, eg in 
Apache Spark, to start analyzing the implications.



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java:
##########
@@ -244,16 +272,60 @@ protected CompressionCodec getCodec(CompressionCodecName 
codecName) {
     }
   }
 
+  /**
+   * Modified for https://issues.apache.org/jira/browse/PARQUET-2126
+   * This releases all cached instances of all compressors and decompressors 
created by all threads that share
+   * this CodeFactory instance.
+   * Note: A problem might occur if release() were called while some codec 
instances were still in use, but it
+   * would not make sense to call close() or release() on a shared 
CodecFactory while some threads are still
+   * actively using it. The usage pattern should be:

Review Comment:
   regarding the code parts that call this release() method. For those parts 
inside the parquet-mr codebase, can  some of them implement/enforce this 
pattern fully internally? I'd guess most/all of them eventually pass this 
responsibility to the "app" code above parquet-mr API; so probably this pattern 
documentation should be moved/copied/referenced in these APIs. 





> Thread safety bug in CodecFactory
> ---------------------------------
>
>                 Key: PARQUET-2126
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2126
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: James Turton
>            Priority: Major
>
> The code for returning Compressor objects to the caller goes to some lengths 
> to achieve thread safety, including keeping Codec objects in an Apache 
> Commons pool that has thread-safe borrow semantics.  This is all undone by 
> the BytesCompressor and BytesDecompressor Maps in 
> org.apache.parquet.hadoop.CodecFactory which end up caching single compressor 
> and decompressor instances due to code in CodecFactory@getCompressor and 
> CodecFactory@getDecompressor.  When the caller runs multiple threads, those 
> threads end up sharing compressor and decompressor instances.
> For compressors based on Xerial Snappy this bug has no effect because that 
> library is itself thread safe.  But when BuiltInGzipCompressor from Hadoop is 
> selected for the CompressionCodecName.GZIP case, serious problems ensue.  
> That class is not thread safe and sharing one instance of it between threads 
> produces both silent data corruption and JVM crashes.
> To fix this situation, parquet-mr should stop caching single compressor and 
> decompressor instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to