[jira] [Assigned] (NIFI-9436) PutParquet, PutORC processors may hang after writing to ADLS

Tamas Palfy (Jira) Thu, 02 Dec 2021 06:14:09 -0800


     [ 
https://issues.apache.org/jira/browse/NIFI-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tamas Palfy reassigned NIFI-9436:
---------------------------------

    Assignee: Tamas Palfy

> PutParquet, PutORC processors may hang after writing to ADLS
> ------------------------------------------------------------
>
>                 Key: NIFI-9436
>                 URL: https://issues.apache.org/jira/browse/NIFI-9436
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Tamas Palfy
>            Assignee: Tamas Palfy
>            Priority: Major
>
> h2. Background
> In *AbstractPutHDFSRecord* (from which *PutParquet* and *PutORC* are 
> derived) an *org.apache.parquet.hadoop.ParquetWriter* writer is _created_, 
> used to write records to an HDFS location, and later _closed_ explicitly.
> The writer creation process involves the instantiation of an 
> *org.apache.hadoop.fs.FileSystem* object, which, when writing to ADLS, is 
> an *org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem*.
> Note that the NiFi AbstractPutHDFSRecord processor has already created a 
> FileSystem object of the same type for its own purposes, but the writer 
> creates its own.
> The writer only uses the FileSystem object to create an 
> *org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream* object and doesn't 
> keep the FileSystem object itself.
> This makes the FileSystem object eligible for garbage collection.
> The AbfsOutputStream writes data asynchronously: it submits each write task 
> to an ExecutorService and stores the resulting task in a collection.
> h2. The issue
> * The _AzureBlobFileSystem_ and the _AbfsOutputStream_ both hold a reference 
> to the same _ThreadPoolExecutor_ object.
> * _AzureBlobFileSystem_ (probably depending on the version) overrides the 
> _finalize()_ method and closes itself when that is called. This involves 
> shutting down the referenced _ThreadPoolExecutor_.
> * It's possible for garbage collection to occur after the _ParquetWriter_ is 
> created but before explicitly closing it. GC -> 
> _AzureBlobFileSystem.finalize()_ -> _ThreadPoolExecutor.shutdown()_.
> * When the _ParquetWriter_ is explicitly closed it tries to run a cleanup job 
> using the _ThreadPoolExecutor_. That job submission fails as the 
> _ThreadPoolExecutor_ is already terminated, but a _Future_ object is still 
> created and is then waited on indefinitely.
> This causes the processor to hang.
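
The last step of the sequence above can be reproduced with plain java.util.concurrent classes. The following is a minimal sketch, not code from hadoop-azure: the class name, the helper methods and the DiscardPolicy rejection handler are assumptions chosen only to show how a task submitted to an already shut down ThreadPoolExecutor can yield a Future that never completes. Which rejection behaviour hadoop-azure actually uses is not stated in this issue.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; none of these names come from NiFi or hadoop-azure.
class ShutdownExecutorHang {

    // Simulates "GC -> finalize() -> shutdown()" followed by a late submission.
    static Future<?> submitAfterShutdown() {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(),
                // Assumption: rejected tasks are dropped silently instead of throwing.
                new ThreadPoolExecutor.DiscardPolicy());
        executor.shutdown();
        // The task is rejected and never runs, yet submit() still returns a Future.
        return executor.submit(() -> { });
    }

    // Returns true if the future finishes within the timeout, false if it is still pending.
    static boolean completesWithin(Future<?> future, long millis) {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Future<?> future = submitAfterShutdown();
        // A get() without a timeout would block forever at this point.
        System.out.println("completed within 200 ms: " + completesWithin(future, 200));
    }
}
```

Running main prints "completed within 200 ms: false". With the default AbortPolicy the same submit() call would throw a RejectedExecutionException instead of hanging, so the hang depends on the rejection being swallowed somewhere along the way.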
> h2. The solution
> This feels like an issue that should be addressed in the _hadoop-azure_ 
> library but it's possible to apply a workaround in NiFi.
> The problem starts with the _AzureBlobFileSystem_ getting garbage collected. 
> If the _ParquetWriter_ used the same _FileSystem_ object that the processor 
> has already created for itself (and keeps a reference to), that object 
> could not be garbage collected while the writer is in use.
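
The idea behind the workaround can be sketched without any Hadoop dependency. In the example below, Fs, Writer and Processor are hypothetical stand-ins (not the NiFi or Hadoop classes), and the sketch only illustrates the reachability argument: an object stored in a field of a live owner stays strongly reachable, so it cannot be collected or finalized while the writer is working.

```java
import java.lang.ref.WeakReference;

// Hypothetical stand-ins; the real classes are FileSystem, ParquetWriter and the NiFi processor.
class SharedFileSystemSketch {

    // Stand-in for a FileSystem whose finalizer would shut down a shared executor.
    static final class Fs { }

    // Stand-in for the writer: it uses the Fs during construction but does not keep it.
    static final class Writer {
        Writer(Fs fs) { }
    }

    // Stand-in for the processor, which keeps its Fs in a field for its whole lifetime.
    static final class Processor {
        private final Fs fs = new Fs();

        // Workaround idea: hand the processor's own Fs to the writer instead of creating a new one.
        Writer createWriter() {
            return new Writer(fs);
        }

        Fs fileSystem() {
            return fs;
        }
    }

    // Returns true if the shared Fs is still the same live object after a GC request.
    static boolean fileSystemSurvivesGc(Processor processor) {
        Writer writer = processor.createWriter();
        WeakReference<Fs> ref = new WeakReference<>(processor.fileSystem());
        System.gc(); // only a hint, but a strongly reachable object is never collected
        // The processor is used after the GC request, so its Fs stays strongly reachable.
        return writer != null && ref.get() == processor.fileSystem();
    }

    public static void main(String[] args) {
        System.out.println("shared Fs survives GC: " + fileSystemSurvivesGc(new Processor()));
    }
}
```

The sketch deliberately does not try to show the opposite case (an unreferenced Fs being collected), because System.gc() gives no guarantee about when that happens, which is also why the hang described in this issue only occurs intermittently.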



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
