[
https://issues.apache.org/jira/browse/NIFI-13930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Turcsanyi updated NIFI-13930:
-----------------------------------
Status: Patch Available (was: Open)
> PutAzureDataLakeStorage does not cause Azure to emit a FlushWithClose event
> on file write
> -----------------------------------------------------------------------------------------
>
> Key: NIFI-13930
> URL: https://issues.apache.org/jira/browse/NIFI-13930
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.23.2
> Environment: amd64, Windows Server 2022, Java 21.0.3
> Reporter: Mark Ward
> Assignee: Peter Turcsanyi
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Hi,
> Note: I am raising this issue on Databricks' behalf; they have requested Jira
> access and are currently awaiting approval.
> We use NiFi to write files to an Azure Storage account, where our Databricks
> workspace can ingest the files using an Azure Queue and Databricks' File
> Notification feature, which then initiates a workflow/job.
> However, files are not being picked up in a timely manner by Databricks.
> We raised this with Databricks and they investigated, concluding that NiFi's
> behaviour when completing the file write, and the subsequent rename, does
> not emit an event type that would normally be expected.
> Please see Databricks' summary below for information:
> {quote} # Customer uses Apache Nifi 1.23.2, which performs the following
> operations via the Azure API
> ([src|https://github.com/apache/nifi/blob/rel/nifi-1.23.2/nifi-nar-bundles/nifi-azure-bundle/nifi-azure-processors/src/main/java/org/apache/nifi/processors/azure/storage/PutAzureDataLakeStorage.java#L149-L166])
> ## Create a temp file using the [Path
> Create|https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create?view=rest-storageservices-datalakestoragegen2-2019-12-12]
> API. In this case Azure emits a
> [BlobCreated|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobcreated-event-data-lake-storage-gen2-1]
> event with the {{api}} field set to {{CreateFile}} (which is purposefully
> not processed by CSMS -
> [source|https://github.com/databricks/universe/blob/fc4a34b61abe58fbe363cc55cdb67edd480f985e/jobs-cloud-storage-meta/src/cloud/azure/resourcemanagement/AqsEventGridResourceManagementClient.scala#L366]).
> ##
> [Append|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L656]
> content to the file. There is no file event for this operation.
> ##
> [Flush|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L930]
> the appended file content. NiFi’s implementation flushes the file
> without closing it
> ([source|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L935]),
> as a result Azure +doesn’t+ emit a {{FlushWithClose}} event (CSMS processes
> this event type).
> ##
> [Rename|https://learn.microsoft.com/en-us/dotnet/api/azure.storage.files.datalake.datalakefileclient.rename?view=azure-dotnet]
> the temp file to its final name. This generates a
> [BlobRenamed|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobrenamed-event-data-lake-storage-gen2-1]
> event that’s processed by CSMS (but ignored, see below).
> # The behaviour observed by the customer is explained by:
> ## When (1d) happens, CSMS gets a BlobRenamed event for a file whose
> source-file metadata it doesn’t have. The BlobRenamed event doesn’t contain
> all the information needed by CSMS (it lacks the {{etag}} and the
> {{blob size}}) and therefore [CSMS ignores the
> event|https://github.com/databricks/universe/blob/master/jobs-cloud-storage-meta/src/storage/StorageHelper.scala#L600]
> and no object is created.
> ## When CSMS performs the daily full scan, it finds the renamed file and
> creates an object for it in its database. This causes the file arrival
> trigger to find the file and trigger a (delayed) run.{quote}
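
The event sequence described in the quoted summary can be illustrated with a small toy model. This is only a sketch under stated assumptions: it calls neither the Azure SDK nor any Databricks code. The names Event, write_file and consume are hypothetical, the "RenameFile" api value and the choice to model FlushWithClose as a BlobCreated event are assumptions, and the rule that a rename is accepted once the source path is known is inferred from the summary rather than confirmed by it.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in for a storage event; not an Azure SDK type."""
    event_type: str   # "BlobCreated" or "BlobRenamed"
    api: str          # "CreateFile", "FlushWithClose" or "RenameFile" (assumed)
    path: str         # path the event refers to (destination for a rename)
    source: str = ""  # original path, set only for rename events


def write_file(temp_path, final_path, close_on_flush):
    """Model the create -> append -> flush -> rename sequence from the summary."""
    # Step a: creating the temp file emits BlobCreated with api=CreateFile.
    events = [Event("BlobCreated", "CreateFile", temp_path)]
    # Step b: appending content emits no event.
    # Step c: the flush emits an event only when the file is closed as well.
    if close_on_flush:
        events.append(Event("BlobCreated", "FlushWithClose", temp_path))
    # Step d: the rename emits BlobRenamed, which lacks etag and blob size.
    events.append(Event("BlobRenamed", "RenameFile", final_path, source=temp_path))
    return events


def consume(events):
    """Model of the consumer: return the set of paths it ends up knowing."""
    known = set()
    for event in events:
        if event.api == "CreateFile":
            continue  # purposefully not processed, per the summary
        if event.api == "FlushWithClose":
            known.add(event.path)  # processed: an object is recorded
        elif event.event_type == "BlobRenamed" and event.source in known:
            # Assumed rule: a rename is usable only if the source is known.
            known.remove(event.source)
            known.add(event.path)
        # A rename of an unknown source is ignored.
    return known


if __name__ == "__main__":
    print(consume(write_file("tmp/data.csv", "data.csv", close_on_flush=False)))
    print(consume(write_file("tmp/data.csv", "data.csv", close_on_flush=True)))
```

In this model, flushing without close leaves the consumer with no known file until a later full scan, while flushing with close lets the rename be tracked to the final path.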