[
https://issues.apache.org/jira/browse/NIFI-13930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892762#comment-17892762
]
Peter Turcsanyi commented on NIFI-13930:
----------------------------------------
[~_mark_] Thanks for filing this issue! It makes sense to generate a "close"
event at the end of the file upload, and we can modify PutAzureDataLakeStorage
to do so. Please note that this event will be generated for the temp file, at
a time when the final (renamed) file is not yet available. If I understand
correctly, your target system will be able to handle this: it would use the
info from the FlushWithClose event (1c) once the RenameFile event (1d) arrives
and the file is available. Could you please confirm?
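
To make the question concrete, the correlation between the two events could
look roughly like the sketch below on the consuming side. This is only an
illustration under assumptions, not Databricks' actual implementation: the
names KnownFile, RenameCorrelator, onFlushWithClose and onRename are
hypothetical, and it assumes that the FlushWithClose event carries the temp
file's path, etag and size, that the rename event carries the source and
destination paths, and that the metadata recorded at close still describes
the file after the rename.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** File metadata as the consumer would store it (illustrative type). */
final class KnownFile {
    final String path;
    final String etag;
    final long size;

    KnownFile(String path, String etag, long size) {
        this.path = path;
        this.etag = etag;
        this.size = size;
    }
}

/** Hypothetical consumer-side correlator; an assumption, not real CSMS code. */
final class RenameCorrelator {
    // Metadata taken from FlushWithClose events, keyed by the temp file path.
    private final Map<String, KnownFile> pendingByTempPath = new HashMap<>();

    /** (1c) Remember etag and size reported for the temp file. */
    void onFlushWithClose(String tempPath, String etag, long size) {
        pendingByTempPath.put(tempPath, new KnownFile(tempPath, etag, size));
    }

    /** (1d) Register the final file only if the source path was seen in (1c). */
    Optional<KnownFile> onRename(String sourcePath, String destinationPath) {
        KnownFile temp = pendingByTempPath.remove(sourcePath);
        if (temp == null) {
            // No etag/size known for the source: the rename cannot be resolved.
            return Optional.empty();
        }
        return Optional.of(new KnownFile(destinationPath, temp.etag, temp.size));
    }
}
```

With this shape, a rename that arrives without a preceding FlushWithClose
yields an empty result, while a rename that follows one yields the final path
together with the etag and size recorded for the temp file.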
> PutAzureDataLakeStorage does not cause Azure to emit a FlushWithClose event
> on file write
> -----------------------------------------------------------------------------------------
>
> Key: NIFI-13930
> URL: https://issues.apache.org/jira/browse/NIFI-13930
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.23.2
> Environment: amd64, Windows Server 2022, Java 21.0.3
> Reporter: Mark Ward
> Assignee: Peter Turcsanyi
> Priority: Minor
>
> Hi,
> Note: I am raising this issue on Databricks' behalf; they've requested Jira
> access and are currently awaiting approval.
> We use NiFi to write files to an Azure Storage account where our Databricks
> workspace can ingest the files using an Azure Queue and Databricks' File
> Notification feature, which then initiates a workflow/job.
> However, files are not being picked up in a timely manner by Databricks.
> We raised this with Databricks, and they investigated and concluded that
> NiFi's behaviour when completing the file write, and the subsequent rename,
> does not emit an event type that would normally be expected.
> Please see Databricks' summary below for information:
> {quote} # Customer uses Apache NiFi 1.23.2, which performs the following
> operations via the Azure API
> ([src|https://github.com/apache/nifi/blob/rel/nifi-1.23.2/nifi-nar-bundles/nifi-azure-bundle/nifi-azure-processors/src/main/java/org/apache/nifi/processors/azure/storage/PutAzureDataLakeStorage.java#L149-L166])
> ## Create a temp file using the [Path
> Create|https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create?view=rest-storageservices-datalakestoragegen2-2019-12-12]
> API. In this case Azure emits a
> [BlobCreated|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobcreated-event-data-lake-storage-gen2-1]
> event with the {{api}} field set to {{CreateFile}} (which is purposefully
> not processed by CSMS -
> [source|https://github.com/databricks/universe/blob/fc4a34b61abe58fbe363cc55cdb67edd480f985e/jobs-cloud-storage-meta/src/cloud/azure/resourcemanagement/AqsEventGridResourceManagementClient.scala#L366]).
> ##
> [Append|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L656]
> content to the file. There is no file event for this operation.
> ##
> [Flush|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L930]
> the appended file content. NiFi’s implementation flushes the file
> without closing it
> ([source|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L935]),
> as a result Azure +doesn’t+ emit a {{FlushWithClose}} event (CSMS processes
> this event type).
> ##
> [Rename|https://learn.microsoft.com/en-us/dotnet/api/azure.storage.files.datalake.datalakefileclient.rename?view=azure-dotnet]
> the temp file to its final name, this generates a
> [BlobRenamed|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobrenamed-event-data-lake-storage-gen2-1]
> that’s processed by CSMS (but ignored, see below).
> # The behaviour observed by the customer is explained by:
> ## When (1d) happens, CSMS gets the BlobRenamed event for a file whose
> source-file metadata it doesn’t know. The BlobRenamed event doesn’t contain
> all the information CSMS needs (it lacks the {{etag}} and the
> {{blob size}}) and therefore [CSMS ignores the
> event|https://github.com/databricks/universe/blob/master/jobs-cloud-storage-meta/src/storage/StorageHelper.scala#L600]
> and no object is created.
> ## When CSMS performs the daily full scan, it finds the renamed file and
> creates an object for it in its database. This causes the file arrival
> trigger to find the file and trigger a (delayed) run.{quote}
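
The event sequence in the summary above can be sketched as a small model.
This only illustrates steps 1a to 1d as the summary describes them; it does
not call the Azure SDK, and the class name UploadEventModel and the
closeOnFlush parameter are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the events described in the summary (illustrative only). */
final class UploadEventModel {
    /** Returns the api values of the events emitted by steps 1a to 1d, in order. */
    static List<String> emittedApis(boolean closeOnFlush) {
        List<String> apis = new ArrayList<>();
        apis.add("CreateFile");          // 1a: BlobCreated for the temp file
        // 1b: appending content emits no file event
        if (closeOnFlush) {
            apis.add("FlushWithClose");  // 1c: emitted only when the flush closes the file
        }
        apis.add("RenameFile");          // 1d: BlobRenamed for the final name
        return apis;
    }
}
```

With closeOnFlush set to false, which matches the 1.23.2 behaviour described
in the summary, the sequence contains no FlushWithClose entry, so a consumer
that relies on that event only learns about the file from a later full scan.
With closeOnFlush set to true, FlushWithClose appears between CreateFile and
RenameFile.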