[ 
https://issues.apache.org/jira/browse/NIFI-13930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Turcsanyi updated NIFI-13930:
-----------------------------------
    Status: Patch Available  (was: Open)

> PutAzureDataLakeStorage does not cause Azure to emit a FlushWithClose event 
> on file write
> -----------------------------------------------------------------------------------------
>
>                 Key: NIFI-13930
>                 URL: https://issues.apache.org/jira/browse/NIFI-13930
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.23.2
>         Environment: amd64, Windows Server 2022, Java 21.0.3
>            Reporter: Mark Ward
>            Assignee: Peter Turcsanyi
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
> Note: I am raising this issue on Databricks' behalf; they have requested Jira 
> access and are currently awaiting approval.
> We use NiFi to write files to an Azure Storage account. Our Databricks 
> workspace ingests these files using an Azure Queue and Databricks' File 
> Notification feature, which then initiates a workflow/job.
> However, files are not being picked up in a timely manner by Databricks.
> We raised this with Databricks. Their investigation concluded that the way 
> NiFi completes the file write and the subsequent rename does not cause Azure 
> to emit an event type that would normally be expected.
> Please see Databricks' summary below for information:
> {quote} # Customer uses Apache NiFi 1.23.2, which performs the following 
> operations via the Azure API 
> ([src|https://github.com/apache/nifi/blob/rel/nifi-1.23.2/nifi-nar-bundles/nifi-azure-bundle/nifi-azure-processors/src/main/java/org/apache/nifi/processors/azure/storage/PutAzureDataLakeStorage.java#L149-L166])
>  ## Create a temp file using the [Path 
> Create|https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create?view=rest-storageservices-datalakestoragegen2-2019-12-12]
>  API. In this case Azure emits a 
> [BlobCreated|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobcreated-event-data-lake-storage-gen2-1]
>  event with the {{api}} field set to {{CreateFile}} (which is purposefully 
> not processed by CSMS - 
> [source|https://github.com/databricks/universe/blob/fc4a34b61abe58fbe363cc55cdb67edd480f985e/jobs-cloud-storage-meta/src/cloud/azure/resourcemanagement/AqsEventGridResourceManagementClient.scala#L366]).
>  ## 
> [Append|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L656]
>  content to the file. There is no file event for this operation.
>  ## 
> [Flush|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L930]
>  the appended file content. NiFi’s implementation flushes the file 
> without closing it 
> ([source|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L935]),
>  as a result Azure +doesn’t+ emit a {{FlushWithClose}} event (CSMS processes 
> this event type).
>  ## 
> [Rename|https://learn.microsoft.com/en-us/dotnet/api/azure.storage.files.datalake.datalakefileclient.rename?view=azure-dotnet]
>  the temp file to its final name. This generates a 
> [BlobRenamed|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobrenamed-event-data-lake-storage-gen2-1]
>  event that’s processed by CSMS (but ignored, see below).
>  # The behaviour observed by the customer is explained by:
>  ## When (1d) happens, CSMS receives the BlobRenamed event for a file whose 
> source file metadata it does not have. The BlobRenamed event doesn’t 
> contain all the information CSMS needs (it lacks the {{etag}} and the 
> {{blob size}}), and therefore [CSMS ignores the 
> event|https://github.com/databricks/universe/blob/master/jobs-cloud-storage-meta/src/storage/StorageHelper.scala#L600]
>  and no object is created.
>  ## When CSMS performs the daily full scan, it finds the renamed file and 
> creates an object for it in its database. This causes the file arrival 
> trigger to find the file and trigger a (delayed) run.{quote}
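
The sequence in the summary above can be sketched as a toy model. The Java below does not call Azure or NiFi; it only replays the event types the summary describes (CreateFile, FlushWithClose, RenameFile) through a hypothetical consumer that mimics the CSMS behaviour as reported. The class name, method names and sample paths are illustrative assumptions, and closeOnFlush merely stands in for the close flag of the Data Lake flush operation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: no Azure SDK or NiFi code is involved.
class FlushCloseSketch {

    /** Returns the events the summary says Azure emits, as {api, path[, newPath]}. */
    static List<String[]> writeFile(String tempPath, String finalPath, boolean closeOnFlush) {
        List<String[]> events = new ArrayList<>();
        // (1a) Path Create on the temp file: BlobCreated with api=CreateFile.
        events.add(new String[] {"CreateFile", tempPath});
        // (1b) Append: no event is emitted.
        // (1c) Flush: FlushWithClose is emitted only if the flush also closes the file.
        if (closeOnFlush) {
            events.add(new String[] {"FlushWithClose", tempPath});
        }
        // (1d) Rename temp file to its final name: BlobRenamed.
        events.add(new String[] {"RenameFile", tempPath, finalPath});
        return events;
    }

    /** Hypothetical consumer, modelled on the summary's description of CSMS. */
    static Set<String> consume(List<String[]> events) {
        Set<String> knownFiles = new HashSet<>();
        for (String[] event : events) {
            switch (event[0]) {
                case "CreateFile":
                    break; // purposefully not processed
                case "FlushWithClose":
                    knownFiles.add(event[1]); // metadata (etag, size) is now known
                    break;
                case "RenameFile":
                    // Ignored unless the source file is already known.
                    if (knownFiles.remove(event[1])) {
                        knownFiles.add(event[2]);
                    }
                    break;
                default:
                    break;
            }
        }
        return knownFiles;
    }

    public static void main(String[] args) {
        String temp = "_nifitempdirectory/file.csv"; // sample paths, not from the report
        String dest = "landing/file.csv";
        System.out.println("flush without close: " + consume(writeFile(temp, dest, false)));
        System.out.println("flush with close:    " + consume(writeFile(temp, dest, true)));
    }
}
```

In this model the file is only registered when the flush closes the file, which matches the delayed pickup described in the report: without FlushWithClose the consumer has nothing to attach the rename to until its daily full scan.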



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
