Mark Ward created NIFI-13930:
--------------------------------
Summary: PutAzureDataLakeStorage does not cause Azure to emit a
FlushWithClose event on file write
Key: NIFI-13930
URL: https://issues.apache.org/jira/browse/NIFI-13930
Project: Apache NiFi
Issue Type: Bug
Affects Versions: 1.23.2
Environment: amd64, Windows Server 2022, Java 21.0.3
Reporter: Mark Ward
Hi,
Note: I am raising this issue on Databricks' behalf; they have requested Jira
access and are currently awaiting approval.
We use NiFi to write files to an Azure Storage account, where our Databricks
workspace ingests the files using a Queue in Azure and Databricks' File
Notification feature.
However, files are not being picked up in a timely manner by Databricks.
We raised this with Databricks, who investigated and concluded that NiFi's
behaviour when completing the file write and the subsequent rename does not
emit an event type that would normally be expected.
Please see Databricks' summary below for more information:
{quote} # Customer uses Apache Nifi 1.23.2, which performs the following
operations via the Azure API
([src|https://github.com/apache/nifi/blob/rel/nifi-1.23.2/nifi-nar-bundles/nifi-azure-bundle/nifi-azure-processors/src/main/java/org/apache/nifi/processors/azure/storage/PutAzureDataLakeStorage.java#L149-L166])
## Create a temp file using the [Path
Create|https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create?view=rest-storageservices-datalakestoragegen2-2019-12-12]
API. In this case Azure emits a
[BlobCreated|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobcreated-event-data-lake-storage-gen2-1]
event with the {{api}} field set to {{CreateFile}} (which is purposefully not
processed by CSMS -
[source|https://github.com/databricks/universe/blob/fc4a34b61abe58fbe363cc55cdb67edd480f985e/jobs-cloud-storage-meta/src/cloud/azure/resourcemanagement/AqsEventGridResourceManagementClient.scala#L366]).
##
[Append|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L656]
content to the file, there is no file event for this operation.
##
[Flush|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L930]
the appended content. NiFi’s implementation flushes the file
without closing it
([source|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L935]),
as a result Azure +doesn’t+ emit a {{FlushWithClose}} event (CSMS processes
this event type).
##
[Rename|https://learn.microsoft.com/en-us/dotnet/api/azure.storage.files.datalake.datalakefileclient.rename?view=azure-dotnet]
the temp file to its final name, this generates a
[BlobRenamed|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobrenamed-event-data-lake-storage-gen2-1]
event that’s processed by CSMS (but ignored, see below).
# The behaviour observed by the customer is explained by:
## When (1d) happens, CSMS gets the BlobRenamed event for a file for which it
doesn’t know the metadata about the source file. The BlobRenamed event doesn’t
contain all the information needed by CSMS (it misses the {{etag}} and the
{{{}blob size{}}}) and therefore [CSMS ignores the
event|https://github.com/databricks/universe/blob/master/jobs-cloud-storage-meta/src/storage/StorageHelper.scala#L600]
and no object is created.
## When CSMS performs the daily full scan, it finds the renamed file and
creates an object for it in its database. This causes the file arrival trigger
to find the file and trigger a (delayed) run.{quote}
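To make the sequence above easier to follow, here is a minimal Java sketch of the event flow as Databricks describes it. It is a toy model only: it makes no Azure or Databricks calls, and the class, enum and method names are hypothetical. The consumer rules are inferred from the summary (CreateFile is skipped, FlushWithClose supplies the file metadata, and BlobRenamed is only usable when the source file's metadata is already known), so please treat them as an assumption rather than the actual CSMS logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the event flow in the summary above; it makes no Azure or Databricks calls. */
final class FileNotificationSketch {

    /** Notifications named in the report, labelled by the value that distinguishes them. */
    enum Event { CREATE_FILE, FLUSH_WITH_CLOSE, BLOB_RENAMED }

    /** Events emitted for create -> append -> flush -> rename, as described in the report. */
    static List<Event> emittedEvents(boolean closeOnFlush) {
        List<Event> events = new ArrayList<>();
        events.add(Event.CREATE_FILE);          // Path Create on the temp file
        // Append emits no file event.
        if (closeOnFlush) {
            events.add(Event.FLUSH_WITH_CLOSE); // only emitted when the flush closes the file
        }
        events.add(Event.BLOB_RENAMED);         // rename of the temp file to its final name
        return events;
    }

    /**
     * Hypothetical consumer following the rules described for CSMS (an assumption
     * inferred from the report, not the real implementation).
     */
    static boolean fileDiscoveredFromEvents(List<Event> events) {
        boolean sourceMetadataKnown = false;
        boolean discovered = false;
        for (Event e : events) {
            switch (e) {
                case CREATE_FILE:
                    break;                            // purposefully not processed
                case FLUSH_WITH_CLOSE:
                    sourceMetadataKnown = true;       // etag and size become known
                    break;
                case BLOB_RENAMED:
                    discovered = sourceMetadataKnown; // ignored if the source is unknown
                    break;
            }
        }
        return discovered;
    }

    public static void main(String[] args) {
        System.out.println("flush without close: "
                + fileDiscoveredFromEvents(emittedEvents(false)));
        System.out.println("flush with close:    "
                + fileDiscoveredFromEvents(emittedEvents(true)));
    }
}
```

In this model the file is only discovered from events when the flush closes the file, which matches the delay we see (the file is otherwise found by the daily full scan). For reference, the Azure SDK for Java exposes DataLakeFileClient.flushWithResponse overloads that take a close flag; whether the processor should use one of them is for the NiFi maintainers to judge.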
--
This message was sent by Atlassian Jira
(v8.20.10#820010)