[
https://issues.apache.org/jira/browse/NIFI-13930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Ward updated NIFI-13930:
-----------------------------
Description:
Hi,
Note: I am raising this issue on Databricks' behalf; they have requested Jira
access and are currently awaiting approval.
We use NiFi to write files to an Azure Storage account, where our Databricks
workspace ingests the files using an Azure Queue and Databricks' File
Notification feature, which then initiates a workflow/job.
However, files are not being picked up in a timely manner by Databricks.
We raised this with Databricks and they investigated, concluding that NiFi's
behaviour when completing the file write, and the subsequent rename, does not
emit an event type that would normally be expected.
Please see Databricks' summary below for information:
{quote} # Customer uses Apache NiFi 1.23.2, which performs the following
operations via the Azure API
([src|https://github.com/apache/nifi/blob/rel/nifi-1.23.2/nifi-nar-bundles/nifi-azure-bundle/nifi-azure-processors/src/main/java/org/apache/nifi/processors/azure/storage/PutAzureDataLakeStorage.java#L149-L166])
## Create a temp file using the [Path
Create|https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create?view=rest-storageservices-datalakestoragegen2-2019-12-12]
API. In this case Azure emits a
[BlobCreated|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobcreated-event-data-lake-storage-gen2-1]
event with the {{api}} field set to {{CreateFile}} (which is purposefully not
processed by CSMS -
[source|https://github.com/databricks/universe/blob/fc4a34b61abe58fbe363cc55cdb67edd480f985e/jobs-cloud-storage-meta/src/cloud/azure/resourcemanagement/AqsEventGridResourceManagementClient.scala#L366]).
##
[Append|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L656]
content to the file, there is no file event for this operation.
##
[Flush|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L930]
the appended file content. NiFi's implementation flushes the file
without closing it
([source|https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/storage/azure-storage-file-datalake/src/main/java/com/azure/storage/file/datalake/DataLakeFileClient.java#L935]),
as a result Azure +doesn’t+ emit a {{FlushWithClose}} event (CSMS processes
this event type).
##
[Rename|https://learn.microsoft.com/en-us/dotnet/api/azure.storage.files.datalake.datalakefileclient.rename?view=azure-dotnet]
the temp file to its final name; this generates a
[BlobRenamed|https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&tabs=event-grid-event-schema#microsoftstorageblobrenamed-event-data-lake-storage-gen2-1]
that’s processed by CSMS (but ignored, see below).
# The behaviour observed by the customer is explained by:
## When (1d) happens, CSMS receives the BlobRenamed event for a file whose
source-file metadata it does not know. The BlobRenamed event does not contain
all the information CSMS needs (it lacks the {{etag}} and the {{blob size}}),
so [CSMS ignores the
event|https://github.com/databricks/universe/blob/master/jobs-cloud-storage-meta/src/storage/StorageHelper.scala#L600]
and no object is created.
## When CSMS performs the daily full scan, it finds the renamed file and
creates an object for it in its database. This causes the file arrival trigger
to find the file and trigger a (delayed) run.{quote}
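For reference, the flush behaviour described in step (1c) can be sketched with the Azure SDK for Java. This is an illustrative fragment only, not NiFi's actual code: the authenticated {{DataLakeFileClient}} is assumed to be supplied by the caller, and the method and class names here are hypothetical.

```java
// Illustrative sketch (not NiFi's implementation): flushing an ADLS Gen2 file
// with close=true so that Azure emits a FlushWithClose event. Assumes the
// azure-storage-file-datalake SDK is on the classpath and an authenticated
// client is provided by the caller.
import com.azure.storage.file.datalake.DataLakeFileClient;

public class FlushWithCloseSketch {
    static void flushAndClose(DataLakeFileClient fileClient, long length) {
        // The plain flush(position, overwrite) overload leaves close=false,
        // which matches the behaviour described above: Azure emits no
        // FlushWithClose event. flushWithResponse exposes the close flag:
        fileClient.flushWithResponse(
                length, // position: total number of bytes appended so far
                false,  // retainUncommittedData
                true,   // close: ask the service to emit FlushWithClose
                null,   // httpHeaders
                null,   // requestConditions
                null,   // timeout
                null);  // context
    }
}
```

If NiFi flushed with close=true (or called flush again with the close flag set after the final append), downstream consumers that key off {{FlushWithClose}}, such as Databricks' CSMS, would see the write immediately instead of waiting for the daily full scan.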
> PutAzureDataLakeStorage does not cause Azure to emit a FlushWithClose event
> on file write
> -----------------------------------------------------------------------------------------
>
> Key: NIFI-13930
> URL: https://issues.apache.org/jira/browse/NIFI-13930
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.23.2
> Environment: amd64, Windows Server 2022, Java 21.0.3
> Reporter: Mark Ward
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)