[
https://issues.apache.org/jira/browse/NIFI-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tamas Palfy reassigned NIFI-8676:
---------------------------------
Assignee: Tamas Palfy
> ListS3 and ListGCSBucket sometimes miss objects in very active buckets
> ----------------------------------------------------------------------
>
> Key: NIFI-8676
> URL: https://issues.apache.org/jira/browse/NIFI-8676
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.13.2
> Reporter: Paul Kelly
> Assignee: Tamas Palfy
> Priority: Major
> Labels: gcs, s3
> Attachments: flow.xml.gz
>
>
> ListS3 and ListGCSBucket occasionally miss objects in very active buckets
> and never list them. Through testing, it appears that exclusively using an
> object's last modified date for state tracking is unreliable when a large
> dump of objects of various sizes is uploaded simultaneously. Newer but
> smaller files are sometimes listed before older but larger files, which
> advances the timestamp tracking state of the ListS3 and ListGCSBucket
> processors past objects that have not yet appeared in a listing.
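> A minimal sketch (hypothetical Java, not NiFi's actual implementation) of
> how a timestamp-only watermark can skip an object: a small object becomes
> visible first with a newer last-modified time, the watermark advances, and
> a larger object that surfaces later with an older last-modified time is
> filtered out forever. The class and method names here are illustrative only.
> {code:java}
> import java.util.*;
>
> class TimestampOnlyLister {
>     private long watermark = 0L;  // highest last-modified time seen so far
>
>     // Emits only objects newer than the watermark, then advances it.
>     List<String> listNew(Map<String, Long> bucket /* key -> lastModified */) {
>         List<String> newKeys = new ArrayList<>();
>         long newWatermark = watermark;
>         for (Map.Entry<String, Long> e : bucket.entrySet()) {
>             if (e.getValue() > watermark) {
>                 newKeys.add(e.getKey());
>                 newWatermark = Math.max(newWatermark, e.getValue());
>             }
>         }
>         watermark = newWatermark;
>         return newKeys;
>     }
> }
>
> class Demo {
>     public static void main(String[] args) {
>         TimestampOnlyLister lister = new TimestampOnlyLister();
>         Map<String, Long> bucket = new HashMap<>();
>
>         bucket.put("small.txt", 100L);              // small upload finishes fast
>         System.out.println(lister.listNew(bucket)); // [small.txt], watermark=100
>
>         bucket.put("large.bin", 90L);               // large upload appears later,
>                                                     // with an older timestamp
>         System.out.println(lister.listNew(bucket)); // [] -- large.bin never listed
>     }
> }
> {code}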
> We have flows that operate as ListS3 -> FetchS3Object -> DeleteS3Object ->
> (downstream processing) and ListGCSBucket -> FetchGCSObject ->
> DeleteGCSObject -> (downstream processing). We often notice files remain in
> the bucket until we manually clear the state of the relevant List processor
> and restart it. Examining the provenance logs shows that the objects that
> remained were never listed, which is confirmed by logs within the downstream
> processing showing the objects never made it there.
> Attached is a sample flow.xml.gz file which replicates this problem by
> simulating extreme conditions for both GCS and S3. Two GenerateFlowFile
> processors run with a schedule of 0.01 seconds. One of them generates flow
> files of size 1B and the other generates flow files of size 1GB. These feed
> into a PutS3Object or PutGCSObject processor which is set to use 10
> concurrent threads, thus allowing 10 files to be uploaded simultaneously. The
> queue connected to the Put processors does not limit the number or size of
> flow files; this prevents backpressure from unbalancing the mix of small and
> large sample flow files being uploaded simultaneously.
> Another flow within the attached sample flow.xml.gz file uses
> ListS3/ListGCSBucket -> DeleteS3Object/DeleteGCSObject to mimic the receiving
> end where objects are missed. The List processors are set to a run schedule
> of 0 seconds to cause listing to occur as frequently as possible. After
> starting both the sending and receiving flows, you should see within a few
> seconds to a minute that the count of flow files put into GCS or S3 is higher
> than the count of flow files output by the List processors.
> Additionally, if you stop the Put flow but let the receiving flow with its
> Delete processor continue to run, objects will remain in the bucket even
> after all queues are flushed. Examining provenance logs will confirm that
> those objects were never listed. Stopping the List processor, clearing its
> state, and restarting it will cause these remaining objects to be listed and
> then deleted by the Delete processor.
> We do not run into this problem with ListAzureBlobStorage since we can set it
> to track entities and not just track timestamps. ListS3 and ListGCSBucket do
> not allow tracking by entities and are hard-coded to only track timestamps.
> It'd be great if they could track by entities or if the timestamp issue could
> be resolved.
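> A minimal sketch (again hypothetical, not NiFi's code) of the entity-tracking
> alternative: remember which keys have already been emitted instead of relying
> only on a timestamp watermark, so a late-appearing object with an older
> last-modified time is still listed. A real implementation, like the Tracking
> Entities strategy mentioned above for ListAzureBlobStorage, would also track
> timestamps to detect updated objects and age out old entries to bound the
> state size; this sketch shows only the core idea.
> {code:java}
> import java.util.*;
>
> class EntityTrackingLister {
>     private final Set<String> seenKeys = new HashSet<>();
>
>     // Emits every object not seen before, regardless of its timestamp.
>     List<String> listNew(Map<String, Long> bucket /* key -> lastModified */) {
>         List<String> newKeys = new ArrayList<>();
>         for (String key : bucket.keySet()) {
>             if (seenKeys.add(key)) {  // add() is false if key is already tracked
>                 newKeys.add(key);
>             }
>         }
>         return newKeys;
>     }
> }
> {code}
> With this approach, the large.bin object from the earlier example would be
> emitted on the listing cycle in which it first appears, even though its
> last-modified time is older than the watermark.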
--
This message was sent by Atlassian Jira
(v8.3.4#803005)