[ 
https://issues.apache.org/jira/browse/NIFI-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tamas Palfy reassigned NIFI-8676:
---------------------------------

    Assignee: Tamas Palfy

> ListS3 and ListGCSBucket sometimes miss objects in very active buckets
> ----------------------------------------------------------------------
>
>                 Key: NIFI-8676
>                 URL: https://issues.apache.org/jira/browse/NIFI-8676
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.13.2
>            Reporter: Paul Kelly
>            Assignee: Tamas Palfy
>            Priority: Major
>              Labels: gcs, s3
>         Attachments: flow.xml.gz
>
>
> ListS3 and ListGCSBucket occasionally miss some objects in very active 
> buckets and never list them. Testing suggests that relying exclusively on 
> an object's last-modified date for state tracking is unreliable when a 
> large batch of objects of various sizes is uploaded simultaneously. Newer 
> but smaller files sometimes appear in listings before older but larger 
> files (presumably because smaller uploads complete sooner), which corrupts 
> the timestamp tracking state of the ListS3 and ListGCSBucket processors.
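The failure mode above can be sketched in a few lines. This is an illustrative model, not NiFi code: the object names and timestamps are hypothetical, and `list_new` stands in for the timestamp-watermark logic the List processors use.

```python
# Illustrative sketch: why tracking only the newest last-modified
# timestamp can permanently skip objects. Names/times are made up.

def list_new(objects, state_ts):
    """Return objects newer than the watermark, plus the updated watermark."""
    new = [o for o in objects if o["last_modified"] > state_ts]
    if new:
        state_ts = max(o["last_modified"] for o in new)
    return new, state_ts

state = 0
# Listing 1: a small file uploaded at t=5 finishes and becomes visible
# before a large file started earlier (t=3) has finished uploading.
visible = [{"key": "small.txt", "last_modified": 5}]
listed, state = list_new(visible, state)   # watermark advances to 5

# Listing 2: the large file (t=3) is now visible, but 3 <= 5, so it is
# filtered out by the watermark and is never listed.
visible = [{"key": "small.txt", "last_modified": 5},
           {"key": "large.bin", "last_modified": 3}]
listed2, state = list_new(visible, state)  # listed2 is empty
```

Once the watermark has moved past an object's last-modified time, no later listing cycle can recover it without clearing the processor's state.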
> We have flows that operate as ListS3 -> FetchS3Object -> DeleteS3Object -> 
> (downstream processing) and ListGCSBucket -> FetchGCSObject -> 
> DeleteGCSObject -> (downstream processing). We often notice files 
> remaining in the bucket until we manually clear the state of the relevant 
> List processor and restart it. Examining the provenance logs shows that 
> the remaining objects were never listed, and logs from the downstream 
> processing confirm that those objects never arrived there.
> Attached is a sample flow.xml.gz file which replicates this problem by 
> simulating extreme conditions for both GCS and S3. Two GenerateFlowFile 
> processors run with a schedule of 0.01 seconds. One of them generates flow 
> files of size 1B and the other generates flow files of size 1GB. These feed 
> into a PutS3Object or PutGCSObject processor which is set to use 10 
> concurrent threads, thus allowing 10 files to be uploaded simultaneously. The 
> queue that is connected to the Put processors does not limit the number or 
> size of flow files, in order to prevent backpressure from unbalancing the 
> mix of small and large sample flow files being uploaded simultaneously.
> Another flow within the attached sample flow.xml.gz file uses 
> ListS3/ListGCSBucket -> DeleteS3Object/DeleteGCSObject to mimic the receiving 
> end where objects are missed. The List processors are set to a run schedule 
> of 0 seconds to cause listing to occur as frequently as possible. After 
> starting both the sending and receiving flows, you should see within a few 
> seconds to a minute that the counts of flow files put into GCS or S3 are 
> higher than the count of flow files output by the List processors. 
> Additionally, if you stop the Put flow but let the receiving flow with its 
> Delete processor continue to run, objects will remain in the bucket even 
> after all queues are flushed. Examining provenance logs will confirm that 
> those objects were never listed. Stopping the List processor, clearing its 
> state, and restarting it will cause these remaining objects to be listed and 
> then deleted by the Delete processor.
> We do not run into this problem with ListAzureBlobStorage since we can set it 
> to track entities and not just track timestamps. ListS3 and ListGCSBucket do 
> not allow tracking by entities and are hard-coded to only track timestamps. 
> It'd be great if they could track by entities or if the timestamp issue could 
> be resolved.
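The entity-tracking alternative mentioned above can be sketched as follows. This is a simplified, hypothetical model of the idea behind NiFi's entity tracking, not its actual implementation: instead of a single timestamp watermark, the processor remembers which entities it has already emitted, so an object with an older timestamp that surfaces late is still listed.

```python
# Illustrative sketch of entity-based tracking (not NiFi's actual
# ListedEntityTracker code). Remember (key, last_modified) pairs already
# emitted, pruning old entries to bound state size.

TRACKING_WINDOW = 60  # seconds; assumed pruning horizon for this sketch

def list_new_entities(objects, seen, now):
    """Emit any object not already recorded in `seen`; prune old entries."""
    new = []
    for o in objects:
        ident = (o["key"], o["last_modified"])
        if ident not in seen:
            new.append(o)
            seen[ident] = now
    # drop entries outside the tracking window so state stays bounded
    for ident in [i for i, t in seen.items() if now - t > TRACKING_WINDOW]:
        del seen[ident]
    return new

seen = {}
first = list_new_entities([{"key": "small.txt", "last_modified": 5}],
                          seen, now=10)
# The late-arriving object with the *older* timestamp is still emitted:
second = list_new_entities([{"key": "small.txt", "last_modified": 5},
                            {"key": "large.bin", "last_modified": 3}],
                           seen, now=11)
```

With this scheme the late-surfacing `large.bin` from the earlier scenario is listed on the next cycle rather than being filtered out by a watermark.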



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
