[ 
https://issues.apache.org/jira/browse/NIFI-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Turcsanyi resolved NIFI-15969.
------------------------------------
    Fix Version/s: 2.10.0
                       (was: 2.9.0)
       Resolution: Fixed

> PutS3Object can corrupt data when two files with the same name are 
> simultaneously uploaded with multipart
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-15969
>                 URL: https://issues.apache.org/jira/browse/NIFI-15969
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Paul Kelly
>            Assignee: Rakesh Kumar Singh
>            Priority: Major
>             Fix For: 2.10.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is a very rare edge case, but is something we have seen happen a handful 
> of times over the years.  It happened again recently and I was able to review 
> logs to identify the cause.
> PutS3Object keeps track of its multipart state based only on the bucket name 
> and object key.  If you try to simultaneously upload two files to a bucket 
> and those files have the same name, the various parts will get mixed together 
> and the data that ultimately ends up in S3 is corrupt.
> For this to happen, the files have to have the same name, use the same 
> tracking directory (either because it's the same node with local storage or 
> because it's using network storage shared across nodes), be large enough 
> that they get uploaded with multipart, and be large enough that they are both 
> uploading at the same time.  Because of how the multipart state is tracked, 
> it doesn't matter if a single PutS3Object processor is scheduled with 
> multiple threads, or if two different PutS3Object processors on the same NiFi 
> node happen to upload files with the same names to the same bucket.
> I know this is rare, but there are valid uses for sending data with the same 
> name and expecting two versions to end up in the bucket.  We see it when we 
> use NiFi to download data from a system where we do not control the file 
> names, and store those results in a versioned S3 bucket.
> For a versioned bucket, ultimately what should happen is you should end up 
> with two different versions of the object, one for each upload.  For a 
> non-versioned bucket, whichever upload finishes last would replace the first 
> object.  What is happening is we end up with one corrupt object containing 
> parts of both uploads regardless of the versioning.
> I think it would make sense to add the flow file's uuid into the state 
> tracking so that the state for two different flow files cannot be mixed 
> together.  If one flow file needs to retry the upload, most of the time it 
> will have the same uuid and PutS3Object can restore the state as it does now.
> If it has a new uuid, the upload will start again from the beginning with a 
> fresh state, which is better than ending up with a corrupt file in the 
> mentioned scenario.
> There are ways to work around this, of course, such as appending the uuid to 
> the key name when uploading the file, but this is not ideal for when we don't 
> control the downstream system that ingests from the S3 bucket.  Likewise, we 
> could also schedule PutS3Object to only run one thread, but this has 
> throughput implications, and it still doesn't fix the issue of having two 
> PutS3Objects uploading files with the same name at the same time.
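
The collision described above, and the proposed uuid-scoped tracking, can be illustrated with a small standalone sketch. This is not the PutS3Object implementation or the committed fix. The class name, the method names, the in-memory map standing in for the tracking directory, and the key formats are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every name here is hypothetical, not NiFi code.
public class MultipartStateKeySketch {

    // State key built from bucket and object key only, as the report describes.
    static String bucketAndKey(String bucket, String objectKey) {
        return bucket + "/" + objectKey;
    }

    // State key that also includes the flow file's uuid, as the report proposes.
    static String bucketKeyAndUuid(String bucket, String objectKey, String flowFileUuid) {
        return bucket + "/" + objectKey + "/" + flowFileUuid;
    }

    // Records one uploaded part under the given state key.
    static void recordPart(Map<String, List<String>> state, String stateKey, String partTag) {
        state.computeIfAbsent(stateKey, k -> new ArrayList<>()).add(partTag);
    }

    public static void main(String[] args) {
        String bucket = "my-bucket";   // assumed bucket name
        String objectKey = "data.bin"; // same object key for both uploads
        String uuidA = "uuid-a";       // placeholder flow file uuids
        String uuidB = "uuid-b";

        // Keyed by bucket and object key: both uploads share one state entry.
        Map<String, List<String>> shared = new HashMap<>();
        recordPart(shared, bucketAndKey(bucket, objectKey), "A-part-1");
        recordPart(shared, bucketAndKey(bucket, objectKey), "B-part-1");
        recordPart(shared, bucketAndKey(bucket, objectKey), "A-part-2");
        recordPart(shared, bucketAndKey(bucket, objectKey), "B-part-2");

        // Keyed by bucket, object key, and uuid: each upload keeps its own entry.
        Map<String, List<String>> scoped = new HashMap<>();
        recordPart(scoped, bucketKeyAndUuid(bucket, objectKey, uuidA), "A-part-1");
        recordPart(scoped, bucketKeyAndUuid(bucket, objectKey, uuidB), "B-part-1");
        recordPart(scoped, bucketKeyAndUuid(bucket, objectKey, uuidA), "A-part-2");
        recordPart(scoped, bucketKeyAndUuid(bucket, objectKey, uuidB), "B-part-2");

        System.out.println("shared entries: " + shared.size());
        System.out.println("mixed parts:    " + shared.get(bucketAndKey(bucket, objectKey)));
        System.out.println("scoped entries: " + scoped.size());
        System.out.println("upload A parts: " + scoped.get(bucketKeyAndUuid(bucket, objectKey, uuidA)));
    }
}
```

With the bucket-and-key scheme, the single shared entry ends up listing parts from both uploads, which mirrors the mixed object described in the report. With the uuid in the key, each upload's part list stays separate, and a retry that keeps the same uuid would still find its earlier state.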



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
