Paul Kelly created NIFI-15969:
---------------------------------

             Summary: PutS3Object can corrupt data when two files with the same 
name are simultaneously uploaded with multipart
                 Key: NIFI-15969
                 URL: https://issues.apache.org/jira/browse/NIFI-15969
             Project: Apache NiFi
          Issue Type: Bug
            Reporter: Paul Kelly
             Fix For: 2.9.0


This is a very rare edge case, but we have seen it happen a handful of times
over the years.  It happened again recently, and I was able to review the logs
to identify the cause.

PutS3Object keeps track of its multipart state based only on the bucket name 
and object key.  If you try to simultaneously upload two files to a bucket and 
those files have the same name, the various parts will get mixed together and 
the data that ultimately ends up in S3 is corrupt.
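
To illustrate the collision described above, here is a minimal sketch. It is a
simplified model and not NiFi's actual code: the class, the map, and the
"bucket/key" string are all hypothetical names chosen for the example. It only
shows that when state is keyed by bucket and object key alone, parts from two
different flow files land in the same state entry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the reported behavior; every name here is hypothetical.
class SharedStateCollisionDemo {
    // Multipart state keyed only by bucket and object key, as the report describes.
    static final Map<String, List<String>> stateByObject = new HashMap<>();

    static String stateKey(String bucket, String objectKey) {
        return bucket + "/" + objectKey;
    }

    // Records a finished part against whatever state entry the key resolves to.
    static void recordPart(String bucket, String objectKey, String partLabel) {
        stateByObject
            .computeIfAbsent(stateKey(bucket, objectKey), k -> new ArrayList<>())
            .add(partLabel);
    }

    public static void main(String[] args) {
        // Two different flow files (A and B) with the same bucket and name,
        // uploading at the same time so that their parts interleave.
        recordPart("my-bucket", "report.csv", "A-part1");
        recordPart("my-bucket", "report.csv", "B-part1");
        recordPart("my-bucket", "report.csv", "A-part2");

        // Only one state entry exists, and it mixes parts from both uploads.
        System.out.println(stateByObject);
    }
}
```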

For this to happen, the files have to have the same name, use the same tracking
directory (either because they are on the same node with local storage or
because the directory is on network storage shared across nodes), be large
enough to be uploaded with multipart, and be large enough that both uploads are
in progress at the same time.  Because of how the multipart state is tracked,
it doesn't matter whether a single PutS3Object processor is scheduled with
multiple threads or two different PutS3Object processors on the same NiFi node
happen to upload files with the same name to the same bucket.

I know this is rare, but there are valid uses for sending data with the same 
name and expecting two versions to end up in the bucket.  We see it when we use 
NiFi to download data from a system where we do not control the file names, and 
store those results in a versioned S3 bucket.

For a versioned bucket, you should end up with two different versions of the
object, one for each upload.  For a non-versioned bucket, whichever upload
finishes last should replace the first object.  Instead, we end up with one
corrupt object containing parts of both uploads, regardless of versioning.

I think it would make sense to add the flow file's uuid into the state tracking 
so that the state for two different flow files cannot be mixed together.  If 
one flow file needs to retry the upload, most of the time it will have the same 
uuid and PutS3Object can restore the state as it does now.  If it has a new 
uuid, the upload will start again from the beginning with a fresh state, which 
is better than ending up with a corrupt file in the mentioned scenario.
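
A rough sketch of the idea, again as a simplified model rather than a patch
against PutS3Object: the class, method names, and the "#" separator are
assumptions made for illustration. Including the flow file's uuid in the state
key gives each upload its own entry, a retry with the same uuid finds its
earlier state, and a new uuid starts from an empty state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the proposed change; every name here is hypothetical.
class UuidScopedStateDemo {
    // Multipart state keyed by bucket, object key, and flow file uuid.
    static final Map<String, List<Integer>> stateByUpload = new HashMap<>();

    // The "#" separator is an arbitrary choice for this sketch. A real
    // implementation would need a key format that stays unambiguous for
    // any valid object key.
    static String stateKey(String bucket, String objectKey, String flowFileUuid) {
        return bucket + "/" + objectKey + "#" + flowFileUuid;
    }

    // Returns the completed part numbers for this upload, creating an
    // empty state if none exists yet.
    static List<Integer> completedParts(String bucket, String objectKey, String flowFileUuid) {
        return stateByUpload.computeIfAbsent(
            stateKey(bucket, objectKey, flowFileUuid), k -> new ArrayList<>());
    }

    public static void main(String[] args) {
        // Two flow files with the same name no longer share state.
        completedParts("my-bucket", "report.csv", "uuid-a").add(1);
        completedParts("my-bucket", "report.csv", "uuid-b").add(1);

        // A retry with the same uuid resumes from its own earlier state.
        completedParts("my-bucket", "report.csv", "uuid-a").add(2);

        // A flow file with a new uuid starts from an empty state.
        System.out.println(completedParts("my-bucket", "report.csv", "uuid-c").isEmpty());
        System.out.println(completedParts("my-bucket", "report.csv", "uuid-a"));
    }
}
```

The trade-off is the one noted above: a retry that arrives with a different
uuid cannot resume and uploads from the beginning.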

There are ways to work around this, of course, such as appending the uuid to
the key name when uploading the file, but this is not ideal when we don't
control the downstream system that ingests from the S3 bucket.  Likewise, we
could schedule PutS3Object to run only one thread, but this has throughput
implications, and it still doesn't fix the issue of two PutS3Object processors
uploading files with the same name at the same time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
