[
https://issues.apache.org/jira/browse/NIFI-11971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Payne reopened NIFI-11971:
-------------------------------
> FlowFile content is corrupted across the whole NiFi instance throughout
> ProcessSession::write with omitting writing any byte to OutputStream
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-11971
> URL: https://issues.apache.org/jira/browse/NIFI-11971
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 1.23.0, 1.23.1
> Reporter: Serhii Nesterov
> Assignee: Mark Payne
> Priority: Blocker
> Labels: corruption
> Fix For: 2.0.0, 1.24.0, 1.23.2
>
> Attachments: image-2023-08-20-19-31-16-598.png,
> image-2023-08-20-19-37-43-772.png, image-2023-08-20-19-38-03-391.png,
> image-2023-08-20-19-42-37-029.png, image-2023-08-20-19-43-03-697.png,
> image-2023-08-20-20-01-50-445.png, image-2023-08-21-13-21-31-091.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> One of the scenarios for ProcessSession::write was broken after recent code
> refactoring within the following pull request:
> [https://github.com/apache/nifi/pull/7363/files]
> The issue is located in _StandardContentClaimWriteCache.java_ in the
> _write(final ContentClaim claim)_ method that returns an _OutputStream_ used
> in the _OutputStreamCallback_ interface to let NiFi processors write flowfile
> content through the _ProcessSession::write_ method.
> If a processor calls _session.write_ but does not write any data to the
> output stream, none of the _OutputStream_ write methods is invoked, so the
> length of the content claim is never recomputed and keeps its default value
> of *-1*. Since the refactoring creates a new content claim on each
> _ProcessSession::write_ invocation, the following formula gives the wrong
> result:
> {code:java}
> previous offset + previous length = new offset.{code}
> or as in the codebase:
> {code:java}
> scc = new StandardContentClaim(scc.getResourceClaim(), scc.getOffset() +
> scc.getLength());{code}
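A minimal sketch of why the length can stay at -1, assuming the length is only updated from inside the stream's write methods as described above. The names LengthTrackingSketch, Claim, TrackingStream, Callback and lengthAfter are hypothetical stand-ins for illustration, not NiFi classes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class LengthTrackingSketch {
    /** Hypothetical stand-in for a content claim; the length starts unset (-1). */
    public static class Claim {
        public long length = -1L;
    }

    /** Updates the claim length only when one of the write methods is called. */
    public static class TrackingStream extends OutputStream {
        private final OutputStream delegate;
        private final Claim claim;
        private long written = 0L;

        public TrackingStream(OutputStream delegate, Claim claim) {
            this.delegate = delegate;
            this.claim = claim;
        }

        @Override
        public void write(int b) throws IOException {
            delegate.write(b);
            written++;
            claim.length = written;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            delegate.write(b, off, len);
            written += len;
            claim.length = written;
        }
    }

    /** Hypothetical callback, playing the role of OutputStreamCallback. */
    public interface Callback {
        void process(OutputStream out) throws IOException;
    }

    /** Runs the callback against a fresh claim and returns the resulting length. */
    public static long lengthAfter(Callback callback) {
        Claim claim = new Claim();
        try (OutputStream out = new TrackingStream(new ByteArrayOutputStream(), claim)) {
            callback.process(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return claim.length;
    }

    public static void main(String[] args) {
        System.out.println(lengthAfter(out -> { }));                       // nothing written
        System.out.println(lengthAfter(out -> out.write(new byte[] {1, 2, 3})));
    }
}
```

In this model a callback that writes nothing never reaches the code that sets the length, so the claim still reports -1 when the next claim's offset is computed.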
> For example, if the previous offset was 1000 and nothing was written to the
> stream (length is -1), then 1000 + (-1) gives 999, so the offset is shifted
> back by one. The next content therefore gains an extra character from the
> previous content at the beginning and loses its last character at the end.
> All other FlowFiles anywhere in NiFi are corrupted by this defect until the
> NiFi instance is restarted.
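The arithmetic above can be checked with a small standalone sketch. ClaimOffsetSketch and both methods are hypothetical; the guarded variant is only one possible way to avoid the shift and is not necessarily the fix applied in NiFi:

```java
public class ClaimOffsetSketch {
    // Default length of a claim that was never written to, as described in the issue.
    public static final long UNSET_LENGTH = -1L;

    /** Mirrors "previous offset + previous length = new offset" with no guard. */
    public static long nextOffsetUnguarded(long previousOffset, long previousLength) {
        return previousOffset + previousLength;
    }

    /** Assumption for illustration: treat an unset (negative) length as zero bytes. */
    public static long nextOffsetGuarded(long previousOffset, long previousLength) {
        return previousOffset + Math.max(0L, previousLength);
    }

    public static void main(String[] args) {
        System.out.println(nextOffsetUnguarded(1000L, UNSET_LENGTH)); // 999
        System.out.println(nextOffsetGuarded(1000L, UNSET_LENGTH));   // 1000
    }
}
```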
> The following steps can be taken to reproduce the issue (critical in our
> commercial project):
> * Create an empty text file (“a.txt”);
> * Create a text file with any text (“b.txt”);
> * Package these files into a .zip archive;
> * Put it into a file system on Azure Cloud (we use ADLS Gen2);
> * Read the zip file and unpack its content on the NiFi Canvas using the
> _FetchAzureDataLakeStorage_ and _UnpackContent_ processors;
> * Start a flow with the _GenerateFlowFile_ processor and check the results.
> The empty file must be extracted before the non-empty file, otherwise the
> issue won’t reproduce. The content of the second FlowFile will be corrupted:
> its first character is an unreadable character from the zip archive (the
> last character of the zip content fetched with _FetchAzureDataLakeStorage_)
> and its last character is lost. From this point on, NiFi cannot be used at
> all, because any other processor will corrupt FlowFile content across the
> entire NiFi instance due to the shifted offset.
> A sample canvas:
> !image-2023-08-20-19-31-16-598.png|width=969,height=492!
>
> Important note: the issue is not reproducible if the empty file is the last
> file to be extracted (the length will be reset when the processor
> completes), or if you do not call _session.write()_ when a file has 0 bytes
> (for example, in a custom processor that implements such logic).
> The offsets for the above picture look as follows (#1 - after
> fetching and unpacking an empty file, #2 - before unpacking the second file):
> !image-2023-08-20-19-37-43-772.png|width=961,height=32!
> !image-2023-08-20-19-38-03-391.png|width=960,height=35!
> 1524 is the offset after FetchAzureDataLakeStorage and UnpackContent for the
> empty file. The length *-1* is kept instead of *0* and used for the next
> file, which is why the next offset is equal to 1523 (*1524 + (-1) = 1523*).
> If your file has the "Hello world" text inside, then after downloading this
> unpacked file from NiFi you'll see (the first character here is a space):
> !image-2023-08-20-20-01-50-445.png!
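The symptom can be imitated with a simplified model, assuming the bytes are appended to a shared resource while the recorded offset is computed from the previous claim. ShiftedOffsetSketch, read and simulate are hypothetical names, and the previous content "ZIP!" is made up for the example:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ShiftedOffsetSketch {
    /** Reads length bytes starting at offset from the shared backing bytes. */
    public static String read(byte[] resource, long offset, int length) {
        int from = (int) offset;
        byte[] slice = Arrays.copyOfRange(resource, from, from + length);
        return new String(slice, StandardCharsets.US_ASCII);
    }

    /** Writes previous content, an empty entry, then next content; returns what is read back. */
    public static String simulate(String previousContent, String nextContent) {
        ByteArrayOutputStream resource = new ByteArrayOutputStream();

        byte[] previous = previousContent.getBytes(StandardCharsets.US_ASCII);
        resource.write(previous, 0, previous.length);

        // Claim for the empty entry: starts at the end of the resource, length never updated.
        long emptyClaimOffset = previous.length;
        long emptyClaimLength = -1L;

        // Recorded offset of the next claim is one byte too early.
        long nextClaimOffset = emptyClaimOffset + emptyClaimLength;

        // The actual bytes are still appended at the true end of the resource.
        byte[] next = nextContent.getBytes(StandardCharsets.US_ASCII);
        resource.write(next, 0, next.length);

        return read(resource.toByteArray(), nextClaimOffset, next.length);
    }

    public static void main(String[] args) {
        System.out.println(simulate("ZIP!", "Hello world")); // !Hello worl
    }
}
```

The content read back starts with the last byte of the previous content and is missing its own last character, which matches the behaviour reported above.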
> Different processors will report various errors due to the corrupted
> content, especially for JSON content and queries:
> !image-2023-08-20-19-42-37-029.png!
> !image-2023-08-20-19-43-03-697.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)