Gábor Gyimesi created MINIFICPP-2400:
----------------------------------------
Summary: Partially cloned flow file having the parent's resource
claim causes slow reads
Key: MINIFICPP-2400
URL: https://issues.apache.org/jira/browse/MINIFICPP-2400
Project: Apache NiFi MiNiFi C++
Issue Type: Improvement
Reporter: Gábor Gyimesi
The SplitText processor splits a text file stored in a flow file into multiple
smaller flow files on line boundaries, by cloning parts of the original flow
file. The clone() function creates a new flow file, but the resource claim
is inherited, so the content is only stored once, and the new flow files only
reference the desired parts.
When a large flow file is split into many small flow files and the split flow
files are processed by the next processor in the flow, their content is usually
read from the database by that processor. Because these flow files share a
single RocksDB value, the whole data in that resource claim has to be read
instead of only the part referenced by the smaller flow file (with the current
use of RocksDB's Get function). This makes processing the split flow files very
slow: even if a flow file is only 1KB in size, if it references a 100MB value
in the DB, we have to read all 100MB of data from RocksDB.
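The read amplification described above can be illustrated with a minimal,
self-contained sketch. It does not use the MiNiFi or RocksDB APIs: ContentStore,
FlowFile and readContent are hypothetical names, and the store only mimics a
key-value lookup that always returns the whole value, as a plain Get does.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a key-value content repository. Like a plain
// key-value Get, read() always returns the whole value stored under a key.
struct ContentStore {
  std::map<std::string, std::string> values;
  std::size_t bytes_read = 0;  // total bytes fetched from the store so far

  const std::string& read(const std::string& claim) {
    const std::string& value = values.at(claim);
    bytes_read += value.size();
    return value;
  }
};

// Hypothetical flow file: a view (offset, size) into a shared resource claim.
struct FlowFile {
  std::string claim;
  std::size_t offset;
  std::size_t size;
};

// Reads a flow file's content by fetching the full claim and then slicing it,
// so the cost is proportional to the claim size, not to the flow file size.
std::string readContent(ContentStore& store, const FlowFile& ff) {
  const std::string& whole = store.read(ff.claim);
  assert(ff.offset + ff.size <= whole.size());
  return whole.substr(ff.offset, ff.size);
}
```

In this model, reading a 24-byte flow file that points into a 1024-byte claim
increases bytes_read by 1024. Scaled to the numbers above, splitting a 100MB
claim into 1KB pieces would yield on the order of 100,000 flow files, each of
which triggers a 100MB read.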
We should investigate the options to avoid this issue and implement a viable
solution. The options that could be investigated:
* Create a version of the clone() function that creates a new resource claim
instead of referencing part of the original
* Remove the option to partially clone a flow file and create new flow files
with the desired content
* Find a way to read from RocksDB only the part of the value used by the flow
file, instead of the whole value of the resource claim
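As a rough sketch of the first option, a clone variant could copy the requested
range into a new resource claim once, at split time, so that later reads only
touch the small value. The names below (ContentStore, FlowFile,
cloneWithNewClaim) are hypothetical and do not reflect the actual MiNiFi C++
session API; the store is a plain in-memory map used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical flow file: a view (offset, size) into a resource claim.
struct FlowFile {
  std::string claim;
  std::size_t offset;
  std::size_t size;
};

// Hypothetical content repository: claim id -> stored value.
using ContentStore = std::map<std::string, std::string>;

// Creates a child flow file that owns its own claim. The range
// [offset, offset + size) is relative to the parent's content and is copied
// into a new value stored under new_claim.
FlowFile cloneWithNewClaim(ContentStore& store, const FlowFile& parent,
                           std::size_t offset, std::size_t size,
                           const std::string& new_claim) {
  assert(offset + size <= parent.size);
  const std::string& source = store.at(parent.claim);
  std::string slice = source.substr(parent.offset + offset, size);
  store[new_claim] = slice;
  return FlowFile{new_claim, 0, size};
}
```

The trade-off in this sketch is that the content is stored twice while the
parent claim is still alive and the copy costs one full pass over the data, in
exchange for each child read being proportional to the child's own size.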
Other processors using the session.clone() mechanism should also be investigated.