[jira] [Created] (MINIFICPP-2400) Partially cloned flow file having the parent's resource claim causes slow reads

Jira Tue, 11 Jun 2024 09:15:06 -0700

Gábor Gyimesi created MINIFICPP-2400:
----------------------------------------


             Summary: Partially cloned flow file having the parent's resource 
claim causes slow reads
                 Key: MINIFICPP-2400
                 URL: https://issues.apache.org/jira/browse/MINIFICPP-2400
             Project: Apache NiFi MiNiFi C++
          Issue Type: Improvement
            Reporter: Gábor Gyimesi


The SplitText processor splits a text file stored in a flow file into multiple 
smaller flow files on line boundaries, by cloning parts of the original flow 
file. The clone() function creates a new flow file, but the resource claim 
is inherited, so the content is stored only once, and the new flow files only 
reference the desired parts.

When a large flow file is split into many small flow files, the next processor 
in the flow usually reads the content of each split flow file from the 
database. Because all of these flow files share a single RocksDB value, the 
whole resource claim has to be read rather than only the part referenced by 
the smaller flow file (with the current use of RocksDB's Get function). This 
makes processing the split flow files very slow: even if a flow file is only 
1KB in size, if it references a 100MB value in the DB, then all 100MB of data 
has to be read from RocksDB.
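
The read amplification described above can be illustrated with a small, 
self-contained model. This is only a sketch: ToyContentStore, ToyFlowFile, 
readContent and bytesFetchedForClones are hypothetical names invented for the 
illustration, and the in-memory map merely stands in for the content 
repository. It does not use the MiNiFi or RocksDB APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a key-value content store (not the RocksDB API).
struct ToyContentStore {
  std::map<std::string, std::string> values;
  std::size_t bytes_fetched = 0;  // total bytes pulled out of the store so far

  // Models a whole-value lookup: the full value is fetched no matter how
  // little of it the caller actually needs.
  std::string getWhole(const std::string& key) {
    const std::string& value = values.at(key);
    bytes_fetched += value.size();
    return value;
  }
};

// Hypothetical partially cloned flow file: it shares the parent's claim and
// only records a window (offset, size) into it.
struct ToyFlowFile {
  std::string claim_key;
  std::size_t offset;
  std::size_t size;
};

// Reads a flow file's content by fetching the whole claim and slicing it
// afterwards, which is the access pattern the issue describes.
std::string readContent(ToyContentStore& store, const ToyFlowFile& ff) {
  std::string whole = store.getWhole(ff.claim_key);
  assert(ff.offset + ff.size <= whole.size());
  return whole.substr(ff.offset, ff.size);
}

// Returns the bytes fetched when `count` equally sized clones of one claim of
// `claim_size` bytes are each read once.
std::size_t bytesFetchedForClones(std::size_t claim_size, std::size_t count) {
  ToyContentStore store;
  store.values["claim"] = std::string(claim_size, 'x');
  const std::size_t part = claim_size / count;
  for (std::size_t i = 0; i < count; ++i) {
    readContent(store, ToyFlowFile{"claim", i * part, part});
  }
  return store.bytes_fetched;
}
```

In this model, reading N clones of one claim fetches N times the claim size, 
so the cost grows with the size of the parent rather than with the size of 
each split flow file.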

We should investigate the options to avoid this issue and implement a viable 
solution. The options that could be investigated:
 * Create a version of the clone() function that creates a new resource claim 
instead of referencing part of the original
 * Remove the option to partially clone a flow file and create new flow files 
with the desired content
 * Find a way to read from RocksDB only the part of the value used by the flow 
file, instead of the whole value of the resource claim
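
As a rough illustration of the first option, the sketch below copies the 
requested range into a claim of its own, so that a later read fetches only the 
bytes of the split. All names here (SketchStore, SketchFlowFile, 
cloneWithOwnClaim, readContent) are hypothetical and the store is a plain 
in-memory map; this is not the MiNiFi session API, only a model of the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical in-memory content store (not the MiNiFi or RocksDB API).
struct SketchStore {
  std::map<std::string, std::string> values;
  std::size_t bytes_fetched = 0;  // total bytes returned by get() so far
  std::size_t next_id = 0;        // used to generate unique claim keys

  // Whole-value lookup, counting every byte that is fetched.
  std::string get(const std::string& key) {
    const std::string& value = values.at(key);
    bytes_fetched += value.size();
    return value;
  }

  // Stores content under a freshly generated key and returns that key.
  std::string put(std::string content) {
    std::string key = "claim-" + std::to_string(next_id++);
    values[key] = std::move(content);
    return key;
  }
};

// Hypothetical flow file: a claim key plus a window into that claim.
struct SketchFlowFile {
  std::string claim_key;
  std::size_t offset;
  std::size_t size;
};

// Clone variant that writes the requested range into a new claim instead of
// referencing part of the parent's claim.
SketchFlowFile cloneWithOwnClaim(SketchStore& store,
                                 const std::string& parent_content,
                                 std::size_t offset, std::size_t size) {
  assert(offset + size <= parent_content.size());
  std::string key = store.put(parent_content.substr(offset, size));
  return SketchFlowFile{key, 0, size};
}

// Reads a flow file's content; the fetch cost is the size of its own claim.
std::string readContent(SketchStore& store, const SketchFlowFile& ff) {
  return store.get(ff.claim_key).substr(ff.offset, ff.size);
}
```

The trade-off in this model is that the content of every split is written a 
second time, so reads get cheaper at the cost of extra writes and storage.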

Other processors using the session.clone() mechanism should also be investigated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
