Eric Secules created NIFI-8107:
----------------------------------
Summary: ExtractText Should Search Entire FlowFile Using Streaming
Key: NIFI-8107
URL: https://issues.apache.org/jira/browse/NIFI-8107
Project: Apache NiFi
Issue Type: New Feature
Reporter: Eric Secules
Assignee: Eric Secules
ExtractText should be improved so that it scans the entire content of the
flowfile for matches, in chunks of MAX_BUFFER_SIZE that overlap by
MAX_CAPTURE_GROUP_LENGTH. That way we can do pattern extraction over files of
arbitrary size while keeping memory consumption bounded.
Consider the use case where I want to extract a small pattern of maybe
100 bytes from files that could be 1 MB or 500 MB. Judging from the ExtractText
source code, the processor always allocates a byte array of the maximum buffer
size, so it probably wouldn't be appropriate to set that parameter too high.
The chunks must overlap by the maximum length of the capture group because
a match may straddle two chunks. For the same reason, it's not advisable to
split the flowfile into chunks of MAX_BUFFER_SIZE using existing processors:
a match that crosses a split boundary would be missed.
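To make the proposal concrete, here is a minimal sketch of the overlapping-chunk scan in plain Java. It is not the ExtractText implementation and does not use the NiFi API. The names ChunkedScan, findFirst, bufferSize and overlap are hypothetical stand-ins, with bufferSize and overlap playing the roles of MAX_BUFFER_SIZE and MAX_CAPTURE_GROUP_LENGTH. It also assumes the content can be decoded as ISO-8859-1 so that a chunk boundary never splits a character, and it returns only the first match.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of NiFi.
class ChunkedScan {

    /**
     * Scans the stream in windows of at most bufferSize bytes, where each
     * window starts with the last `overlap` bytes of the previous one.
     * Returns the first match found, or null if there is none.
     */
    static String findFirst(InputStream in, Pattern pattern, int bufferSize, int overlap)
            throws IOException {
        if (overlap < 0 || overlap >= bufferSize) {
            throw new IllegalArgumentException("overlap must be in [0, bufferSize)");
        }
        byte[] buffer = new byte[bufferSize]; // the only large allocation
        int carried = 0;                      // bytes kept from the previous window

        while (true) {
            // Fill the rest of the buffer, or stop early at end of stream.
            int filled = carried;
            boolean eof = false;
            while (filled < bufferSize) {
                int n = in.read(buffer, filled, bufferSize - filled);
                if (n < 0) {
                    eof = true;
                    break;
                }
                filled += n;
            }

            // Assumption: ISO-8859-1 maps every byte to exactly one char.
            String window = new String(buffer, 0, filled, StandardCharsets.ISO_8859_1);
            Matcher m = pattern.matcher(window);
            if (m.find()) {
                return m.group();
            }
            if (eof) {
                return null;
            }

            // Move the tail of this window to the front so that a match
            // straddling the boundary is fully visible in the next window.
            carried = Math.min(overlap, filled);
            System.arraycopy(buffer, filled - carried, buffer, 0, carried);
        }
    }
}
```

With a 16-byte buffer and a 6-byte pattern such as key=\d\d, a match that begins at byte 13 is cut off in the first window but appears whole in the second when overlap is 6; with overlap 0 it is never seen. One caveat of this sketch: a variable-length pattern (for example \d+) can match a truncated prefix at the end of a window, so a real implementation would need to handle that case, and would also need to de-duplicate matches found in the overlap if it collected all matches rather than the first.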
--
This message was sent by Atlassian Jira
(v8.3.4#803005)