[
https://issues.apache.org/jira/browse/NIFI-8107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Secules reassigned NIFI-8107:
----------------------------------
Assignee: (was: Eric Secules)
> ExtractText Should Search Entire FlowFile Using Streaming
> ---------------------------------------------------------
>
> Key: NIFI-8107
> URL: https://issues.apache.org/jira/browse/NIFI-8107
> Project: Apache NiFi
> Issue Type: New Feature
> Reporter: Eric Secules
> Priority: Major
>
> ExtractText should scan the entire content of the flowfile for matches,
> reading it in chunks of MAX_BUFFER_SIZE that overlap by
> MAX_CAPTURE_GROUP_LENGTH. That way pattern extraction works over files of
> arbitrary size while memory consumption stays bounded.
> Consider the use case of extracting a small pattern of perhaps 100 bytes
> from files that could be anywhere from 1MB to 500MB. Judging from the
> ExtractText source code, it always allocates a byte array of the maximum
> buffer size, so setting that parameter very high is probably not
> appropriate. The chunks must overlap by the maximum length of the capture
> group because a match may straddle two chunks. For the same reason, splitting
> the flowfile into chunks of MAX_BUFFER_SIZE with existing processors is not
> advisable.
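
The overlapping-chunk scan described above could be sketched roughly as
follows. This is an illustration in Python, not NiFi code and not how
ExtractText is implemented: the function name scan_stream and the parameters
buffer_size and max_match_len are hypothetical stand-ins for MAX_BUFFER_SIZE
and MAX_CAPTURE_GROUP_LENGTH, and the sketch assumes no match is longer than
max_match_len.

```python
import io
import re
from typing import BinaryIO, Iterator, Pattern, Tuple


def scan_stream(stream: BinaryIO, pattern: Pattern[bytes],
                buffer_size: int, max_match_len: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (absolute_offset, matched_bytes) while holding at most
    buffer_size bytes in memory.

    Hypothetical helper; assumes every match is at most max_match_len bytes.
    """
    if not 0 < max_match_len < buffer_size:
        raise ValueError("need 0 < max_match_len < buffer_size")

    window = b""  # overlap carried from the previous chunk plus new bytes
    base = 0      # absolute offset of window[0] in the stream
    resume = 0    # absolute offset just past the last reported match

    while True:
        chunk = stream.read(buffer_size - len(window))
        eof = len(chunk) == 0  # an empty read is treated as end of stream
        window += chunk

        # A match starting in the last max_match_len bytes may be cut off by
        # the chunk boundary, so it is deferred to the next window unless
        # the stream has ended.
        limit = len(window) if eof else len(window) - max_match_len

        for m in pattern.finditer(window, max(resume - base, 0)):
            if m.start() >= limit:
                break
            yield base + m.start(), m.group(0)
            resume = base + m.end()

        if eof:
            return

        # Keep only the overlap region for the next iteration.
        keep = min(max_match_len, len(window))
        base += len(window) - keep
        window = window[-keep:]


if __name__ == "__main__":
    data = b"a" * 10 + b"ID=12345;" + b"b" * 20 + b"ID=678;"
    for offset, found in scan_stream(io.BytesIO(data), re.compile(rb"ID=\d+;"), 16, 12):
        print(offset, found)
```

Deferring matches that start inside the overlap region, rather than reporting
them and de-duplicating afterwards, keeps each match reported exactly once.
The trade-off is that each window advances by only buffer_size minus
max_match_len bytes, so the overlap should be small relative to the buffer.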