exceptionfactory commented on PR #8691: URL: https://github.com/apache/nifi/pull/8691#issuecomment-2135201964
> > Thanks for the continued work on this Processor @JackHintonSmartDCSIT, and thanks for the thorough review thus far @dan-s1.
> >
> > Reviewing the latest version, there is a serious concern with reading the entire input FlowFile content into memory.
> >
> > I have not evaluated all the details of the `PCAP` class, but it looks like it should be possible to read packets in a streaming fashion, which would avoid significant memory consumption. Although the current approach could work for small files, this needs to be refactored as it will not scale for larger files.
>
> The assumption used in the creation of this processor was that the flowfile to be split would already be present in NiFi in order to be processed, which (one would assume) necessitates that the PCAP file is already present in memory in its entirety. I don't disagree with your assessment, but I would like to clarify: is there a way of streaming flowfiles between processors in such a way that they aren't required to be held entirely in memory? If so, can you point me to a processor that uses this mechanism or some documentation of it so I can adopt it here? Thanks!

Thanks for the reply @JackHintonSmartDCSIT.

FlowFile content in a NiFi queue does not use heap memory, and instead uses a persistent repository. This allows NiFi to process very large files with relatively small amounts of memory in many cases. The vast majority of NiFi processors do not read the entire FlowFile content into memory, and those that do are exceptional cases.

The `SplitRecord` Processor is one example of a Processor that reads an InputStream, consisting of an unknown number of Records, and then writes a segmented number of Records to one or more output FlowFiles. Some of the other Split Processors follow a similar pattern.

Based on the code as it stands, it looks like it will be necessary to refactor the PCAP reading to emit a stream of packets.
The `java.util.Iterator` interface is one general way to model this strategy, but the details depend on consuming bytes from an InputStream, then writing packets to a FlowFile OutputStream, which avoids retaining a large number of objects in memory. I realize this will take some work to refactor, but reviewing some of the existing Split Processors should provide some helpful background examples.
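
As a rough illustration of the `Iterator` approach described above, the following is a minimal sketch and not the PR's implementation. The class name `PcapRecordIterator` and the method `writeSplit` are hypothetical, and the sketch assumes the classic PCAP layout (24-byte global header, 16-byte record headers, microsecond magic number `0xA1B2C3D4`). It uses only plain `java.io` streams. In a Processor, the input and output streams would presumably come from the FlowFile read and write calls on the session, which is not shown here.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper, not part of NiFi or the PR: yields one raw PCAP record
// (16-byte record header plus captured bytes) at a time from an InputStream.
final class PcapRecordIterator implements Iterator<byte[]> {
    private static final int GLOBAL_HEADER_LENGTH = 24;
    private static final int RECORD_HEADER_LENGTH = 16;

    private final InputStream in;
    private final byte[] globalHeader = new byte[GLOBAL_HEADER_LENGTH];
    private final ByteOrder order;
    private byte[] nextRecord; // only one record is held in memory at a time

    PcapRecordIterator(InputStream in) throws IOException {
        this.in = in;
        if (readUpTo(globalHeader, 0, GLOBAL_HEADER_LENGTH) < GLOBAL_HEADER_LENGTH) {
            throw new EOFException("Truncated PCAP global header");
        }
        // ByteBuffer is big-endian by default; the magic number reveals the writer's byte order.
        int magic = ByteBuffer.wrap(globalHeader).getInt();
        if (magic == 0xA1B2C3D4) {
            order = ByteOrder.BIG_ENDIAN;
        } else if (magic == 0xD4C3B2A1) {
            order = ByteOrder.LITTLE_ENDIAN;
        } else {
            throw new IOException("Unsupported PCAP magic number");
        }
        nextRecord = readRecord();
    }

    @Override
    public boolean hasNext() {
        return nextRecord != null;
    }

    @Override
    public byte[] next() {
        if (nextRecord == null) {
            throw new NoSuchElementException();
        }
        byte[] current = nextRecord;
        try {
            nextRecord = readRecord();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return current;
    }

    // Writes the global header plus up to maxRecords records; returns the number of records written.
    int writeSplit(OutputStream out, int maxRecords) throws IOException {
        out.write(globalHeader);
        int written = 0;
        while (written < maxRecords && hasNext()) {
            out.write(next());
            written++;
        }
        return written;
    }

    // Returns null at a clean end of stream, otherwise the record header plus packet bytes.
    private byte[] readRecord() throws IOException {
        byte[] header = new byte[RECORD_HEADER_LENGTH];
        int read = readUpTo(header, 0, RECORD_HEADER_LENGTH);
        if (read == 0) {
            return null;
        }
        if (read < RECORD_HEADER_LENGTH) {
            throw new EOFException("Truncated record header");
        }
        // The included length is the third 32-bit field of the record header (offset 8).
        int includedLength = ByteBuffer.wrap(header).order(order).getInt(8);
        if (includedLength < 0) {
            throw new IOException("Invalid record length");
        }
        byte[] record = Arrays.copyOf(header, RECORD_HEADER_LENGTH + includedLength);
        if (readUpTo(record, RECORD_HEADER_LENGTH, includedLength) < includedLength) {
            throw new EOFException("Truncated packet data");
        }
        return record;
    }

    // Reads until len bytes are filled or the stream ends; returns the number of bytes read.
    private int readUpTo(byte[] buffer, int offset, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buffer, offset + total, len - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

Calling `writeSplit` repeatedly, once per output stream, produces segments that each start with a copy of the global header, so heap usage is bounded by the largest single packet rather than by the size of the capture. A real implementation would likely also validate the included length against the snapshot length in the global header and handle the nanosecond magic number.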
