markap14 commented on pull request #5324: URL: https://github.com/apache/nifi/pull/5324#issuecomment-1031949097
@markobean I'm not sure that I agree that this is a great idea. Having worked with many, many NiFi users, probably the single biggest issue that I encounter is users putting large values into attributes. Attributes should always be small. Think 100-200 characters or less, generally. They should store things like `filename = my-file.txt` and `documentId = 27830278`, etc. Tiny key-value pairs. They should never contain large amounts of content like XML or JSON.

To fully understand why, consider how attributes are handled.

- Firstly, they are stored in HashMaps on the FlowFile and are always stored in memory (unless swapped out in a queue). With large values, this very quickly results in massive amounts of garbage collection and OutOfMemoryErrors.
- They also have to be written out to the FlowFile Repository when any attribute changes. If you change the filename, for instance, the next update to the FlowFile Repository writes out *all* attributes. Now, if we have an 8 KB XML message in an attribute, we don't write 200 bytes, we write 8 KB. Potentially every time the FlowFile moves from one queue to the next.
- We also have to write the attributes for every single Provenance event. So we have doubled the expense just mentioned.
- If the data is swapped out, we get to remove it from heap. But that happens by default at 20,000 FlowFiles and only in chunks of 10,000 FlowFiles. So we need to buffer about 30,000 FlowFiles *per FlowFile Queue*!
- And if we swap it out, we then get to drop those attributes from the heap, but we have to write out that big XML chunk *again*. And we have to read it in and deserialize it again when swapping back in.

Performance will suffer greatly. But out of all of these concerns, the greatest, by far, is the heap exhaustion and OutOfMemoryErrors that get generated. When a user buffers a bunch of data into FlowFile attributes, they can significantly affect all users of the system.
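
To put rough numbers on the points above, here is a back-of-the-envelope sketch in Python. It uses only the figures from this comment (200 bytes vs. 8 KB of attributes, the 20,000 swap threshold, the 10,000 swap chunk). The function names and the 10-queue flow are hypothetical, chosen purely for illustration, and the math counts raw attribute bytes only. It ignores JVM object and HashMap overhead, so actual heap usage will differ and is probably higher.

```python
# Figures taken from the comment above; everything else is illustrative.
SWAP_THRESHOLD = 20_000        # default queue size at which swapping starts
SWAP_CHUNK = 10_000            # FlowFiles swapped out per chunk
SMALL_ATTRS_BYTES = 200        # typical small attribute set
LARGE_ATTRS_BYTES = 8 * 1024   # an 8 KB XML message stored as an attribute


def heap_bytes_per_queue(attr_bytes, flowfiles=SWAP_THRESHOLD + SWAP_CHUNK):
    """Raw attribute bytes held on heap for one queue buffering ~30,000 FlowFiles."""
    return attr_bytes * flowfiles


def bytes_written(attr_bytes, queue_hops, provenance=True):
    """Raw attribute bytes written while one FlowFile crosses `queue_hops` queues.

    Assumes every hop rewrites all attributes to the FlowFile Repository,
    and once more for the Provenance event when `provenance` is True.
    """
    writes_per_hop = 2 if provenance else 1
    return attr_bytes * queue_hops * writes_per_hop


if __name__ == "__main__":
    mib = 1024 * 1024
    for label, size in (("small", SMALL_ATTRS_BYTES), ("large", LARGE_ATTRS_BYTES)):
        heap = heap_bytes_per_queue(size) / mib
        written = bytes_written(size, queue_hops=10)
        print(f"{label}: {heap:.1f} MiB on heap per queue, "
              f"{written:,} bytes written over 10 hops")
```

Under these assumptions, small attributes cost roughly 6 MB of heap per full queue, while the 8 KB attribute costs roughly 234 MiB per queue, before any object overhead and before multiplying by the number of queues in the flow.
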
Once you buffer so much that you start throwing OutOfMemoryErrors, you're now into an area where you can potentially even start to affect the stability of the cluster because nodes are constantly performing garbage collection, etc. Because of this, we should very much discourage putting XML, JSON, or any other large chunks of content into FlowFile attributes. So I have trouble supporting any update that further encourages this anti-pattern.
