[GitHub] [nifi] markap14 commented on pull request #5324: NIFI-9072: improvements to ValidateXML including validate XML in attr…

GitBox Mon, 07 Feb 2022 13:35:13 -0800


markap14 commented on pull request #5324:
URL: https://github.com/apache/nifi/pull/5324#issuecomment-1031949097



   @markobean I'm not sure that I agree that this is a great idea. Having 
worked with many, many NiFi users, probably the single biggest issue that I 
encounter is users putting large values into attributes. Attributes should 
always be small. Think 100-200 characters or less, generally. They should store 
things like `filename = my-file.txt` and `documentId = 27830278`, etc. Tiny 
key-value pairs. They should never contain large amounts of content like XML 
or JSON.
   
   To fully understand why, consider how attributes are handled.
   - Firstly, they are stored in HashMaps on the FlowFile and are always stored 
in memory (unless swapped out in a queue). With large attribute values, this 
very quickly results in massive amounts of garbage collection and 
OutOfMemoryErrors.
   - They also have to be written out to the FlowFile Repository when any 
attribute changes. If you change the filename, for instance, the next update to 
the FlowFile repository writes out *all* attributes. Now, if we have an 8 KB 
XML message in an attribute, we don't write 200 bytes, we write 8 KB, 
potentially every time the FlowFile moves from one queue to the next.
   - We also have to write the attributes for every single Provenance event. So 
we have doubled the expense just mentioned.
   - If the data is swapped out, we get to remove it from heap. But by default 
swapping only starts once a queue holds 20,000 FlowFiles, and it only happens 
in chunks of 10,000 FlowFiles. So we need to buffer about 30,000 FlowFiles on 
heap *per FlowFile Queue*!
   - And if we swap it out, we then get to drop those attributes from the heap, 
but we have to write out that big XML chunk *again*. And we have to read it in 
and deserialize it again when swapping back in. Performance will suffer greatly.
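
   To put rough numbers on the points above, here is a back-of-the-envelope 
sketch. It uses the figures already mentioned (200 bytes vs. 8 KB of 
attributes, about 30,000 FlowFiles buffered per queue). The class and method 
names, the 10-hop flow, and the "one FlowFile Repository update plus one 
Provenance event per hop" model are assumptions for illustration only, and it 
counts raw attribute bytes, ignoring Java object and HashMap overhead, so real 
heap usage would be higher.

```java
public class AttributeCostSketch {

    // Raw attribute bytes held on heap for a single queue.
    // Ignores String/HashMap overhead (assumption: lower bound only).
    public static long heapBytes(long flowFilesOnHeap, long attributeBytesPerFlowFile) {
        return flowFilesOnHeap * attributeBytesPerFlowFile;
    }

    // Attribute bytes written for one FlowFile over a number of hops,
    // assuming each hop costs one FlowFile Repository update and one
    // Provenance event, and each of those writes out all attributes.
    public static long repositoryBytesWritten(long attributeBytesPerFlowFile, int hops) {
        return 2L * attributeBytesPerFlowFile * hops;
    }

    public static void main(String[] args) {
        long smallAttributes = 200;       // typical small attributes
        long largeAttributes = 8 * 1024;  // 8 KB XML message in an attribute
        long buffered = 30_000;           // FlowFiles on heap per queue
        int hops = 10;                    // hypothetical number of queue hops

        System.out.println("Heap per queue, small: "
                + heapBytes(buffered, smallAttributes) + " bytes");
        System.out.println("Heap per queue, large: "
                + heapBytes(buffered, largeAttributes) + " bytes");
        System.out.println("Written per FlowFile, small: "
                + repositoryBytesWritten(smallAttributes, hops) + " bytes");
        System.out.println("Written per FlowFile, large: "
                + repositoryBytesWritten(largeAttributes, hops) + " bytes");
    }
}
```

   Under these assumptions a single full queue holds about 6 MB of attribute 
data in the small case and roughly 245 MB in the 8 KB case, and that is per 
queue, before any object overhead.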
   
   But out of all of these concerns, the greatest, by far, is the heap 
exhaustion and OutOfMemoryErrors that get generated. When a user buffers a 
bunch of data into FlowFile attributes, they can significantly affect all users 
of the system. Once you buffer so much that you start throwing 
OutOfMemoryErrors, you're now into an area where you can potentially even start 
to affect the stability of the cluster because nodes are constantly performing 
garbage collection, etc.
   
   Because of this, we should very much discourage putting XML, JSON, or any 
other large chunks of content into FlowFile attributes. So I have trouble 
supporting any update that further encourages this anti-pattern.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
